Hypothesis testing for a proportion

The one-sample z-test for a proportion evaluates whether a population proportion equals a specific hypothesized value. It relies on the normal approximation to the binomial, which requires checking that the sample is large enough before applying the test.

Hypotheses

Test | \(H_0\) | \(H_1\)
Two-sided | \(p = p_0\) | \(p \neq p_0\)
One-sided right | \(p = p_0\) | \(p > p_0\)
One-sided left | \(p = p_0\) | \(p < p_0\)

Test statistic

Given \(x\) successes in \(n\) independent trials, the sample proportion is \(\hat{p} = x/n\). The test statistic is:

\[Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\]

Under \(H_0\), \(Z\) approximately follows a standard normal distribution when \(n\) is large. Note that the denominator uses \(p_0\), not \(\hat{p}\): the standard error is computed under the null hypothesis, not from the observed proportion.
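The formula above can be sketched as a small R function. This is a minimal illustration, not a replacement for R's built-in tests: the name `prop_z_test` and its arguments are hypothetical, and the call reuses the counts from Example 1 later in this section (22 defects in 200 units, \(p_0 = 0.08\)).

```r
# Hypothetical helper: one-sample z-test for a proportion.
prop_z_test <- function(x, n, p0,
                        alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  p_hat <- x / n
  se0 <- sqrt(p0 * (1 - p0) / n)        # standard error under H0 (uses p0)
  z <- (p_hat - p0) / se0
  p_value <- switch(alternative,
    two.sided = 2 * pnorm(-abs(z)),            # both tails
    greater   = pnorm(z, lower.tail = FALSE),  # right tail
    less      = pnorm(z)                       # left tail
  )
  list(p_hat = p_hat, z = z, p_value = p_value)
}

# Counts from Example 1: 22 defects in 200 units, p0 = 0.08
res <- prop_z_test(x = 22, n = 200, p0 = 0.08, alternative = "two.sided")
print(res$z)
print(res$p_value)

# The same data under each alternative hypothesis
p_by_alt <- sapply(c("two.sided", "greater", "less"),
                   function(a) prop_z_test(22, 200, 0.08, a)$p_value)
print(p_by_alt)
```

The one-sided p-values for "greater" and "less" sum to 1, and the two-sided p-value is twice the smaller of the two.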

The normal approximation is considered adequate when both conditions hold:

\[n p_0 \geq 10 \quad \text{and} \quad n(1-p_0) \geq 10\]
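The condition is easy to check in code before running the test. The helper below is a hypothetical sketch (the name `normal_approx_ok` and the `threshold` argument are assumptions); the second call uses an illustrative sample size of 50 that does not come from the examples in this section.

```r
# Hypothetical helper: check the sample-size condition under H0.
normal_approx_ok <- function(n, p0, threshold = 10) {
  expected_successes <- n * p0
  expected_failures  <- n * (1 - p0)
  expected_successes >= threshold && expected_failures >= threshold
}

# n = 200, p0 = 0.08: expected counts are 16 and 184
ok_large <- normal_approx_ok(n = 200, p0 = 0.08)

# Illustrative small sample: n = 50, p0 = 0.08 gives only 4 expected successes
ok_small <- normal_approx_ok(n = 50, p0 = 0.08)

print(c(large = ok_large, small = ok_small))
```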

⚠️ When the normal approximation fails, use exact methods

When \(np_0 < 10\) or \(n(1-p_0) < 10\) (rare events, small samples, or extreme proportions), the binomial distribution is skewed and the normal approximation is unreliable. In those cases:

  • Use the exact binomial test: binom.test() in R. It computes the exact p-value from the binomial distribution.
  • For confidence intervals, use the Wilson score interval rather than the Wald interval.
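To illustrate both recommendations, the sketch below runs the exact test and compares the two intervals on a small sample. The data (2 successes in 30 trials, \(p_0 = 0.08\)) are hypothetical, as are the function names `wald_ci` and `wilson_ci`; the formulas are the standard Wald and Wilson score intervals.

```r
# Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
wald_ci <- function(x, n, conf_level = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - conf_level) / 2)
  half <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - half, upper = p_hat + half)
}

# Wilson score interval
wilson_ci <- function(x, n, conf_level = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - conf_level) / 2)
  denom <- 1 + z^2 / n
  centre <- (p_hat + z^2 / (2 * n)) / denom
  half <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom
  c(lower = centre - half, upper = centre + half)
}

# Hypothetical small sample: 2 successes in 30 trials (n * p0 = 2.4 < 10)
exact <- binom.test(x = 2, n = 30, p = 0.08, alternative = "two.sided")
print(exact$p.value)

wald <- wald_ci(2, 30)      # lower limit falls below 0
wilson <- wilson_ci(2, 30)  # stays inside [0, 1]
print(rbind(wald, wilson))
```

With so few successes the Wald interval extends below zero, which is impossible for a proportion, while the Wilson interval stays within the valid range.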

This section uses the threshold \(np_0 \geq 10\) and \(n(1-p_0) \geq 10\), which is more conservative than the \(\geq 5\) rule found in some older textbooks. Both are rules of thumb rather than sharp boundaries.

Examples

Example 1: defect rate in manufacturing (two-sided)

A production line has a historical defect rate of \(p_0 = 0.08\). After a process change, a quality engineer inspects 200 units and finds 22 defective (\(\hat{p} = 0.110\)). Has the defect rate changed?

Check conditions: \(np_0 = 200 \times 0.08 = 16 \geq 10\) and \(n(1-p_0) = 184 \geq 10\). Normal approximation is valid.

Hypotheses: \(H_0: p = 0.08\) vs \(H_1: p \neq 0.08\).

Test statistic:

\[Z = \frac{0.110 - 0.08}{\sqrt{0.08 \times 0.92 / 200}} = \frac{0.030}{\sqrt{0.000368}} = \frac{0.030}{0.01918} \approx 1.564\]

p-value (two-sided):

\[\text{p-value} = 2 \times P(Z \geq 1.564) \approx 2 \times 0.059 = 0.118\]

Decision: the p-value \(0.118 > 0.05\), so we fail to reject \(H_0\).
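The same decision can be reached by comparing \(Z\) with the critical values of the standard normal distribution instead of computing a p-value. The following sketch assumes a significance level of 0.05 (the level used in the decision above); the variable names are illustrative.

```r
n <- 200; x <- 22; p0 <- 0.08
alpha <- 0.05

p_hat <- x / n                                    # 0.11
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Two-sided rejection region: |Z| > z_crit, with alpha / 2 in each tail
z_crit <- qnorm(1 - alpha / 2)
reject_by_critical_value <- abs(z) > z_crit

# Equivalent decision via the p-value
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
reject_by_p_value <- p_value < alpha

print(c(z = z, z_crit = z_crit, p_value = p_value))
print(c(critical_value = reject_by_critical_value, p_value = reject_by_p_value))
```

Both rules always agree, because \(|Z| > z_{\alpha/2}\) holds exactly when the two-sided p-value is below \(\alpha\).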

The observed increase from 8% to 11% defects is not statistically significant at the 5% level. However, the sample size may be insufficient to detect a change of this magnitude reliably. Power analysis would be warranted before concluding no change occurred.
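A rough power calculation makes this concern concrete. The sketch below uses the normal approximation and assumes, purely for illustration, that the true defect rate after the change is 0.11; the function name `power_prop_z` and the grid of sample sizes are hypothetical.

```r
# Approximate power of the two-sided one-sample z-test for a proportion.
# p1 is the assumed true proportion (an assumption, not a known value).
power_prop_z <- function(p0, p1, n, alpha = 0.05) {
  se0 <- sqrt(p0 * (1 - p0) / n)   # SE under H0: defines the rejection region
  se1 <- sqrt(p1 * (1 - p1) / n)   # SE under the assumed true proportion
  z_crit <- qnorm(1 - alpha / 2)
  upper <- p0 + z_crit * se0       # reject if p_hat is above this value
  lower <- p0 - z_crit * se0       # or below this value
  pnorm(upper, mean = p1, sd = se1, lower.tail = FALSE) +
    pnorm(lower, mean = p1, sd = se1)
}

# Power at the sample size used in Example 1
power_200 <- power_prop_z(p0 = 0.08, p1 = 0.11, n = 200)
print(power_200)

# Power across a few candidate sample sizes
sizes <- c(200, 500, 1000, 2000)
power_by_n <- sapply(sizes, function(n) power_prop_z(0.08, 0.11, n))
print(data.frame(n = sizes, power = round(power_by_n, 3)))
```

Under these assumptions the power at \(n = 200\) is well below the conventional 80% target, so a non-significant result is weak evidence that the defect rate is unchanged.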

Figure: Standard normal distribution with two-sided rejection regions showing the test statistic for the defect rate example.

Example 2: conversion rate improvement (one-sided right)

An e-commerce platform’s current checkout conversion rate is \(p_0 = 0.32\). After a redesign, 148 out of 400 users complete a purchase (\(\hat{p} = 0.370\)). Is there evidence the redesign improved the conversion rate?

Check conditions: \(np_0 = 400 \times 0.32 = 128 \geq 10\) and \(n(1-p_0) = 272 \geq 10\). Normal approximation is valid.

Hypotheses: \(H_0: p = 0.32\) vs \(H_1: p > 0.32\).

Test statistic:

\[Z = \frac{0.370 - 0.32}{\sqrt{0.32 \times 0.68 / 400}} = \frac{0.050}{\sqrt{0.000544}} = \frac{0.050}{0.02333} \approx 2.144\]

p-value (one-sided right):

\[\text{p-value} = P(Z \geq 2.144) \approx 0.016\]

Decision: the p-value \(0.016 < 0.05\), so we reject \(H_0\).

There is significant evidence at the 5% level that the redesign improved the conversion rate. The conversion rate increased by approximately 5 percentage points, a relative improvement of about 16%.
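The numbers in this example, including the absolute and relative improvement, can be reproduced with a few lines of R. This is a direct transcription of the arithmetic above; the variable names are illustrative.

```r
n <- 400; x <- 148; p0 <- 0.32

p_hat <- x / n                                   # 0.37
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# One-sided right alternative: only the upper tail counts
p_value <- pnorm(z, lower.tail = FALSE)

# Effect size: absolute change (proportion scale) and relative change
abs_change <- p_hat - p0
rel_change <- abs_change / p0

print(c(z = z, p_value = p_value))
print(c(abs_change = abs_change, rel_change = rel_change))
```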

Figure: Standard normal distribution with right rejection region showing the test statistic for the conversion rate example.

Running the test in R

# Example 1: two-sided
prop.test(x = 22, n = 200, p = 0.08, alternative = "two.sided", correct = FALSE)

# Example 2: one-sided right
prop.test(x = 148, n = 400, p = 0.32, alternative = "greater", correct = FALSE)

# Exact binomial test (for small samples)
binom.test(x = 22, n = 200, p = 0.08, alternative = "two.sided")

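prop.test() reports a chi-squared statistic (X-squared) with 1 degree of freedom. For a single proportion this statistic equals \(Z^2\), so the p-value matches the z-test. correct = FALSE disables the Yates continuity correction, which R applies by default; without this argument the output would not match the hand calculations above. The correction has little effect with large samples.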
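The relationship between the two statistics can be checked directly. The sketch below uses the Example 1 data and compares the `statistic` and `p.value` components returned by prop.test() with the hand-computed z-test.

```r
# Example 1 via prop.test(), without continuity correction
res <- prop.test(x = 22, n = 200, p = 0.08,
                 alternative = "two.sided", correct = FALSE)

# Hand-computed z statistic for the same data
z <- (22 / 200 - 0.08) / sqrt(0.08 * (1 - 0.08) / 200)

chi_sq <- unname(res$statistic)        # X-squared reported by prop.test()
print(c(chi_sq = chi_sq, z_squared = z^2))

# Two-sided p-values from both routes
p_from_z <- 2 * pnorm(-abs(z))
print(c(prop_test = res$p.value, z_test = p_from_z))
```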

💡 Interpreting the result

A significant result means the observed data would be unlikely if \(H_0: p = p_0\) were true. Always report the effect size alongside the p-value: the difference \(\hat{p} - p_0\) and its confidence interval convey both statistical significance and practical importance. In Example 1, a change from 8% to 11% defects is not statistically significant with \(n=200\), yet it may be practically important; the same observed proportion with \(n=2000\) would give \(Z \approx 4.95\) and be highly significant.
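The effect of sample size can be sketched by holding the observed proportion at 11% and changing only \(n\). The helper name `summarise_prop` is hypothetical, and the \(n = 2000\) row (220 defects) is a what-if scenario, not observed data. The interval for the difference is obtained by subtracting \(p_0\) from the confidence interval that prop.test() returns, which is a Wilson score interval when correct = FALSE.

```r
# Hypothetical helper: effect size, its confidence interval, and p-value
summarise_prop <- function(x, n, p0) {
  res <- prop.test(x = x, n = n, p = p0, correct = FALSE)
  ci_diff <- as.numeric(res$conf.int) - p0   # interval for p - p0
  c(n = n,
    diff = x / n - p0,
    ci_lower = ci_diff[1],
    ci_upper = ci_diff[2],
    p_value = res$p.value)
}

out <- rbind(
  summarise_prop(x = 22,  n = 200,  p0 = 0.08),  # Example 1 data
  summarise_prop(x = 220, n = 2000, p0 = 0.08)   # what-if: same rate, 10x the sample
)
print(out)
```

The estimated difference is the same in both rows; only the width of the interval, and therefore the p-value, changes with \(n\).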