Stratified Sampling:
Sample Size and DEFF

PUBHBIO 7225 Lecture 9

Outline

Topics

  • Sample Size Estimation for Stratified Samples
  • Design Effect
  • Group Project Work (time permitting)

Activities

  • 9.1 Stratification: Sample Size and Design Effects


Assignments

  • Quiz 2 due Thursday 9/25/2025 11:59pm via Carmen
  • Group Progress Report due Thursday 9/25/2025 11:59pm via Carmen (only one group member needs to upload this)

Sample Size Estimation for a Stratified Sample

  • Want to find necessary overall sample size \(n\) that will produce a given MOE \((e)\)

\[\text{CI: } \bar{y}_{str}\pm \underbrace{z_{\alpha/2} \sqrt{V(\bar{y}_{str})}}_{e = \text{half-width of CI}}\]

  • Just as with SRS, the formula for the necessary sample size \(n\) is found by plugging in the formula for the variance and solving for \(n\).

  • But, you have to pick an allocation method first. Why?

  • Because \(V(\bar{y}_{str})\) depends on the \(\textcolor{blue}{n_h}\): \(V(\bar{y}_{str}) = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left( 1 - \frac{\textcolor{blue}{n_h}}{N_h}\right) \frac{S_h^2}{\textcolor{blue}{n_h}}\)

    • Equal allocation: \(n_h = \frac{n}{H}\)

    • Proportional allocation: \(n_h = n \frac{N_h}{N}\)

    • Neyman allocation: \(n_h = n \left( \frac{N_h S_h}{\sum_{l =1}^H N_l S_l} \right)\)
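The three allocation rules above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original slides: the function name `allocate`, the total `n = 1000`, and the use of the two-stratum population from this lecture's example are all assumptions made for demonstration.

```python
import math

def allocate(n, N_h, S_h, method="proportional"):
    """Split a total sample size n across strata; returns unrounded n_h."""
    H = len(N_h)
    if method == "equal":
        # n_h = n / H
        return [n / H for _ in N_h]
    if method == "proportional":
        # n_h = n * N_h / N
        N = sum(N_h)
        return [n * Nh / N for Nh in N_h]
    if method == "neyman":
        # n_h = n * N_h S_h / sum_l(N_l S_l)
        weights = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
        total = sum(weights)
        return [n * w / total for w in weights]
    raise ValueError(f"unknown method: {method}")

# Hypothetical inputs for illustration only
N_h = [750_000, 250_000]                   # stratum population sizes
S_h = [math.sqrt(0.21), math.sqrt(0.24)]   # stratum standard deviations
n = 1000                                   # assumed total sample size

equal = allocate(n, N_h, S_h, "equal")
prop = allocate(n, N_h, S_h, "proportional")
neyman = allocate(n, N_h, S_h, "neyman")
print(equal, prop, neyman)
```

Because every rule needs the total \(n\) as an input, the variance (and hence the sample size formula) differs by allocation method.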

Sample Size Estimation for Proportional Allocation

  • Simplified version of \(V(\bar{y}_{str})\) under proportional allocation (obtained by plugging in \(n_h = n \frac{N_h}{N}\)): \[V(\bar{y}_{str}) = \sum_{h=1}^H \frac{N_h}{N}\left(1-\frac{n}{N}\right)\frac{S_h^2}{n}\]

  • Plugging this into the general formula for MOE (previous slide): \[e = z_{\alpha/2} \sqrt{V(\bar{y}_{str})} = z_{\alpha/2} \sqrt{\sum_{h=1}^H \frac{N_h}{N}\left(1-\frac{n}{N}\right)\frac{S_h^2}{n}}\]

  • Square both sides: \[\begin{aligned} e^2 &{}= z_{\alpha/2}^2 \left[{\sum_{h=1}^H \frac{N_h}{N}\left(1-\frac{n}{N}\right)\frac{S_h^2}{n}}\right] \end{aligned}\]

Sample Size Estimation for Proportional Allocation (con’t)

  • Do some painful algebraic manipulation: \[\begin{aligned} e^2 &= z_{\alpha/2}^2 \left[ \sum_{h=1}^H \frac{N_h}{N}\left(1-\frac{n}{N}\right)\frac{S_h^2}{n} \right] = z_{\alpha/2}^2 \left[ \sum_{h=1}^H \frac{N_h}{N} \frac{S_h^2}{n} - \sum_{h=1}^H \frac{N_h}{N} \frac{n}{N} \frac{S_h^2}{n} \right] \qquad \text{\small (expand the parentheses)} \\ e^2 &= z_{\alpha/2}^2 \left[\frac{1}{n} \sum_{h=1}^H \frac{N_h}{N} S_h^2 - \frac{1}{N} \sum_{h=1}^H \frac{N_h}{N} S_h^2 \right] \qquad \text{\small (pulling $n$ and $N$ outside summations)} \\ e^2 &= z_{\alpha/2}^2 \left(\frac{S^{\ast 2}}{n} - \frac{S^{\ast 2}}{N} \right)\quad \textcolor{red}{\text{\small setting } S^{\ast 2} = \sum_{h=1}^H \frac{N_h}{N} S_h^2 = \text{\small weighted average of stratum variances}} \\ & \ldots \text{ algebra to get $n$ on left side}\ldots \\ n &= \frac{z_{\alpha/2}^2 S^{\ast 2}}{e^2 + \dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}} = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h^2\right)}{e^2 + \dfrac{z_{\alpha/2}^2}{N} \left(\sum_{h=1}^H \frac{N_h}{N}S_h^2 \right)} \end{aligned}\]

Similar to SRS formula – with \(S^{\ast 2}\) instead of \(S^2\)
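The elided algebra is short. One way to fill it in, starting from the line that defines \(S^{\ast 2}\) and introducing no new symbols:

\[\begin{aligned} e^2 &= z_{\alpha/2}^2 \left(\frac{S^{\ast 2}}{n} - \frac{S^{\ast 2}}{N}\right) \\ e^2 + \frac{z_{\alpha/2}^2 S^{\ast 2}}{N} &= \frac{z_{\alpha/2}^2 S^{\ast 2}}{n} \qquad \text{\small (move the $N$ term to the left side)} \\ n &= \frac{z_{\alpha/2}^2 S^{\ast 2}}{e^2 + \dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}} \qquad \text{\small (cross-multiply and divide)} \end{aligned}\]

As \(N\) grows large the second term in the denominator vanishes, leaving \(n = z_{\alpha/2}^2 S^{\ast 2} / e^2\).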

Example

  • In Activity 7.1 we had a population consisting of two strata:

| Stratum | Definition | \(\mathbf{N_h}\) | True Proportion | \(\mathbf{V(y)}\) in Stratum |
|---|---|---|---|---|
| 1 | Age 18-64 | 750,000 | \(p_1=0.3\) | \(S_1^2 = 0.21\) |
| 2 | Age 65+ | 250,000 | \(p_2=0.6\) | \(S_2^2 = 0.24\) |

  • We want to use proportional allocation to achieve an MOE of 1% for a 95% CI

  • How large a sample do we need?

  • Necessary pieces:

    • \(e = 0.01\) (margin of error of 1%)

    • \(z_{\alpha/2} = 1.96\) (b/c 95% CI)

    • \(N\) = 1,000,000 and \(\frac{N_1}{N} = 0.75\) and \(\frac{N_2}{N}=0.25\)

    • \(\displaystyle S^{\ast 2} = \sum_{h=1}^H \frac{N_h}{N} S_h^2 = \frac{N_1}{N} S_1^2 + \frac{N_2}{N} S_2^2 = 0.75(0.21) + 0.25(0.24) = 0.2175\)
      (weighted average of the stratum-level variances)

Example (con’t)

  • Plug in the pieces: \[\begin{aligned} n &= \frac{z_{\alpha/2}^2 S^{\ast 2}}{e^2 + \dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}} = \frac{1.96^2 (0.2175)}{0.01^2 + \dfrac{1.96^2 (0.2175)}{1000000}} = \frac{0.835548}{0.0001+0.000000835548} = \frac{0.835548}{0.0001008355} \\ &= \mathbf{8286.2} \end{aligned}\]

    • Ignoring the second term in the denominator (i.e., ignoring the FPC) gives \(n\) = 8355.5

  • Splitting this total \(n\) between the two strata:

    • Stratum 1: \(n_1 = n \times \frac{N_1}{N} = 8286.2 \times 0.75 = 6214.7 \rightarrow n_1 = 6215\)

    • Stratum 2: \(n_2 = n \times \frac{N_2}{N} = 8286.2 \times 0.25 = 2071.6 \rightarrow n_2 = 2072\)

  • Note that both \(n_h\) were rounded up, which makes the total sample size \(n = 6215 + 2072 = \mathbf{8287}\)

    • Would have to check the budget to be sure this was manageable.
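The proportional-allocation example can be checked numerically. The sketch below is an illustration rather than course-provided code; the function name `n_proportional` and the choice to round each stratum up with `math.ceil` are assumptions that mirror the rounding described above.

```python
import math

def n_proportional(e, z, N, W_h, S2_h):
    """Total n for a stratified sample with proportional allocation.

    W_h are the stratum shares N_h/N; S2_h are the stratum variances.
    """
    s_star2 = sum(W * S2 for W, S2 in zip(W_h, S2_h))  # weighted avg of variances
    num = z**2 * s_star2
    return num / (e**2 + num / N)                      # includes the FPC term

N = 1_000_000
W_h = [0.75, 0.25]
S2_h = [0.21, 0.24]

n = n_proportional(e=0.01, z=1.96, N=N, W_h=W_h, S2_h=S2_h)
n_h = [math.ceil(n * W) for W in W_h]   # round each stratum up
print(round(n, 1), n_h, sum(n_h))
```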

Sample Size Estimation for Neyman Allocation

  • Same type of calculation, using the simplified version of \(V(\bar{y}_{str})\) under Neyman allocation: \[V(\bar{y}_{str}) = \frac{1}{n} \left(\sum_{h=1}^H \frac{N_h}{N} S_h\right)^2 - \frac{1}{N} \sum_{h=1}^H \frac{N_h}{N} S_h^2\]

  • Again, plugging into the MOE formula and solving for \(n\):

    \[\begin{aligned} e &= z_{\alpha/2} \sqrt{V(\bar{y}_{str})} = z_{\alpha/2} \sqrt{\frac{1}{n} \left(\sum_{h=1}^H \frac{N_h}{N} S_h\right)^2 - \frac{1}{N} \sum_{h=1}^H \frac{N_h}{N} S_h^2} \\ & \text{...square both sides, do a bunch of algebra ...} \\ n & = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2 + \dfrac{z_{\alpha/2}^2}{N}\left(\sum_{h=1}^H \frac{N_h}{N}S_h^2 \right)} = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2 + {\dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}}} \end{aligned}\]
  • Warning: Note the different pieces containing \(S_h\) in the numerator and denominator
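For readers who want the skipped steps, here is one way to fill them in. The shorthand \(W_h = N_h/N\) and \(A = \sum_{h=1}^H W_h S_h\) is introduced here only for compactness and is not notation from the slides. Substituting \(n_h = n W_h S_h / A\) into the general variance formula:

\[\begin{aligned} V(\bar{y}_{str}) &= \sum_{h=1}^H W_h^2 \frac{S_h^2}{n_h} - \sum_{h=1}^H W_h^2 \frac{S_h^2}{N_h} \\ &= \sum_{h=1}^H W_h^2 S_h^2 \cdot \frac{A}{n W_h S_h} - \frac{1}{N}\sum_{h=1}^H W_h S_h^2 \qquad \text{\small (using } W_h / N_h = 1/N \text{)} \\ &= \frac{A^2}{n} - \frac{S^{\ast 2}}{N} \\[6pt] e^2 = z_{\alpha/2}^2 \left(\frac{A^2}{n} - \frac{S^{\ast 2}}{N}\right) \;&\Longrightarrow\; n = \frac{z_{\alpha/2}^2 A^2}{e^2 + \dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}} \end{aligned}\]

The numerator carries the squared weighted average of standard deviations (\(A^2\)), while the FPC term in the denominator carries the weighted average of variances (\(S^{\ast 2}\)).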

Example

  • Same set-up as before:

| Stratum | Definition | \(\mathbf{N_h}\) | True Proportion | \(\mathbf{V(y)}\) in Stratum |
|---|---|---|---|---|
| 1 | Age 18-64 | 750,000 | \(p_1=0.3\) | \(S_1^2 = 0.21\) |
| 2 | Age 65+ | 250,000 | \(p_2=0.6\) | \(S_2^2 = 0.24\) |

  • Now use Neyman allocation to achieve an MOE of 1% for a 95% CI

  • How large a sample do we need?

  • Necessary pieces:

    • \(e = 0.01\) (margin of error of 1%)

    • \(z_{\alpha/2} = 1.96\) (b/c 95% CI)

    • \(N\) = 1,000,000 and \(\frac{N_1}{N} = 0.75\) and \(\frac{N_2}{N}=0.25\)

    • \(S^{\ast 2} = 0.2175\) (weighted average of the stratum-level variances, as used for proportional allocation)

    • \(\left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2 = \left(\frac{N_1}{N}S_1 + \frac{N_2}{N}S_2\right)^2 = (0.75\sqrt{0.21} + 0.25\sqrt{0.24})^2 = 0.4662^2 = 0.2173\)
      (weighted average of the stratum-level standard deviations, squared)

Example (con’t)

  • Plug in the pieces: \[\begin{aligned} n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2 + {\dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}}} & = \frac{1.96^2 (0.2173)}{0.01^2 + \dfrac{1.96^2 (0.2175)}{1000000}} = \frac{0.8348}{0.0001+0.000000835548} = \mathbf{8278.9} \end{aligned}\]

  • Only very slightly smaller than proportional allocation! \((n=8286.2)\) – Why?

  • To determine allocation to each stratum, have to go back to allocation formula:

    • Stratum 1: \(n_1 = n \left( \frac{N_1 S_1}{N_1 S_1 + N_2 S_2} \right) = 8278.9 \left( \frac{750000 \sqrt{0.21}}{750000 \sqrt{0.21} + 250000 \sqrt{0.24}} \right) = 6103.8\)

    • Stratum 2: \(n_2 = n \left( \frac{N_2 S_2}{N_1 S_1 + N_2 S_2} \right) = 8278.9 \left( \frac{250000 \sqrt{0.24}}{750000 \sqrt{0.21} + 250000 \sqrt{0.24}} \right) = 2175.1\)

    • Round to \(n_1 = 6104\) and \(n_2=2175\)

  • Compared to proportional allocation \((n_1 = 6215, n_2 = 2072)\), with Neyman allocation we take more observations from the more variable stratum (\(h=2\), older age group, which has higher \(S_h^2\))
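The Neyman example can be reproduced the same way. This is a sketch under stated assumptions: the function names `n_neyman` and `neyman_split` are hypothetical, and stratum sizes are rounded to the nearest integer as in the example. Carrying full precision gives a total of roughly 8279.1 rather than 8278.9; the small gap comes from rounding intermediate values in the hand calculation.

```python
import math

def n_neyman(e, z, N, W_h, S2_h):
    """Total n for a stratified sample with Neyman allocation."""
    avg_sd = sum(W * math.sqrt(S2) for W, S2 in zip(W_h, S2_h))  # sum of W_h * S_h
    s_star2 = sum(W * S2 for W, S2 in zip(W_h, S2_h))            # sum of W_h * S_h^2
    return z**2 * avg_sd**2 / (e**2 + z**2 * s_star2 / N)

def neyman_split(n, N_h, S2_h):
    """Unrounded n_h proportional to N_h * S_h."""
    w = [Nh * math.sqrt(S2) for Nh, S2 in zip(N_h, S2_h)]
    total = sum(w)
    return [n * x / total for x in w]

N_h = [750_000, 250_000]
S2_h = [0.21, 0.24]
N = sum(N_h)
W_h = [Nh / N for Nh in N_h]

n = n_neyman(e=0.01, z=1.96, N=N, W_h=W_h, S2_h=S2_h)
n_h = [round(x) for x in neyman_split(n, N_h, S2_h)]  # round to nearest
print(round(n, 1), n_h)
```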

Compare to SRS

Useful to step back and compare the formulae to see what drives the differences:

| Sampling Method | Sample Size Formula | Sample Size Ignoring FPC |
|---|---|---|
| SRS | \(\displaystyle n = \frac{z_{\alpha/2}^2 S^2}{e^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}}\) | \(\displaystyle n = \frac{z_{\alpha/2}^2 S^2}{e^2}\) |
| Stratified, Proportional | \(\displaystyle n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h^2\right)}{e^2 + \dfrac{z_{\alpha/2}^2}{N} \left(\sum_{h=1}^H \frac{N_h}{N}S_h^2 \right)}\) | \(\displaystyle n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h^2\right)}{e^2}\) |
| Stratified, Neyman | \(\displaystyle n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2 + \dfrac{z_{\alpha/2}^2}{N}\left(\sum_{h=1}^H \frac{N_h}{N}S_h^2 \right)}\) | \(\displaystyle n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2}\) |

Remember that these calculations all require some estimate/guess of variance(s)
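To make the comparison concrete, the sketch below evaluates the three "ignoring FPC" formulas for the two-stratum example population. It is illustrative only: the overall variance is taken as \(S^2 \approx p(1-p)\) with \(p = 0.75(0.3) + 0.25(0.6) = 0.375\), which assumes \(N/(N-1) \approx 1\), and the variable names are made up for this example.

```python
import math

z, e = 1.96, 0.01
W_h = [0.75, 0.25]          # stratum shares N_h / N
p_h = [0.3, 0.6]            # stratum proportions
S2_h = [p * (1 - p) for p in p_h]

# Overall proportion and (approximate) overall variance for SRS
p = sum(W * ph for W, ph in zip(W_h, p_h))
S2 = p * (1 - p)            # assumes N/(N-1) is approximately 1

# Sample sizes ignoring the FPC
n_srs = z**2 * S2 / e**2
n_prop = z**2 * sum(W * s2 for W, s2 in zip(W_h, S2_h)) / e**2
n_neyman = z**2 * sum(W * math.sqrt(s2) for W, s2 in zip(W_h, S2_h))**2 / e**2

print(round(n_srs, 2), round(n_prop, 2), round(n_neyman, 2))
```

Only the variance term in the numerator changes across the three rows, so the ordering of sample sizes follows the ordering of \(S^2\), \(\sum \frac{N_h}{N} S_h^2\), and \(\left(\sum \frac{N_h}{N} S_h\right)^2\).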

Activity 9.1 (Part 1)

Stratification: Sample Size and Design Effects (Part 1)

Design Effect: deff

\[\text{deff}(\text{plan, statistic}) = \frac{V(\text{estimator from sampling plan})}{V(\text{estimator from SRS with same total sample size})}\]

  • Some history:

    • Cornfield (1951) – measure the efficiency of a sampling plan by the ratio of the variance that would be obtained from an SRS of \(k\) units to the variance obtained from the complex sampling plan with \(k\) units

    • Kish (1965) – turn that ratio upside-down and call it the design effect (deff)

  • It may be old, but the design effect is in (constant) use today

  • Note that the design effect is statistic dependent – a different \(y\) and/or a different statistic (e.g., mean vs. regression coefficient) will have a different design effect under the same sampling plan

Design Effect: deff (con’t)

  • In terms of variance for that specific statistic:

    • \(\text{deff} < 1 \rightarrow\) your design is better than SRS (in terms of variance/efficiency)

    • \(\text{deff} > 1 \rightarrow\) your design is worse than SRS (in terms of variance/efficiency)

  • For a mean under stratified sampling with proportional allocation: \[\begin{aligned} \text{deff} = \frac{V_{prop}(\bar{y}_{str})}{V_{srs}(\bar{y})} &= \frac{\sum_{h=1}^H \frac{N_h}{N} \left(1-\frac{n}{N}\right)\frac{S_h^2}{n}}{\left(1-\frac{n}{N}\right)\frac{S^2}{n}} = \frac{\sum_{h=1}^H \frac{N_h}{N} S_h^2}{S^2} \\ & \textcolor{red}{\text{assuming }(N-1)/N \approx 1} \textcolor{red}{\text{ and } (N_h-1)/N_h \approx 1:}\\ &= \frac{\sum_{h=1}^H \frac{N_h}{N} S_h^2}{\sum_{h=1}^H \frac{N_h}{N} S_h^2 +\sum_{h=1}^H \frac{N_h}{N} (\bar{y}_{hU}-\bar{y}_U)^2} \le 1 \end{aligned}\]

  • deff for a stratified sample with proportional or Neyman allocation will usually be less than 1

    • Note that other allocation methods (e.g., equal allocation) might not give deff \(<1\)

  • Often deff \(>1\) for domain means, i.e., estimates for subgroups that cross strata
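As a numeric illustration of the proportional-allocation deff formula, the sketch below applies it to the two-stratum example population from earlier in the lecture. It uses the same approximations as the derivation (\((N-1)/N \approx 1\) and \((N_h-1)/N_h \approx 1\)); the variable names `within` and `between` are labels chosen here, not notation from the slides.

```python
W_h = [0.75, 0.25]   # stratum shares N_h / N
p_h = [0.3, 0.6]     # stratum means (proportions)

# Overall population mean
p = sum(W * ph for W, ph in zip(W_h, p_h))

# Within-stratum component: weighted average of stratum variances
within = sum(W * ph * (1 - ph) for W, ph in zip(W_h, p_h))

# Between-stratum component: weighted squared deviations of stratum means
between = sum(W * (ph - p) ** 2 for W, ph in zip(W_h, p_h))

deff = within / (within + between)
print(within, between, deff)
```

Here the between-stratum component is small relative to the within-stratum component, so deff is only modestly below 1 and stratification buys a limited reduction in sample size.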

A Slightly Different Design Effect: deft

  • A similar design effect is sometimes used that compares the complex design to SRSWR (SRS with replacement) and uses SEs instead of variances: \[\text{deft}(\text{plan, statistic}) = \frac{\textcolor{blue}{SE}(\text{estimator from sampling plan})}{\textcolor{blue}{SE}(\text{estimator from \textcolor{red}{SRSWR} with same total sample size})}\]

  • When the sampling fraction (\(n/N\)) is very small, then \(\text{deft} \approx \sqrt{\text{deff}}\)

  • Why?

  • Because then variance under SRS is very close to variance under SRSWR

  • Design effects are often used for estimating sample sizes for surveys:

    • Have prior guess at the deff (e.g., from a prior survey)

    • Calculate the sample size assuming SRS and multiply by deff

    • Often used for complex designs involving clustering
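A minimal sketch of both ideas on this slide follows. The function names, the deff value of 1.5, and the SRS sample size of 9004 are hypothetical, chosen only for illustration. The deft helper assumes a mean, with \(V_{SRS} = (1 - n/N) S^2/n\) and \(V_{SRSWR} \approx S^2/n\), so that \(\text{deft}^2 \approx \text{deff} \times (1 - n/N)\).

```python
import math

def n_from_deff(n_srs, deff):
    """Inflate (or deflate) an SRS sample size by a guessed design effect."""
    return math.ceil(n_srs * deff)

def deft_from_deff(deff, sampling_fraction):
    """Approximate deft for a mean, given deff and the sampling fraction n/N.

    Assumes V_SRS = (1 - n/N) * V_SRSWR, so deft^2 = deff * (1 - n/N).
    """
    return math.sqrt(deff * (1 - sampling_fraction))

# Hypothetical values for illustration only
n_srs = 9004        # sample size computed as if the design were SRS
deff_guess = 1.5    # e.g., taken from a prior clustered survey

n_complex = n_from_deff(n_srs, deff_guess)
deft = deft_from_deff(deff_guess, sampling_fraction=0.01)
print(n_complex, deft, math.sqrt(deff_guess))
```

With a 1% sampling fraction the two quantities differ only in the third decimal place, which is why deft is commonly treated as \(\sqrt{\text{deff}}\) in large populations.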

Activity 9.1 (Part 2)

Stratification: Sample Size and Design Effects (Part 2)