Stratified Sampling:
Sample Size and DEFF
PUBHBIO 7225 Lecture 9
\[\text{CI: } \bar{y}_{str}\pm \underbrace{z_{\alpha/2} \sqrt{V(\bar{y}_{str})}}_{e = \text{half-width of CI}}\]
Just as with SRS, the formula for the necessary sample size \(n\) is found by plugging in the formula for the variance and solving for \(n\).
But, you have to pick an allocation method first. Why?
Because \(V(\bar{y}_{str})\) depends on the \(\textcolor{blue}{n_h}\): \(V(\bar{y}_{str}) = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left( 1 - \frac{\textcolor{blue}{n_h}}{N_h}\right) \frac{S_h^2}{\textcolor{blue}{n_h}}\)
Equal allocation: \(n_h = \frac{n}{H}\)
Proportional allocation: \(n_h = n \frac{N_h}{N}\)
Neyman allocation: \(n_h = n \left( \frac{N_h S_h}{\sum_{l =1}^H N_l S_l} \right)\)
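The three allocation rules above can be sketched in a few lines of Python. This is an illustrative sketch, not course code: the function name `allocate` and the total sample size of 1000 are made up for the demonstration, and the stratum sizes and variances are borrowed from the age-group example used later in this lecture.

```python
import math

def allocate(n, N_h, S_h=None, method="proportional"):
    """Split a total sample size n across strata (results are not rounded)."""
    H = len(N_h)
    if method == "equal":
        # n_h = n / H
        return [n / H for _ in N_h]
    if method == "proportional":
        # n_h = n * N_h / N
        N = sum(N_h)
        return [n * Nh / N for Nh in N_h]
    if method == "neyman":
        # n_h = n * N_h S_h / sum_l N_l S_l
        if S_h is None:
            raise ValueError("Neyman allocation needs stratum standard deviations S_h")
        total = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
        return [n * Nh * Sh / total for Nh, Sh in zip(N_h, S_h)]
    raise ValueError(f"unknown method: {method}")

# Hypothetical total n = 1000; stratum inputs from the lecture's age-group example
N_h = [750_000, 250_000]
S_h = [math.sqrt(0.21), math.sqrt(0.24)]
for method in ("equal", "proportional", "neyman"):
    print(method, [round(x, 1) for x in allocate(1000, N_h, S_h, method)])
```

Whatever the method, the \(n_h\) sum to \(n\); only the split changes.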
Simplified version of \(V(\bar{y}_{str})\) under proportional allocation (obtained by plugging in \(n_h = n \frac{N_h}{N}\)): \[V(\bar{y}_{str}) = \sum_{h=1}^H \frac{N_h}{N}\left(1-\frac{n}{N}\right)\frac{S_h^2}{n}\]
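A quick numerical check of this simplification, as a sketch: the general variance formula evaluated at \(n_h = n N_h / N\) should match the simplified form. The function names and the choice \(n = 1000\) are illustrative assumptions; the stratum sizes and variances come from the age-group example in this lecture.

```python
def var_str(N_h, n_h, S2_h):
    """General V(ybar_str) = sum_h (N_h/N)^2 (1 - n_h/N_h) S_h^2 / n_h."""
    N = sum(N_h)
    return sum((Nh / N) ** 2 * (1 - nh / Nh) * S2 / nh
               for Nh, nh, S2 in zip(N_h, n_h, S2_h))

def var_str_prop(N_h, n, S2_h):
    """Simplified V(ybar_str) under proportional allocation."""
    N = sum(N_h)
    return sum((Nh / N) * (1 - n / N) * S2 / n
               for Nh, S2 in zip(N_h, S2_h))

N_h = [750_000, 250_000]
S2_h = [0.21, 0.24]
n = 1000  # hypothetical total sample size

# Proportional allocation: n_h = n * N_h / N
n_h = [n * Nh / sum(N_h) for Nh in N_h]

general = var_str(N_h, n_h, S2_h)
simplified = var_str_prop(N_h, n, S2_h)
print(general, simplified)  # the two agree up to floating-point error
```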
Plugging this into the general formula for MOE (previous slide): \[e = z_{\alpha/2} \sqrt{V(\bar{y}_{str})} = z_{\alpha/2} \sqrt{\sum_{h=1}^H \frac{N_h}{N}\left(1-\frac{n}{N}\right)\frac{S_h^2}{n}}\]
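Square both sides: \[\begin{aligned} e^2 &{}= z_{\alpha/2}^2 \left[{\sum_{h=1}^H \frac{N_h}{N}\left(1-\frac{n}{N}\right)\frac{S_h^2}{n}}\right] \end{aligned}\]
Solve for \(n\), writing \(S^{\ast 2} = \sum_{h=1}^H \frac{N_h}{N} S_h^2\): \[n = \frac{z_{\alpha/2}^2 S^{\ast 2}}{e^2 + \dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}}\]
Similar to the SRS formula, with \(S^{\ast 2}\) instead of \(S^2\)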
| Stratum | Definition | \(\mathbf{N_h}\) | True Proportion | \(\mathbf{V(y)}\) in Stratum |
|---|---|---|---|---|
| 1 | Age 18-64 | 750,000 | \(p_1=0.3\) | \(S_1^2 = 0.21\) |
| 2 | Age 65+ | 250,000 | \(p_2=0.6\) | \(S_2^2 = 0.24\) |
We want to use proportional allocation to achieve an MOE of 1% for a 95% CI
How large a sample do we need?
Necessary pieces:
\(e = 0.01\) (margin of error of 1%)
\(z_{\alpha/2} = 1.96\) (b/c 95% CI)
\(N\) = 1,000,000 and \(\frac{N_1}{N} = 0.75\) and \(\frac{N_2}{N}=0.25\)
\(\displaystyle S^{\ast 2} = \sum_{h=1}^H \frac{N_h}{N} S_h^2 = \frac{N_1}{N} S_1^2 + \frac{N_2}{N} S_2^2 = 0.75(0.21) + 0.25(0.24) = 0.2175\)
(weighted average of the stratum-level variances)
Plug in the pieces: \[\begin{aligned} n &= \frac{z_{\alpha/2}^2 S^{\ast 2}}{e^2 + \dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}} = \frac{1.96^2 (0.2175)}{0.01^2 + \dfrac{1.96^2 (0.2175)}{1000000}} \\ &= \frac{0.835548}{0.0001+0.000000835548} = \frac{0.835548}{0.0001008355} = \mathbf{8286.2} \end{aligned}\]
Splitting this total \(n\) between the two strata:
Stratum 1: \(n_1 = n \times \frac{N_1}{N} = 8286.2 \times 0.75 = 6214.7 \rightarrow n_1 = 6215\)
Stratum 2: \(n_2 = n \times \frac{N_2}{N} = 8286.2 \times 0.25 = 2071.6 \rightarrow n_2 = 2072\)
Note that both \(n_h\) are rounded up; this makes the total sample size \(n = 6215 + 2072 = \mathbf{8287}\)
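The proportional-allocation calculation can be reproduced with a short script. This is a sketch under the slide's inputs; the function name `n_proportional` is an illustrative choice, and rounding each \(n_h\) up with `math.ceil` mirrors the rounding used on the slide.

```python
import math

def n_proportional(e, z, N_h, S2_h):
    """Total n for margin of error e under proportional allocation (with FPC)."""
    N = sum(N_h)
    # S*^2: weighted average of the stratum-level variances
    S2_star = sum(Nh / N * S2 for Nh, S2 in zip(N_h, S2_h))
    return z**2 * S2_star / (e**2 + z**2 * S2_star / N)

N_h = [750_000, 250_000]   # stratum population sizes
S2_h = [0.21, 0.24]        # stratum variances p_h (1 - p_h)

n = n_proportional(e=0.01, z=1.96, N_h=N_h, S2_h=S2_h)

# Split proportionally and round each stratum up
N = sum(N_h)
n_h = [math.ceil(n * Nh / N) for Nh in N_h]

print(round(n, 1), n_h, sum(n_h))  # 8286.2 [6215, 2072] 8287
```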
Same type of calculation, using the simplified version of \(V(\bar{y}_{str})\) under Neyman allocation: \[V(\bar{y}_{str}) = \frac{1}{n} \left(\sum_{h=1}^H \frac{N_h}{N} S_h\right)^2 - \frac{1}{N} \sum_{h=1}^H \frac{N_h}{N} S_h^2\]
Again, plugging into the MOE formula and solving for \(n\):
\[\begin{aligned} e &= z_{\alpha/2} \sqrt{V(\bar{y}_{str})} = z_{\alpha/2} \sqrt{\frac{1}{n} \left(\sum_{h=1}^H \frac{N_h}{N} S_h\right)^2 - \frac{1}{N} \sum_{h=1}^H \frac{N_h}{N} S_h^2} \\ & \text{...square both sides, do a bunch of algebra ...} \\ n & = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2 + \dfrac{z_{\alpha/2}^2}{N}\left(\sum_{h=1}^H \frac{N_h}{N}S_h^2 \right)} = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2 + {\dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}}} \end{aligned}\]
Warning: Note the different pieces containing \(S_h\): the numerator uses the squared weighted average of the \(S_h\), while the denominator uses the weighted average of the \(S_h^2\) (that is, \(S^{\ast 2}\))
| Stratum | Definition | \(\mathbf{N_h}\) | True Proportion | \(\mathbf{V(y)}\) in Stratum |
|---|---|---|---|---|
| 1 | Age 18-64 | 750,000 | \(p_1=0.3\) | \(S_1^2 = 0.21\) |
| 2 | Age 65+ | 250,000 | \(p_2=0.6\) | \(S_2^2 = 0.24\) |
Now use Neyman allocation to achieve an MOE of 1% for a 95% CI
How large a sample do we need?
Necessary pieces:
\(e = 0.01\) (margin of error of 1%)
\(z_{\alpha/2} = 1.96\) (b/c 95% CI)
\(N\) = 1,000,000 and \(\frac{N_1}{N} = 0.75\) and \(\frac{N_2}{N}=0.25\)
\(S^{\ast 2} = 0.2175\) (weighted average of the stratum-level variances, as used for proportional allocation)
\(\left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2 = \left(\frac{N_1}{N}S_1 + \frac{N_2}{N}S_2\right)^2 = (0.75\sqrt{0.21} + 0.25\sqrt{0.24})^2 = 0.4662^2 = 0.2173\)
(weighted average of the stratum-level standard deviations, squared)
Plug in the pieces: \[\begin{aligned} n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2 + {\dfrac{z_{\alpha/2}^2 S^{\ast 2}}{N}}} & = \frac{1.96^2 (0.2173)}{0.01^2 + \dfrac{1.96^2 (0.2175)}{1000000}} = \frac{0.8348}{0.0001+0.000000835548} = \mathbf{8278.9} \end{aligned}\]
Only very slightly smaller than proportional allocation! \((n=8286.2)\) – Why?
To determine allocation to each stratum, have to go back to allocation formula:
Stratum 1: \(n_1 = n \left( \frac{N_1 S_1}{N_1 S_1 + N_2 S_2} \right) = 8278.9 \left( \frac{750000 \sqrt{0.21}}{750000 \sqrt{0.21} + 250000 \sqrt{0.24}} \right) = 6103.8\)
Stratum 2: \(n_2 = n \left( \frac{N_2 S_2}{N_1 S_1 + N_2 S_2} \right) = 8278.9 \left( \frac{250000 \sqrt{0.24}}{750000 \sqrt{0.21} + 250000 \sqrt{0.24}} \right) = 2175.1\)
Round to \(n_1 = 6104\) and \(n_2=2175\)
Compared to proportional allocation \((n_1 = 6215, n_2 = 2072)\), with Neyman allocation we take more observations from the more variable stratum (\(h=2\), older age group, which has higher \(S_h^2\))
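The Neyman calculation can be sketched the same way. The function names are illustrative assumptions. Because the script carries unrounded intermediate values, its total \(n\) comes out a fraction of a unit above the slide's 8278.9 (the slide rounds the squared term to 0.2173); the rounded stratum sizes are the same.

```python
import math

def n_neyman(e, z, N_h, S_h):
    """Total n for margin of error e under Neyman allocation (with FPC)."""
    N = sum(N_h)
    W = [Nh / N for Nh in N_h]
    # Numerator piece: (weighted average of the stratum SDs), squared
    avg_sd_sq = sum(w * s for w, s in zip(W, S_h)) ** 2
    # Denominator piece: S*^2, weighted average of the stratum variances
    S2_star = sum(w * s**2 for w, s in zip(W, S_h))
    return z**2 * avg_sd_sq / (e**2 + z**2 * S2_star / N)

def neyman_split(n, N_h, S_h):
    """n_h = n * N_h S_h / sum_l N_l S_l (not rounded)."""
    total = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    return [n * Nh * Sh / total for Nh, Sh in zip(N_h, S_h)]

N_h = [750_000, 250_000]
S_h = [math.sqrt(0.21), math.sqrt(0.24)]

n = n_neyman(e=0.01, z=1.96, N_h=N_h, S_h=S_h)
n1, n2 = neyman_split(n, N_h, S_h)
print(round(n, 1), round(n1), round(n2))
```

The two stratum standard deviations here are close (about 0.458 and 0.490), which is why the split moves only modestly away from the proportional one.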
Useful to step back and compare the formulae to see what drives the differences:
| Sampling Method | Sample Size Formula | Sample Size Ignoring FPC |
|---|---|---|
| SRS | \(\displaystyle n = \frac{z_{\alpha/2}^2 S^2}{e^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}}\) | \(\displaystyle n = \frac{z_{\alpha/2}^2 S^2}{e^2}\) |
| Stratified, Proportional | \(\displaystyle n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h^2\right)}{e^2 + \dfrac{z_{\alpha/2}^2}{N} \left(\sum_{h=1}^H \frac{N_h}{N}S_h^2 \right)}\) | \(\displaystyle n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h^2\right)}{e^2}\) |
| Stratified, Neyman | \(\displaystyle n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2 + \dfrac{z_{\alpha/2}^2}{N}\left(\sum_{h=1}^H \frac{N_h}{N}S_h^2 \right)}\) | \(\displaystyle n = \frac{z_{\alpha/2}^2 \left(\sum_{h=1}^H \frac{N_h}{N}S_h\right)^2}{e^2}\) |
Remember that these calculations all require some estimate/guess of variance(s)
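To see the three formulas side by side on the age-group example, here is a small sketch. It makes one assumption that is not on the slides: the overall population variance is taken as \(S^2 \approx p(1-p)\) with \(p = 0.75(0.3) + 0.25(0.6) = 0.375\), i.e., treating \(N/(N-1) \approx 1\). The helper name `solve_n` is also illustrative.

```python
import math

z, e = 1.96, 0.01
N_h = [750_000, 250_000]
p_h = [0.3, 0.6]

N = sum(N_h)
W = [Nh / N for Nh in N_h]
S2_h = [ph * (1 - ph) for ph in p_h]           # 0.21 and 0.24

# Assumption: overall variance S^2 ~ p(1 - p), with N/(N-1) ~ 1
p_all = sum(w * ph for w, ph in zip(W, p_h))   # 0.375
S2 = p_all * (1 - p_all)

S2_star = sum(w * s2 for w, s2 in zip(W, S2_h))                  # sum W_h S_h^2
avg_sd_sq = sum(w * math.sqrt(s2) for w, s2 in zip(W, S2_h))**2  # (sum W_h S_h)^2

def solve_n(numer_var, fpc_var):
    """n = z^2 * numer_var / (e^2 + z^2 * fpc_var / N)."""
    return z**2 * numer_var / (e**2 + z**2 * fpc_var / N)

n_srs = solve_n(S2, S2)
n_prop = solve_n(S2_star, S2_star)
n_ney = solve_n(avg_sd_sq, S2_star)

for label, n in [("SRS", n_srs), ("Proportional", n_prop), ("Neyman", n_ney)]:
    print(f"{label:>12}: n = {n:.1f}")
```

Under these inputs the ordering is Neyman < proportional < SRS, and the only thing that changes between rows is which variance-like quantity sits in the numerator.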
Stratification: Sample Size and Design Effects (Part 1)
\[\text{deff}(\text{plan, statistic}) = \frac{V(\text{estimator from sampling plan})}{V(\text{estimator from SRS with same total sample size})}\]
Some history:
Cornfield (1951) – measure the efficiency of a sampling plan by the ratio of the variance that would be obtained from an SRS of \(k\) units to the variance obtained from the complex sampling plan with \(k\) units
Kish (1965) – turn that ratio upside-down and call it the design effect (deff)
It may be old, but the design effect is in (constant) use today
Note that the design effect is statistic dependent – a different \(y\) and/or a different statistic (e.g., mean vs. regression coefficient) will have a different design effect under the same sampling plan
In terms of variance for that specific statistic:
\(\text{deff} < 1 \rightarrow\) your design is better than SRS (in terms of variance/efficiency)
\(\text{deff} > 1 \rightarrow\) your design is worse than SRS (in terms of variance/efficiency)
For a mean under stratified sampling with proportional allocation: \[\begin{aligned} \text{deff} = \frac{V_{prop}(\bar{y}_{str})}{V_{srs}(\bar{y})} &= \frac{\sum_{h=1}^H \frac{N_h}{N} \left(1-\frac{n}{N}\right)\frac{S_h^2}{n}}{\left(1-\frac{n}{N}\right)\frac{S^2}{n}} = \frac{\sum_{h=1}^H \frac{N_h}{N} S_h^2}{S^2} \\ & \textcolor{red}{\text{assuming }(N-1)/N \approx 1} \textcolor{red}{\text{ and } (N_h-1)/N_h \approx 1:}\\ &= \frac{\sum_{h=1}^H \frac{N_h}{N} S_h^2}{\sum_{h=1}^H \frac{N_h}{N} S_h^2 +\sum_{h=1}^H \frac{N_h}{N} (\bar{y}_{hU}-\bar{y}_U)^2} \le 1 \end{aligned}\]
deff for a stratified sample with proportional or Neyman allocation will usually be less than 1
Often deff \(>1\) for domain means (estimates for subgroups that cross strata)
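As a rough numerical illustration of the proportional-allocation deff, the sketch below evaluates the within/between decomposition for the age-group example from this lecture. It relies on the same approximations as the derivation, \((N-1)/N \approx 1\) and \((N_h-1)/N_h \approx 1\); the variable names are illustrative.

```python
N_h = [750_000, 250_000]
p_h = [0.3, 0.6]           # stratum means of the 0/1 outcome

N = sum(N_h)
W = [Nh / N for Nh in N_h]

# Within-stratum piece: sum_h W_h S_h^2, with S_h^2 ~ p_h (1 - p_h)
within = sum(w * ph * (1 - ph) for w, ph in zip(W, p_h))

# Between-stratum piece: sum_h W_h (ybar_hU - ybar_U)^2
p_all = sum(w * ph for w, ph in zip(W, p_h))
between = sum(w * (ph - p_all) ** 2 for w, ph in zip(W, p_h))

# deff = within / (within + between), which can never exceed 1
deff = within / (within + between)
print(round(within, 4), round(between, 6), round(deff, 3))
```

The further apart the stratum means are, the larger the between-stratum piece and the smaller the deff.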
A similar design effect is sometimes used that compares the complex design to SRSWR (SRS with replacement) and uses standard errors (SEs) instead of variances: \[\text{deft}(\text{plan, statistic}) = \frac{\textcolor{blue}{SE}(\text{estimator from sampling plan})}{\textcolor{blue}{SE}(\text{estimator from \textcolor{red}{SRSWR} with same total sample size})}\]
When the sampling fraction (\(n/N\)) is very small, then \(\text{deft} \approx \sqrt{\text{deff}}\)
Why?
Because then variance under SRS is very close to variance under SRSWR
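A small sketch of why the sampling fraction matters here. It assumes \(V_{SRS} \approx (1 - n/N)\, V_{SRSWR}\), which gives \(\text{deft}^2 \approx \text{deff} \times (1 - n/N)\). The deff value of 0.9 and the sample sizes are hypothetical, chosen only to show the pattern.

```python
import math

def deft_from_deff(deff, n, N):
    """deft implied by deff, assuming V_SRS ~ (1 - n/N) * V_SRSWR."""
    return math.sqrt(deff * (1 - n / N))

deff = 0.9        # hypothetical design effect
N = 1_000_000     # hypothetical population size

for n in (100, 10_000, 500_000):
    deft = deft_from_deff(deff, n, N)
    print(f"n/N = {n / N:.4f}: deft = {deft:.4f}, sqrt(deff) = {math.sqrt(deff):.4f}")
```

With a tiny sampling fraction the two numbers are nearly identical; they separate only when the sample is a substantial share of the population.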
Design effects are often used for estimating sample sizes for surveys:
Have prior guess at the deff (e.g., from a prior survey)
Calculate the sample size assuming SRS and multiply by deff
Often used for complex designs involving clustering
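The deff-based recipe above can be sketched as follows. All numeric inputs (a 3% margin of error, \(S^2 = 0.25\), and a prior deff of 1.5) are hypothetical, and the function names are illustrative; the SRS formula is the one from the comparison table, with the FPC applied only when a population size is supplied.

```python
import math

def n_srs(e, z, S2, N=None):
    """SRS sample size; ignores the FPC when N is not given."""
    if N is None:
        return z**2 * S2 / e**2
    return z**2 * S2 / (e**2 + z**2 * S2 / N)

def n_complex(e, z, S2, deff, N=None):
    """Inflate (or deflate) the SRS sample size by a prior guess at deff."""
    return math.ceil(deff * n_srs(e, z, S2, N))

# Hypothetical planning inputs
base = n_srs(e=0.03, z=1.96, S2=0.25)
planned = n_complex(e=0.03, z=1.96, S2=0.25, deff=1.5)
print(round(base, 1), planned)  # 1067.1 1601
```

A deff above 1, typical of clustered designs, pushes the required sample size above the SRS figure; a deff below 1 would pull it down.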
Stratification: Sample Size and Design Effects (Part 2)