Stratified Sampling:
Allocation Methods

PUBHBIO 7225 Lecture 7

Outline

Topics

Allocation Methods for Stratified Sampling
- Equal Allocation
- Proportional Allocation
- Neyman Allocation
- Optimal Allocation

Activities

7.1 Comparing Allocation Methods

Readings

Valliant R, Dever J, and Kreuter F (2018) Practical Tools for Designing and Weighting Survey Samples, 2nd edition – Section 3.1.3 (link on Carmen)

Assignments

Problem Set 2 due Thursday 9/18/2025 11:59pm via Carmen

Quick Note About Minimum Sample Size Per Stratum

For each stratum, \(h=1, \dots, H\), we require \(n_h \ge 2\) (minimum of 2 units sampled per stratum)
Why is this true?

Stratified Sampling: Allocation Methods

Allocation Method = how the total sample size \(n\) is distributed across the strata

Most often you decide how many units total you can afford to sample \((n)\) and then split (“allocate”) that across the strata with one of four methods:

Equal Allocation – \(n_h\) the same for all strata
Proportional Allocation – Allocate sample to the strata proportional to the population size in each stratum
Neyman Allocation – Allocate sample to the strata according to the estimated variability of \(y\) and the population size in each stratum
Optimal Allocation – Neyman Allocation, but also accounting for the cost to sample each unit (which may vary by stratum)

Equal Allocation

Equal Allocation = Number of sampled units in each stratum is the same \[n_h = \frac{n}{H} \quad \forall ~ h\]

Often used if goal is to compare stratum means/proportions
- Want the same precision for each mean/proportion being compared
- Specify desired MOE for a stratum and calculate necessary \(n_h\) based on SRS formula

If stratum sizes \((N_h)\) are unequal:
- Probability of selection for a unit will differ by stratum
- Sampling weights will differ
- Not EPSEM!

Downside: the estimate of the overall mean is typically less precise than under the other allocation methods
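A minimal Python sketch of equal allocation, to make the "not EPSEM" point concrete. The stratum sizes (1,000, 3,000, 6,000), the total sample size \(n=300\), and the function name `equal_allocation` are hypothetical choices for illustration only.

```python
def equal_allocation(n, stratum_sizes):
    """Split n evenly across H strata and report, per stratum,
    the selection probability n_h / N_h and the weight N_h / n_h."""
    H = len(stratum_sizes)
    n_h = n / H  # may be fractional; rounding is left to the analyst
    return [
        {"N_h": N_h, "n_h": n_h, "pi_h": n_h / N_h, "w_h": N_h / n_h}
        for N_h in stratum_sizes
    ]

# Hypothetical population with unequal stratum sizes
design = equal_allocation(n=300, stratum_sizes=[1000, 3000, 6000])
for row in design:
    print(row)
```

With these made-up sizes every stratum gets \(n_h = 100\), but the weights are 10, 30, and 60, so the sample is not self-weighting.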

Proportional Allocation

Proportional Allocation = Number of sampled units in each stratum is proportional to the size of the stratum (relative to total population size) \[n_h = n \frac{N_h}{N}\]

Resulting sample is EPSEM (self-weighting, all weights equal)
- \(P(\)selection for unit \(j\) in stratum \(h)=\pi_{hj} = n_h/N_h = n/N\)
- Sample weights: \(w_{hj} = N/n\)

Formula for the variance of the sample mean, \(\bar{y}_{str}\), can thus be simplified: \[\begin{flalign} V_{prop}(\bar{y}_{str}) &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n \frac{N_h}{N}}{N_h}\right) \frac{S_h^2}{n \frac{N_h}{N}} & \\ &= \sum_{h=1}^H \left(\frac{N_h}{N}\right) \left(1-\frac{n}{N}\right) \frac{S_h^2}{n} \\ &= \sum_{h=1}^H \left(\frac{N_h}{N}\right) \frac{S_h^2}{n} - \frac{1}{N} \sum_{h=1}^H \frac{N_h}{N} S_h^2 \end{flalign}\]
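A short Python sketch that checks the simplification numerically: plugging \(n_h = n N_h/N\) into the general stratified variance formula should give the same value as \(\left(1-\frac{n}{N}\right)\frac{1}{n}\sum_h \frac{N_h}{N} S_h^2\). The stratum sizes, the variances \(S_h^2\), and the function names are hypothetical.

```python
def stratified_variance(N_h, n_h, S2_h):
    """General formula: sum of (N_h/N)^2 (1 - n_h/N_h) S_h^2 / n_h."""
    N = sum(N_h)
    return sum((Nh / N) ** 2 * (1 - nh / Nh) * S2 / nh
               for Nh, nh, S2 in zip(N_h, n_h, S2_h))

def proportional_allocation(n, N_h):
    """n_h = n * N_h / N (fractional values are not rounded here)."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

def v_prop_simplified(n, N_h, S2_h):
    """Simplified form: (1 - n/N) (1/n) sum of (N_h/N) S_h^2."""
    N = sum(N_h)
    return (1 - n / N) / n * sum(Nh / N * S2 for Nh, S2 in zip(N_h, S2_h))

# Hypothetical inputs
N_h = [1000, 3000, 6000]
S2_h = [4.0, 9.0, 1.0]
n = 500

n_h = proportional_allocation(n, N_h)
v_general = stratified_variance(N_h, n_h, S2_h)
v_simple = v_prop_simplified(n, N_h, S2_h)
print(n_h, v_general, v_simple)
```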

Activity 7.1 (Part 1)

Comparing Allocation Methods (Part 1)

Proportional Allocation vs. SRS

When the strata are large enough, the theoretical variance of \(\bar{y}_{str}\) under proportional allocation is at most as large as the variance of the sample mean from an SRS with the same number of observations.

The gain in precision (over SRS) will be larger when the \(\bar{y}_{hU}\) differ from \(\bar{y}_U\) more
- i.e., when stratum means/proportions are more different

Fact: when \(Y\) comes from a mixture of two distributions, \(V(Y)\) is a weighted average of the two variances \((\sigma^2_1,\sigma^2_2)\) plus an increase due to difference in the means \((\mu_1, \mu_2)\):

\[V(Y) = p_1 \sigma^2_1 + (1-p_1) \sigma^2_2 + p_1 (1-p_1) (\mu_1 - \mu_2)^2\]

where \(p_1\) = fraction of the population from first distribution

The further apart the means are, the bigger the variance of \(y\) if you ignore strata \((S^2)\)
Thus, stratification helps if the stratification variable(s) are related to the survey outcome \(\mathbf{y}\)

Mixture Distribution in a Picture

Nine histograms arranged in a 3×3 grid showing y values within two strata and overall and for different mean differences

Numerical Illustration

Equal-sized strata \((N_1=N_2)\), equal variance in each stratum, first-stratum mean fixed at \(\mu_1=0\), \(n=1000\), \(N\) large enough to ignore FPC

\(\mu_2\)	\(S_1^2\)	\(S_2^2\)	\(S^2\)	\(V_{srs}(\bar{y})\)	\(V_{prop}(\bar{y}_{str})\)	Ratio (SRS/Strat)
0	1	1	1	0.001	0.001	1
0.5	1	1	1.0625	0.0010625	0.001	1.0625
1	1	1	1.25	0.00125	0.001	1.25
1.5	1	1	1.5625	0.0015625	0.001	1.5625
2	1	1	2	0.002	0.001	2

Example calculations (2nd row of table above):

\(S^2 = \frac{N_1}{N}S_1^2 + \frac{N_2}{N}S_2^2 + \frac{N_1}{N}\frac{N_2}{N} (\mu_1-\mu_2)^2= 0.5(1) + 0.5(1) + (0.5)(0.5)(0-0.5)^2 = 1.0625\)

\(V_{srs}(\bar{y}) = \frac{S^2}{n} = \frac{1.0625}{1000} = 0.0010625\)

\(V_{prop}(\bar{y}_{str}) = \frac{N_1}{N}\frac{S_1^2}{n} + \frac{N_2}{N}\frac{S_2^2}{n} = 0.5\left(\frac{1}{1000}\right) + 0.5\left(\frac{1}{1000}\right) = 0.001\)
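The table can be reproduced with a few lines of Python. This is a sketch under the slide's assumptions (two equal-sized strata, \(S_1^2=S_2^2=1\), \(\mu_1=0\), \(n=1000\), FPC ignored); the function name `mixture_variance` is illustrative.

```python
def mixture_variance(W1, S2_1, S2_2, mu1, mu2):
    """Overall variance of a two-component mixture:
    weighted within-stratum variances plus the between-mean term."""
    return W1 * S2_1 + (1 - W1) * S2_2 + W1 * (1 - W1) * (mu1 - mu2) ** 2

n = 1000
W = [0.5, 0.5]        # equal-sized strata
S2_h = [1.0, 1.0]     # equal within-stratum variances

rows = []
for mu2 in [0, 0.5, 1, 1.5, 2]:
    S2 = mixture_variance(W[0], S2_h[0], S2_h[1], 0, mu2)
    v_srs = S2 / n                                      # FPC ignored
    v_prop = sum(w * s2 / n for w, s2 in zip(W, S2_h))  # FPC ignored
    rows.append((mu2, S2, v_srs, v_prop, v_srs / v_prop))

for row in rows:
    print(row)
```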

Numerical Illustration: Proportions

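What about proportions? Exact same story. Here \(p_1\) and \(p_2\) are the proportions in two equal-sized strata, the FPC is ignored, and the tabulated variances correspond to \(n=100\):

\(p_1\)	\(p_2\)	\(S_1^2\)	\(S_2^2\)	\(S^2\)	\(V_{srs}(\bar{y})\)	\(V_{prop}(\bar{y}_{str})\)	Ratio (SRS/Stratified)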
0.5	0.5	0.25	0.25	0.25	0.0025	0.0025	1
0.4	0.6	0.24	0.24	0.25	0.0025	0.0024	1.042
0.3	0.7	0.21	0.21	0.25	0.0025	0.0021	1.19
0.2	0.8	0.16	0.16	0.25	0.0025	0.0016	1.5625
0.1	0.9	0.09	0.09	0.25	0.0025	0.0009	2.78
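The proportions table can be reproduced the same way. This sketch assumes equal-sized strata, a large \(N\) (so \(S^2 \approx p(1-p)\) and the FPC is ignored), and \(n=100\), which is the value implied by \(S^2/V = 0.25/0.0025\) in the table rather than something stated on the slide.

```python
n = 100          # assumed: implied by 0.25 / 0.0025 in the table
W = [0.5, 0.5]   # equal-sized strata

table = []
for p1, p2 in [(0.5, 0.5), (0.4, 0.6), (0.3, 0.7), (0.2, 0.8), (0.1, 0.9)]:
    p = W[0] * p1 + W[1] * p2                 # overall proportion
    S2 = p * (1 - p)                          # overall variance, large N
    S2_h = [p1 * (1 - p1), p2 * (1 - p2)]     # within-stratum variances
    v_srs = S2 / n
    v_prop = sum(w * s2 / n for w, s2 in zip(W, S2_h))
    table.append((p1, p2, S2, v_srs, v_prop, v_srs / v_prop))

for row in table:
    print(row)
```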

These results are for the theoretical sampling variance (over repeated sampling)
In a single pair of samples (one SRS, one stratified) you might not see the benefit – especially if the stratified sample selected had large within-stratum sample variances
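A small simulation sketch of this point, using only the Python standard library. It assumes two equal-sized, effectively infinite strata with normal outcomes, means 0 and 2, unit variances, and \(n=100\); the seed, the number of replications, and the function names are arbitrary choices. Over repeated sampling the stratified mean has the smaller variance, yet in a noticeable share of individual pairs the SRS mean lands closer to the true mean.

```python
import random
from statistics import mean, variance

random.seed(7225)          # arbitrary seed for reproducibility
n, reps = 100, 2000
mu = [0.0, 2.0]            # stratum means; true overall mean is 1.0

def srs_mean():
    # SRS from the mixture: each unit falls in either stratum w.p. 0.5
    return mean(random.gauss(mu[random.randrange(2)], 1.0) for _ in range(n))

def stratified_mean():
    # Proportional allocation: n/2 units per stratum, stratum weights 0.5
    halves = [mean(random.gauss(m, 1.0) for _ in range(n // 2)) for m in mu]
    return 0.5 * halves[0] + 0.5 * halves[1]

srs_means = [srs_mean() for _ in range(reps)]
str_means = [stratified_mean() for _ in range(reps)]

var_srs = variance(srs_means)   # empirical variance over repeated samples
var_str = variance(str_means)

# Share of (SRS, stratified) pairs in which the SRS estimate is closer to 1.0
share_srs_closer = sum(
    abs(a - 1.0) < abs(b - 1.0) for a, b in zip(srs_means, str_means)
) / reps

print(var_srs, var_str, share_srs_closer)
```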

Also note that most surveys have more than one key outcome, \(y\)!
Compromise necessary to attempt to achieve improved precision across multiple \(y\)

Proportional Allocation vs. SRS: Theory

We can derive an expression that compares the variance under stratified sampling with proportional allocation to the variance under SRS, which proves the gain from stratification.
To do this, we will again lean on ideas you have seen before in ANOVA – sums of squares!

\[\begin{flalign} \text{Total Sum of Squares} &= \text{Between Strata} + \text{Within Strata} & \\ \text{\small \emph{each obs.\ to overall mean}} &= \text{\small \emph{stratum mean to overall mean}} + \text{\small \emph{each obs to its stratum mean}} \\ \sum_{h=1}^H \sum_{j=1}^{N_h} (y_{hj}-\bar{y}_U)^2 &= \sum_{h=1}^H \sum_{j=1}^{N_h} (\bar{y}_{hU}-\bar{y}_U)^2 + \sum_{h=1}^H \sum_{j=1}^{N_h} (y_{hj}-\bar{y}_{hU})^2 \\ \text{\textcolor{red}{by definition of }} & \textcolor{red}{S^2, S_h^2, \text{ and simplifying a bit:}} \\ (N-1) S^2 &= \sum_{h=1}^H N_h (\bar{y}_{hU}-\bar{y}_U)^2 + \sum_{h=1}^H (N_h-1) S_h^2\\ \frac{(N-1)}{N} S^2 &= \sum_{h=1}^H \frac{N_h}{N} (\bar{y}_{hU}-\bar{y}_U)^2 + \sum_{h=1}^H \frac{N_h-1}{N} S_h^2 \end{flalign}\]
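The sum-of-squares identity can be checked numerically on any small population. The sketch below uses a made-up population of 12 values in three strata; `statistics.variance` uses the \(N-1\) divisor, which matches the definitions of \(S^2\) and \(S_h^2\).

```python
from statistics import mean, variance

# Hypothetical finite population, grouped by stratum
strata = {
    "A": [2.0, 4.0, 6.0, 8.0],
    "B": [10.0, 12.0, 17.0],
    "C": [1.0, 1.5, 3.5, 4.0, 5.0],
}

all_y = [y for ys in strata.values() for y in ys]
N = len(all_y)
ybar_U = mean(all_y)

total_ss = (N - 1) * variance(all_y)                     # (N-1) S^2
between_ss = sum(len(ys) * (mean(ys) - ybar_U) ** 2      # sum N_h (ybar_hU - ybar_U)^2
                 for ys in strata.values())
within_ss = sum((len(ys) - 1) * variance(ys)             # sum (N_h - 1) S_h^2
                for ys in strata.values())

print(total_ss, between_ss + within_ss)
```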

Proportional Allocation vs. SRS: Theory

\[\begin{flalign} \text{\textcolor{red}{assuming }} \textcolor{red}{(N-1)~} & \textcolor{red}{\approx N \text{ and } (N_h-1) \approx N_h:} & \\ S^2 &= \sum_{h=1}^H \frac{N_h}{N} (\bar{y}_{hU}-\bar{y}_U)^2 + \sum_{h=1}^H \frac{N_h}{N} S_h^2\\ \text{\textcolor{red}{now make that left }} & \textcolor{red}{\text{side look like the SRS variance by multiplying both sides by } \left(1-\frac{n}{N}\right)\frac{1}{n}:}\\ \left(1-\frac{n}{N}\right)\frac{1}{n} S^2 &= \left(1-\frac{n}{N}\right)\frac{1}{n} \sum_{h=1}^H \frac{N_h}{N} (\bar{y}_{hU}-\bar{y}_U)^2 + \left(1-\frac{n}{N}\right)\frac{1}{n} \sum_{h=1}^H \frac{N_h}{N} S_h^2\\ \left(1-\frac{n}{N}\right)\frac{S^2}{n} &= \left(1-\frac{n}{N}\right)\frac{1}{n} \sum_{h=1}^H \frac{N_h}{N} (\bar{y}_{hU}-\bar{y}_U)^2 + \sum_{h=1}^H \left(1-\frac{n}{N}\right)\left(\frac{N_h}{N}\right) \frac{S_h^2}{n}\\ V_{srs}(\bar{y}) &= \underbrace{\left(1-\frac{n}{N}\right)\frac{1}{n} \sum_{h=1}^H \frac{N_h}{N} (\bar{y}_{hU}-\bar{y}_U)^2}_{\geq 0} +~V_{prop}(\bar{y}_{str}) \\ & \textcolor{blue}{\rightarrow V_{prop}(\bar{y}_{str}) \leq V_{srs}(\bar{y})} \end{flalign}\]

Neyman Allocation

Neyman Allocation = allocation that minimizes \(V(\bar{y}_{str})\), the variance of the overall mean of \(y\), for a fixed total sample size, \(n\) \[n_h = n \left( \frac{N_h S_h}{\sum_{l =1}^H N_l S_l} \right)\]

Number of sampled units in each stratum is proportional to the size of the stratum \((N_h)\) times the standard deviation of \(y\) in the stratum \((S_h)\)
- I.e., \(n_h \propto N_h S_h\)
Sample more of a stratum if it has a large within-stratum variance – to compensate for the heterogeneity
First determined by Alexander Chuprov (1923), rediscovered by Neyman (1934). Poor Chuprov!

Math/Stat Note: The derivation of the \(n_h\) formula under Neyman Allocation uses the method of Lagrange multipliers (calculus), which is outside the scope of this class. Interested students can refer to the additional handout on Carmen.
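A minimal Python sketch of the Neyman allocation formula, \(n_h \propto N_h S_h\). The stratum sizes, standard deviations, and \(n\) are hypothetical, and the function name `neyman_allocation` is illustrative. The returned values are fractional; rounding and checks such as \(2 \le n_h \le N_h\) are not handled here.

```python
def neyman_allocation(n, N_h, S_h):
    """n_h = n * N_h S_h / sum_l N_l S_l (not rounded)."""
    products = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
    total = sum(products)
    return [n * p / total for p in products]

# Hypothetical inputs
N_h = [1000, 3000, 6000]
S_h = [2.0, 3.0, 1.0]
n = 500

n_h = neyman_allocation(n, N_h, S_h)
print(n_h)
```

With these made-up values the middle stratum (largest \(S_h\)) receives more than its proportional share of 150, and the largest stratum (smallest \(S_h\)) receives less than its proportional share of 300.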

Neyman Allocation: Variance Formula

Ugly algebra to simplify the variance of the sample mean under Neyman Allocation: \[\begin{flalign} V_{Neyman}(\bar{y}_{str}) &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{n_h} - \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{N_h} & \\ &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{n \left( \frac{N_h S_h}{\sum_{l =1}^H N_l S_l} \right)} - \frac{1}{N^2} \sum_{h=1}^H N_h S_h^2\\ &= \frac{1}{n}\frac{1}{N^2} \sum_{h=1}^H N_h S_h \sum_{l =1}^H N_l S_l - \frac{1}{N^2} \sum_{h=1}^H N_h S_h^2 \\ &= \frac{1}{n}\frac{1}{N^2} \left(\sum_{h=1}^H N_h S_h\right)^2 - \frac{1}{N^2} \sum_{h=1}^H N_h S_h^2\\ &= \frac{1}{n} \left(\sum_{h=1}^H \frac{N_h}{N} S_h\right)^2 - \frac{1}{N} \sum_{h=1}^H \frac{N_h}{N} S_h^2 \end{flalign}\]

Activity 7.1 (Part 2)

Comparing Allocation Methods (Part 2)

Neyman Allocation vs. Proportional Allocation

If \(S_h\) are all equal and correctly specified, then Neyman allocation is the same as proportional allocation
If \(S_h\) are correctly specified, (theoretical) variance from Neyman allocation is always less than or equal to the (theoretical) variance from proportional allocation (proof next page)
- The gain in precision (over proportional allocation) will be larger when the \(S_h\) differ from \(\bar{S}\) more (i.e., when the stratum standard deviations are more different)

Combining with the result about proportional allocation vs. SRS, assuming the \(S_h\) are correctly specified, we have that: \[V_{Neyman}(\bar{y}_{str}) \le V_{prop}(\bar{y}_{str}) \le V_{srs}(\bar{y})\]
Remember that these are the theoretical variances…in practice the estimated variances from two samples from the same population might not have this property

However, note that:
- If \(S_h\) are poorly (inaccurately) specified, Neyman allocation can result in higher variance than SRS
- Thus, Neyman allocation is only used when we are confident in our estimates of \(S_h\)!
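A sketch of how badly specified \(S_h\) can backfire. Everything here is hypothetical: two equal-sized strata with equal means (so the SRS variance uses \(S^2 \approx \sum_h \frac{N_h}{N} S_h^2\)), true standard deviations of 1 and 5, and a planner who guesses them in the wrong order.

```python
def stratified_variance(N_h, n_h, S_h):
    """sum of (N_h/N)^2 (1 - n_h/N_h) S_h^2 / n_h, using the TRUE S_h."""
    N = sum(N_h)
    return sum((Nh / N) ** 2 * (1 - nh / Nh) * Sh ** 2 / nh
               for Nh, nh, Sh in zip(N_h, n_h, S_h))

def neyman_allocation(n, N_h, S_h):
    """n_h = n * N_h S_h / sum_l N_l S_l (not rounded)."""
    products = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
    total = sum(products)
    return [n * p / total for p in products]

N_h = [5000, 5000]
true_S = [1.0, 5.0]    # actual stratum standard deviations
guess_S = [5.0, 1.0]   # planner's (wrong) guesses
n = 100
N = sum(N_h)

n_prop = [n * Nh / N for Nh in N_h]
n_good = neyman_allocation(n, N_h, true_S)
n_bad = neyman_allocation(n, N_h, guess_S)

# Allocations differ, but the variance always depends on the true S_h
v_prop = stratified_variance(N_h, n_prop, true_S)
v_good = stratified_variance(N_h, n_good, true_S)
v_bad = stratified_variance(N_h, n_bad, true_S)

# SRS variance, assuming equal stratum means and large N
S2 = sum(Nh / N * Sh ** 2 for Nh, Sh in zip(N_h, true_S))
v_srs = (1 - n / N) * S2 / n

print(v_good, v_prop, v_srs, v_bad)
```

In this made-up setting the correctly specified Neyman allocation beats proportional allocation, while the misspecified one has roughly three times the variance of an SRS.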

Neyman Allocation vs. Proportional Allocation: Theory

\[\begin{flalign} V_{prop}(\bar{y}_{str}) &- V_{Neyman}(\bar{y}_{str}) & \\ &= \left[ \sum_{h=1}^H \left(\frac{N_h}{N}\right) \frac{S_h^2}{n} - \frac{1}{N} \sum_{h=1}^H \frac{N_h}{N} S_h^2\right] - \left[\frac{1}{n} \left(\sum_{h=1}^H \frac{N_h}{N} S_h\right)^2 - \frac{1}{N} \sum_{h=1}^H \frac{N_h}{N} S_h^2 \right] \\ &= \sum_{h=1}^H \left(\frac{N_h}{N}\right) \frac{S_h^2}{n} - \frac{1}{n} \left(\sum_{h=1}^H \frac{N_h}{N} S_h\right)^2 \\ &= \frac{1}{n} \left[ \sum_{h=1}^H \frac{N_h}{N} S_h^2 - \left(\sum_{h=1}^H \frac{N_h}{N} S_h\right)^2 \right]\\ & \textcolor{red}{\text{define } \bar{S} = \sum_{h=1}^H\frac{N_h}{N} S_h = \text{weighted average SD}} \\ &= \frac{1}{n} \left[ \sum_{h=1}^H \frac{N_h}{N} S_h^2 - \bar{S}^2 \right] = \ldots \text{algebra}\ldots = \frac{1}{n}\sum_{h=1}^H \frac{N_h}{N} (S_h - \bar{S})^2 \ge 0 \\ & \textcolor{blue}{\rightarrow V_{Neyman}(\bar{y}_{str}) \leq V_{prop}(\bar{y}_{str})} \end{flalign}\]
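The closed form for the difference can be verified numerically. This sketch computes both variances from the general stratified formula and compares their difference with \(\frac{1}{n}\sum_h \frac{N_h}{N}(S_h-\bar{S})^2\); the stratum sizes, standard deviations, and \(n\) are hypothetical.

```python
def stratified_variance(N_h, n_h, S_h):
    """sum of (N_h/N)^2 (1 - n_h/N_h) S_h^2 / n_h."""
    N = sum(N_h)
    return sum((Nh / N) ** 2 * (1 - nh / Nh) * Sh ** 2 / nh
               for Nh, nh, Sh in zip(N_h, n_h, S_h))

# Hypothetical inputs
N_h = [1000, 3000, 6000]
S_h = [2.0, 3.0, 1.0]
n = 500

N = sum(N_h)
W = [Nh / N for Nh in N_h]                       # stratum weights N_h / N

n_prop = [n * w for w in W]                      # proportional allocation
total = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
n_ney = [n * Nh * Sh / total for Nh, Sh in zip(N_h, S_h)]  # Neyman allocation

diff = stratified_variance(N_h, n_prop, S_h) - stratified_variance(N_h, n_ney, S_h)

S_bar = sum(w * s for w, s in zip(W, S_h))       # weighted average SD
closed_form = sum(w * (s - S_bar) ** 2 for w, s in zip(W, S_h)) / n

print(diff, closed_form)
```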

Optimal Allocation

Optimal Allocation = allocation that minimizes \(V(\bar{y}_{str})\), the variance of the overall mean of \(y\), for a fixed total cost, when the cost per sampled unit, \(c_h\), may differ by stratum. The total sample size, \(n\), is split across the strata as \[n_h = n \left( \frac{\frac{N_h S_h}{\sqrt{c_h}}}{\sum_{l =1}^H \frac{N_l S_l}{\sqrt{c_l}}} \right)\]

Number of sampled units in each stratum is proportional to the size of the stratum \((N_h)\), the standard deviation of \(y\) in the stratum \((S_h)\), and the reciprocal of the square root of the cost per unit of obtaining \(y\) in each stratum \((c_h)\)
Fixed total cost \(\displaystyle c = c_0 + \sum_{h=1}^H c_h n_h\), where \(c_0\) = baseline costs
Sample more of a stratum if it has a large within-stratum variance and if it is inexpensive
If costs per unit \((c_h)\) are same in all strata, optimal allocation = Neyman allocation
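A minimal Python sketch of the optimal allocation formula, \(n_h \propto N_h S_h/\sqrt{c_h}\), together with the resulting total cost \(c = c_0 + \sum_h c_h n_h\). All inputs (stratum sizes, standard deviations, per-unit costs, baseline cost, and \(n\)) are hypothetical, and the function name `optimal_allocation` is illustrative.

```python
from math import sqrt

def optimal_allocation(n, N_h, S_h, c_h):
    """n_h = n * (N_h S_h / sqrt(c_h)) / sum_l (N_l S_l / sqrt(c_l)), not rounded."""
    terms = [Nh * Sh / sqrt(ch) for Nh, Sh, ch in zip(N_h, S_h, c_h)]
    total = sum(terms)
    return [n * t / total for t in terms]

# Hypothetical inputs: the middle stratum is 4x as expensive per unit
N_h = [1000, 3000, 6000]
S_h = [2.0, 3.0, 1.0]
c_h = [1.0, 4.0, 1.0]
c_0 = 500.0
n = 500

n_opt = optimal_allocation(n, N_h, S_h, c_h)
total_cost = c_0 + sum(ch * nh for ch, nh in zip(c_h, n_opt))

# With equal per-unit costs the result matches Neyman allocation
n_equal_cost = optimal_allocation(n, N_h, S_h, [1.0, 1.0, 1.0])

print(n_opt, total_cost, n_equal_cost)
```

Compared with the equal-cost (Neyman) split, the expensive middle stratum receives fewer units here even though it has the largest \(S_h\).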

Optimal Allocation

For completeness, here is the “simplified” expression for the theoretical variance of the overall mean under optimal allocation:

\[\begin{aligned} V_{opt}(\bar{y}_{str}) &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} \\ &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{n_h} - \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{N_h} \\ &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{n \left( \frac{\frac{N_h S_h}{\sqrt{c_h}}}{\sum_{l =1}^H \frac{N_l S_l}{\sqrt{c_l}}} \right)} - \frac{1}{N^2} \sum_{h=1}^H N_h S_h^2\\ &= \frac{1}{n}\frac{1}{N^2} \sum_{h=1}^H N_h S_h \sqrt{c_h} \sum_{l =1}^H \frac{N_l S_l}{\sqrt{c_l}} - \frac{1}{N^2} \sum_{h=1}^H N_h S_h^2 \end{aligned}\]

(Honestly, we will never use this formula)
