Stratified Sampling

PUBHBIO 7225 Lecture 6

Outline

Topics

Stratified Sampling
Design-based Theory for Stratified Samples
Sampling Weights

Activities

6.1 Construct a Stratified Sample Estimator

Assignments

Problem Set 2 due Thursday 9/18/2025 11:59pm via Carmen

What is Stratified Sampling?

The process:

STEP 1: Divide the population into \(H\) subpopulations, called strata

Strata do not overlap (disjoint/mutually exclusive)
Each sampling unit belongs to exactly one stratum (collectively exhaustive)
Strata are identified before sampling (i.e., available on the frame)

STEP 2: Draw an independent probability sample from each stratum (could be an SRS, or other design), then pool the information to obtain overall population estimates

Observations within strata tend to be more homogeneous than observations in the population as a whole
Reduced variance within strata often leads to a reduced variance for the population-level estimate
Thus we want the stratification variable(s) to be related to the \(y\) variable(s) of interest – want to get different means of \(y\) across strata

Throughout this lecture we will assume an SRS is taken in each stratum

Why Stratify?

Reasons to stratify include:

Protect against (small) chance of getting a really bad sample
- Example: sampling \(n=200\) OSU undergraduates
- If taking an SRS, you might (by chance) get 0 students from outside Ohio
- Stratifying by in-state/out-of-state guarantees this doesn’t happen
Want to ensure specific precision for certain subgroups
- Example: want to make sure precision is similar for in-state and out-of-state students, even though number in the population differs
May be a cost benefit (easier to administer survey in some strata)
Lower variance for overall estimates (e.g., estimates for \(\bar{y}_U\))
- Variance of \(y\) within each stratum is often lower than the variance of \(y\) in the whole population
- Can lead to cost savings because you can take a smaller sample

Activity 6.1 (Part 1)

Construct a Stratified Sample Estimator (Part 1)

Constructing an Estimator from a Stratified Sample

You took a sample of size \(n=15\) from a population of size \(N=35\)
But each of the 15 sampled units doesn’t “represent” the same number of population units

Stratum	Units in Population	Units in Sample	Each Sampled Unit “Represents”:
Stratum A	15	5	15/5 = 3
Stratum B	11	5	11/5 = 2.2
Stratum C	9	5	9/5 = 1.8

This sample is not EPSEM – simple unweighted average of the sampled \(y\) will be biased
Larger strata are “under-represented” in the sample
- 43% of the population comes from Stratum A (15/35), but
- 33% of the sample comes from Stratum A (5/15)!
We need to “upweight” the units that “represent” more units

Notation for Stratified Sampling

\(U\) = the finite population
\(N\) = size of the population
\(H\) = number of strata
\(N_h\) = number of units in stratum \(h\) in the population, with \(\displaystyle \sum_{h=1}^H N_h = N\)
\(\mathcal{S}_h\) = a particular sample in stratum \(h\), with \(\displaystyle \bigcup_{h=1}^H \mathcal{S}_h = \mathcal{S}\)
\(n_h\) = size of the sample in stratum \(h\), with \(\displaystyle \sum_{h=1}^H n_h = n\)
\(y_{hj}\) = characteristic of interest for the \(j\)th unit in the \(h\)th stratum

Example (from Activity)

\(U\) = population (collection of all the units)
\(N=35\) (total number of units in the population)
\(H=3\) strata (A, B, C)

Stratum	Population Size	Sample Size
A	\(N_1 = 15\)	\(n_1=5\)
B	\(N_2 = 11\)	\(n_2=5\)
C	\(N_3 = 9\)	\(n_3=5\)
Total	\(N=N_1+N_2+N_3=35\)	\(n=n_1+n_2+n_3=15\)

Examples of \(y_{hj}\):
- value of \(y\) for unit \(j=2\) in stratum \(h=1\) (stratum A) → \(y_{12} = 9\)
- value of \(y\) for unit \(j=6\) in stratum \(h=2\) (stratum B) → \(y_{26} = 6\)
- value of \(y\) for unit \(j=1\) in stratum \(h=3\) (stratum C) → \(y_{31} = 11\)

Estimates (Step 1): Stratum-Specific Estimates

Quantity	Estimand (Truth)	Estimate
Stratum Mean	\(\displaystyle \bar{y}_{hU}=\frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj}\)	\(\displaystyle \bar{y}_h = \frac{1}{n_h} \sum_{j\in \mathcal{S}_h} y_{hj}\)
Stratum Total	\(\displaystyle t_h = \sum_{j=1}^{N_h} y_{hj}= N_h \bar{y}_{hU}\)	\(\displaystyle \hat{t}_h = N_h \bar{y}_h = \frac{N_h}{n_h} \sum_{j\in \mathcal{S}_h} y_{hj}\)
Stratum Variance (of \(y\))	\(\displaystyle S_h^2=\frac{1}{N_h-1} \sum_{j=1}^{N_h} (y_{hj}- \bar{y}_{hU})^2\)	\(\displaystyle s_h^2=\frac{1}{n_h-1} \sum_{j\in \mathcal{S}_h} (y_{hj}- \bar{y}_h)^2\)
Stratum Proportion	\(\displaystyle p_{hU}=\bar{y}_{hU}=\frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj}\)	\(\displaystyle \hat{p}_h = \frac{1}{n_h} \sum_{j\in \mathcal{S}_h} y_{hj}\)
Stratum Variance (of binary \(y\))	\(\displaystyle S_h^2 = \frac{N_h}{N_h-1} p_{hU}(1-p_{hU})\)	\(\displaystyle s_h^2 = \frac{n_h}{n_h-1} \hat{p}_{h}(1-\hat{p}_{h})\)

These are just the SRS formulae, with an extra \(h\) subscript (and \(j\) instead of \(i\) for unit)
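As a quick check on the last two rows of the table, the sketch below confirms that the shortcut formula for a binary \(y\) gives the same \(s_h^2\) as the general sample variance formula. The 0/1 responses are made up for illustration and are not activity data.

```python
from statistics import mean, variance

# Hypothetical 0/1 responses from one stratum (illustrative, not activity data)
y_h = [1, 0, 1, 1, 0]
n_h = len(y_h)

# The stratum proportion is the stratum sample mean of a 0/1 variable
p_hat = mean(y_h)

# General sample variance (divides by n_h - 1)
s2_general = variance(y_h)

# Shortcut for binary y: n_h / (n_h - 1) * p_hat * (1 - p_hat)
s2_binary = n_h / (n_h - 1) * p_hat * (1 - p_hat)

print(p_hat, s2_general, s2_binary)
```

The two variance values agree up to floating-point rounding, which is why the proportion rows need no separate theory.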

Example (from Activity)

Suppose my activity sample was:

A table with data from three strata (A, B, and C). Rows with yellow highlighting indicate a stratified random sample. Stratum A has a sample of 5 rows, Stratum B has a sample of 5, and Stratum C has a sample of 5.

We can then calculate/estimate:

Stratum A \((h=1)\): \(N_1 = 15, n_1=5\)
Stratum A population mean: \[\begin{aligned} \bar{y}_{1U} &=\frac{1}{N_1} \sum_{j=1}^{N_1} y_{1j} = \frac{1}{15} \left(6+9+3+9+\cdots+5\right) = \textcolor{red}{\bf 4.4} \end{aligned}\]

Stratum A sample mean: \[\begin{aligned} \bar{y}_1 &= \frac{1}{n_1} \sum_{j\in \mathcal{S}_1} y_{1j} = \frac{1}{5} \left(9+10+1+3+5\right) = \textcolor{blue}{\bf 5.6} \end{aligned}\]

My estimate of the mean from my sample is 5.6; close to the true mean of 4.4

Not a new formula – just the SRS formulae

Example (from Activity)

Stratum A population variance of \(y\):

\[\begin{aligned} S_h^2&=\frac{1}{N_h-1} \sum_{j=1}^{N_h} (y_{hj}- \bar{y}_{hU})^2 \\ S_1^2&=\frac{1}{N_1-1} \sum_{j=1}^{N_1} (y_{1j} - \bar{y}_{1U})^2 \\ &= \frac{1}{15-1} \sum_{j=1}^{15} (y_{1j} - 4.4)^2 \\ & = \frac{1}{15-1} \left[ (6-4.4)^2 + \cdots + (5-4.4)^2 \right] \\ &= \textcolor{red}{\bf 10.83} \end{aligned}\]

Stratum A sample variance of \(y\):

\[\begin{aligned} s_h^2&=\frac{1}{n_h-1} \sum_{j\in \mathcal{S}_h} (y_{hj}- \bar{y}_h)^2 \\ s_1^2&=\frac{1}{n_1-1} \sum_{j\in \mathcal{S}_1} (y_{1j} - \bar{y}_1)^2 \\ &= \frac{1}{5-1} \sum_{j\in \mathcal{S}_1} (y_{1j} - 5.6)^2 \\ &= \frac{1}{5-1}\left[ (9-5.6)^2 + \cdots + (5-5.6)^2 \right] \\ &= \textcolor{blue}{\bf 14.8} \end{aligned}\]

My estimate of the stratum variance \(S_1^2\) from my sample is 14.8; close-ish to the true variance of 10.83

Again, not a new formula – just the SRS formulae
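The Stratum A calculations above can be reproduced in a few lines. This is a minimal Python sketch using the five sampled values listed in the worked example; the variable names are illustrative choices.

```python
from statistics import mean, variance

# Stratum A sample values as listed in the worked example
y_A = [9, 10, 1, 3, 5]
N_A = 15          # stratum population size
n_A = len(y_A)    # stratum sample size

ybar_A = mean(y_A)        # stratum sample mean
s2_A = variance(y_A)      # stratum sample variance (n_A - 1 denominator)
t_hat_A = N_A * ybar_A    # estimated stratum total, N_h * ybar_h

print(f"mean = {ybar_A:.1f}, variance = {s2_A:.1f}, total = {t_hat_A:.1f}")
```

The same three lines of arithmetic apply to Strata B and C once their sampled values are plugged in.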

Example (from Activity)

Similar calculations for the other 2 strata yield:

Means:

Stratum	Population	My Sample
Stratum A \((h=1)\)	\(\bar{y}_{1U} = 4.4\)	\(\bar{y}_1 = 5.6\)
Stratum B \((h=2)\)	\(\bar{y}_{2U} = 8.36\)	\(\bar{y}_2 = 7.8\)
Stratum C \((h=3)\)	\(\bar{y}_{3U} = 11.22\)	\(\bar{y}_3 = 11.8\)

Variances: (variance of \(y\), not of the mean of \(y\))

Stratum	Population	My Sample
Stratum A \((h=1)\)	\(S_1^2 = 10.83\)	\(s_1^2 = 14.8\)
Stratum B \((h=2)\)	\(S_2^2 = 6.45\)	\(s_2^2 = 6.7\)
Stratum C \((h=3)\)	\(S_3^2 = 7.44\)	\(s_3^2 = 6.7\)

The next step is to combine the stratum estimates to get an overall estimate (of the mean)

Activity 6.1 (Part 2)

Construct a Stratified Sample Estimator (Part 2)

Estimates (Step 2): Combining Stratum Estimates

Then combine stratum-specific estimates to get overall estimates:

Quantity	Estimand (Truth)	Estimate
Overall Total	\(\displaystyle t = \sum_{h=1}^H \sum_{j=1}^{N_h} y_{hj}= \sum_{h=1}^H t_h = \sum_{h=1}^H N_h \bar{y}_{hU}\)	\(\displaystyle \hat{t}_{str}= \sum_{h=1}^H \hat{t}_h= \sum_{h=1}^H N_h \bar{y}_h\)
Overall Mean	\(\displaystyle \bar{y}_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^H N_h \bar{y}_{hU} = \sum_{h=1}^H \frac{N_h}{N} \bar{y}_{hU}\)	\(\displaystyle \bar{y}_{str}= \sum_{h=1}^H \frac{N_h}{N} \bar{y}_h\)
Overall Proportion	\(\displaystyle p = \bar{y}_U = \sum_{h=1}^H \frac{N_h}{N} p_{hU}\)	\(\displaystyle \hat{p}_{str}= \sum_{h=1}^H \frac{N_h}{N} \hat{p}_h\)

Intuition:

Total: sum up the stratum totals
Means/Proportions: weighted average of stratum quantities, weighting by the proportion of the population in each stratum

Example (from Activity)

Stratum	\(N_h\)	\(n_h\)	Population Mean	Population Variance	Sample Mean	Sample Variance
Stratum A	\(N_1=15\)	\(n_1=5\)	\(\bar{y}_{1U} = 4.4\)	\(S_1^2 = 10.83\)	\(\bar{y}_1 = 5.6\)	\(s_1^2 = 14.8\)
Stratum B	\(N_2=11\)	\(n_2=5\)	\(\bar{y}_{2U} = 8.36\)	\(S_2^2 = 6.45\)	\(\bar{y}_2 = 7.8\)	\(s_2^2 = 6.7\)
Stratum C	\(N_3=9\)	\(n_3=5\)	\(\bar{y}_{3U} = 11.22\)	\(S_3^2 = 7.44\)	\(\bar{y}_3 = 11.8\)	\(s_3^2 = 6.7\)

Overall population mean: \[\bar{y}_U = \sum_{h=1}^H \frac{N_h}{N} \bar{y}_{hU} = \frac{N_1}{N} \bar{y}_{1U} + \frac{N_2}{N} \bar{y}_{2U} + \frac{N_3}{N} \bar{y}_{3U} = \frac{15}{35} (4.4) + \frac{11}{35} (8.36) + \frac{9}{35} (11.22) = \textcolor{red}{\bf 7.4}\]

My sample estimate of overall mean: \[\begin{aligned} \bar{y}_{str}= \sum_{h=1}^H \frac{N_h}{N} \bar{y}_h &= \frac{N_1}{N} \bar{y}_1 + \frac{N_2}{N} \bar{y}_2 + \frac{N_3}{N} \bar{y}_3 = \frac{15}{35} (5.6) + \frac{11}{35} (7.8) + \frac{9}{35} (11.8) = \textcolor{blue}{\bf 7.89} \\ \end{aligned}\]
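The combination step can be sketched in Python as below, using the stratum sizes and sample means from the activity table. The unweighted average is included only for contrast, to show how far off it lands when the design is not EPSEM.

```python
# Stratum population sizes and sample means from the activity table
N_h = {"A": 15, "B": 11, "C": 9}
ybar_h = {"A": 5.6, "B": 7.8, "C": 11.8}

N = sum(N_h.values())

# Estimated total: sum of the estimated stratum totals N_h * ybar_h
t_hat_str = sum(N_h[h] * ybar_h[h] for h in N_h)

# Estimated mean: stratum means weighted by population shares N_h / N
ybar_str = sum(N_h[h] / N * ybar_h[h] for h in N_h)

# Naive unweighted average of the stratum sample means, for contrast
# (with n_h = 5 in every stratum this equals the unweighted mean of all 15 units)
ybar_naive = sum(ybar_h.values()) / len(ybar_h)

print(f"t_hat_str = {t_hat_str:.1f}")
print(f"ybar_str = {ybar_str:.2f}, unweighted = {ybar_naive:.2f}")
```

The unweighted average overshoots because Stratum C, which has the largest mean, is over-represented in the sample relative to its population share.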

Expectation and Variance of the Estimated Mean

Since we are taking an SRS within each stratum, stratum-specific estimates are unbiased: \[E(\bar{y}_h) = \bar{y}_{hU} \quad\quad\quad E(\hat{t}_h) = t_h \quad\quad\quad E(\hat{p}_h) = p_{hU}\]
This implies that: \[\begin{aligned} E(\bar{y}_{str}) &= E\left(\sum_{h=1}^H \frac{N_h}{N} \bar{y}_h \right) = \sum_{h=1}^H \frac{N_h}{N} E(\bar{y}_h) = \sum_{h=1}^H \frac{N_h}{N} \bar{y}_{hU} = \bar{y}_U \end{aligned}\]
Samples are taken independently in each stratum.
Remember that if \(X\) and \(Y\) are independent, \(V(X+Y) = V(X)+V(Y)\). Thus: \[\begin{aligned} V(\bar{y}_{str}) &= V\left(\sum_{h=1}^H \frac{N_h}{N} \bar{y}_h \right) = \sum_{h=1}^H V \left( \frac{N_h}{N} \bar{y}_h \right) = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 V(\bar{y}_h) = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \end{aligned}\]

Estimators are Random Variables

Estimate	Expected Value	Variance	Formula
Mean: \(\bar{y}_{str}\)	\(E(\bar{y}_{str})= \bar{y}_U\)	Truth:	\(\displaystyle V(\bar{y}_{str})= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\)
		Estimate:	\(\displaystyle \widehat{V}(\bar{y}_{str})= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\)
Total: \(\hat{t}_{str}\)	\(E(\hat{t}_{str})=t\)	Truth:	\(\displaystyle V(\hat{t}_{str})=\sum_{h=1}^H N_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\)
		Estimate:	\(\displaystyle \widehat{V}(\hat{t}_{str})=\sum_{h=1}^H N_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\)
Proportion: \(\hat{p}_{str}\)	\(E(\hat{p}_{str})=p\)	Truth:	\(\displaystyle V(\hat{p}_{str})= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(\frac{N_h-n_h}{N_h-1}\right) \frac{p_{hU}(1-p_{hU})}{n_h}\)
		Estimate:	\(\displaystyle \widehat{V}(\hat{p}_{str}) = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{\hat{p}_h(1-\hat{p}_h)}{n_h-1}\)

Example (from Activity)

Stratum	\(N_h\)	\(n_h\)	Population Mean	Population Variance	Sample Mean	Sample Variance
Stratum A	\(N_1=15\)	\(n_1=5\)	\(\bar{y}_{1U} = 4.4\)	\(S_1^2 = 10.83\)	\(\bar{y}_1 = 5.6\)	\(s_1^2 = 14.8\)
Stratum B	\(N_2=11\)	\(n_2=5\)	\(\bar{y}_{2U} = 8.36\)	\(S_2^2 = 6.45\)	\(\bar{y}_2 = 7.8\)	\(s_2^2 = 6.7\)
Stratum C	\(N_3=9\)	\(n_3=5\)	\(\bar{y}_{3U} = 11.22\)	\(S_3^2 = 7.44\)	\(\bar{y}_3 = 11.8\)	\(s_3^2 = 6.7\)

True sampling variance of \(\bar{y}_{str}\): \[\begin{aligned} V(\bar{y}_{str}) &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} \\ & = \left(\frac{15}{35}\right)^2 \left(1 - \frac{5}{15} \right) \frac{10.83}{5} + \left(\frac{11}{35}\right)^2 \left(1 - \frac{5}{11} \right) \frac{6.45}{5} + \left(\frac{9}{35}\right)^2 \left(1 - \frac{5}{9} \right) \frac{7.44}{5} = \textcolor{red}{0.378} \end{aligned}\]

Example (from Activity)

Stratum	\(N_h\)	\(n_h\)	Population Mean	Population Variance	Sample Mean	Sample Variance
Stratum A	\(N_1=15\)	\(n_1=5\)	\(\bar{y}_{1U} = 4.4\)	\(S_1^2 = 10.83\)	\(\bar{y}_1 = 5.6\)	\(s_1^2 = 14.8\)
Stratum B	\(N_2=11\)	\(n_2=5\)	\(\bar{y}_{2U} = 8.36\)	\(S_2^2 = 6.45\)	\(\bar{y}_2 = 7.8\)	\(s_2^2 = 6.7\)
Stratum C	\(N_3=9\)	\(n_3=5\)	\(\bar{y}_{3U} = 11.22\)	\(S_3^2 = 7.44\)	\(\bar{y}_3 = 11.8\)	\(s_3^2 = 6.7\)

Estimate of sampling variance of \(\bar{y}_{str}\): \[\begin{aligned} \widehat{V}(\bar{y}_{str}) &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h} \\ & = \left(\frac{15}{35}\right)^2 \left(1 - \frac{5}{15} \right) \frac{14.8}{5} + \left(\frac{11}{35}\right)^2 \left(1 - \frac{5}{11} \right) \frac{6.7}{5} + \left(\frac{9}{35}\right)^2 \left(1 - \frac{5}{9} \right) \frac{6.7}{5} = \textcolor{blue}{0.474} \end{aligned}\]
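Both variance calculations follow the same pattern, so one small helper covers them. The Python sketch below plugs in the activity values; the function name `var_mean_str` and the dictionary layout are illustrative choices, not part of the lecture.

```python
# Activity values per stratum: population size N, sample size n,
# population variance S2 and sample variance s2
strata = {
    "A": {"N": 15, "n": 5, "S2": 10.83, "s2": 14.8},
    "B": {"N": 11, "n": 5, "S2": 6.45, "s2": 6.7},
    "C": {"N": 9, "n": 5, "S2": 7.44, "s2": 6.7},
}
N = sum(s["N"] for s in strata.values())


def var_mean_str(strata, N, key):
    """Sum over strata of (N_h/N)^2 * (1 - n_h/N_h) * variance_h / n_h."""
    return sum(
        (s["N"] / N) ** 2 * (1 - s["n"] / s["N"]) * s[key] / s["n"]
        for s in strata.values()
    )


v_true = var_mean_str(strata, N, "S2")   # uses population variances S_h^2
v_hat = var_mean_str(strata, N, "s2")    # uses sample variances s_h^2
se_hat = v_hat ** 0.5                    # estimated standard error of ybar_str

# Variance of the estimated total differs only by the factor N^2
v_hat_total = N ** 2 * v_hat

print(f"V = {v_true:.3f}, V-hat = {v_hat:.3f}, SE = {se_hat:.3f}")
print(f"V-hat of total = {v_hat_total:.1f}")
```

Multiplying by \(N^2\) works because \(\hat{t}_{str} = N \bar{y}_{str}\), which matches the total-variance formula in the table term by term.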

Activity 6.1 (Part 3)

Construct a Stratified Sample Estimator (Part 3)

Simulated Data from Activity

I repeated the activity 10,000 times (using R!), taking two types of samples:
1. SRS of size \(n=15\)
- \(\bar{y}=\) unweighted average of the 15 sampled units (SRS estimator)
2. Stratified sample of \(n_h=5\) in each stratum (total of 15 units sampled)
- \(\bar{y}_{str}=\) weighted average of the stratum means (stratified sample estimator)
Resulting estimates of \(\bar{y}_U\) (averaged over replicates):

Two histograms, one on top of the other, showing the normal-shaped distributions of the mean from an SRS (top) and the mean from a stratified sample (bottom)

Method	Mean	Variance of Mean
SRS \((\bar{y})\)	7.3991	0.6137
Stratified \((\bar{y}_{str})\)	7.4044	0.3805
Truth \((\bar{y}_U)\)	7.4

Stratified estimator: unbiased and more precise!
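A simulation of this kind can be sketched in Python using only the standard library. The population below is hypothetical (the full activity data are not reproduced here), so the numbers it prints will not match the table above; the seed, the number of replicates, and all variable names are likewise assumptions made for illustration.

```python
import random
from statistics import mean, variance

# Hypothetical population (NOT the activity data): three strata with different means
population = {
    "A": [2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 8],
    "B": [6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 11],
    "C": [9, 10, 10, 11, 11, 12, 12, 13, 14],
}
all_units = [y for values in population.values() for y in values]
N = len(all_units)
true_mean = mean(all_units)

n_h = 5                      # units sampled per stratum
n = n_h * len(population)    # total sample size for both designs
reps = 5000                  # number of simulated samples per design
rng = random.Random(7225)    # fixed seed so the run is reproducible

srs_means, str_means = [], []
for _ in range(reps):
    # SRS of n units without replacement; estimator is the unweighted mean
    srs_means.append(mean(rng.sample(all_units, n)))
    # Independent SRS of n_h units in each stratum; estimator weights by N_h / N
    str_means.append(
        sum(len(values) / N * mean(rng.sample(values, n_h))
            for values in population.values())
    )

# Design-based variance of the stratified mean from the formula
theory_var_str = sum(
    (len(v) / N) ** 2 * (1 - n_h / len(v)) * variance(v) / n_h
    for v in population.values()
)

print(f"true mean        = {true_mean:.3f}")
print(f"SRS:        mean = {mean(srs_means):.3f}, var = {variance(srs_means):.3f}")
print(f"stratified: mean = {mean(str_means):.3f}, var = {variance(str_means):.3f}")
print(f"formula var (stratified) = {theory_var_str:.3f}")
```

Under these assumptions both estimators should average close to the true mean, while the stratified estimator's simulated variance should be smaller and close to the value given by the formula.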

Confidence Intervals for Stratified Samples

If the stratum sample sizes (\(n_h\)) are all large, or the number of strata (\(H\)) is large, then an approximate \(100(1-\alpha)\%\) confidence interval for the true mean \(\bar{y}_U\) is given by: \[\left( \bar{y}_{str}- z_{\alpha/2} \sqrt{\widehat{V}(\bar{y}_{str})}, \quad \bar{y}_{str}+ z_{\alpha/2} \sqrt{\widehat{V}(\bar{y}_{str})} \right)\] where \(z_{\alpha/2}\) is the \(100(1-\alpha/2)\)th percentile of the standard normal distribution
In practice, often use \({t}\) distribution with DF = \(n-H\) instead of \(z_{\alpha/2}\).
- \(\textcolor{blue}{DF = n-H}\) should remind you of ANOVA
  - One-way ANOVA: denominator DF = \(N-k\) = # observations \(-\) # groups
  - Stratified Sampling: DF = \(n-H\) = # observations \(-\) # strata
- We lose a DF for each group/stratum mean we estimate
- You may see this DF “rule” written as: DF = # of PSUs – # of Strata
  - PSU = Primary Sampling Unit (which so far has been the observation unit)

Selection Probabilities and Sampling Weights

When an SRS is taken in each stratum:

Selection (Inclusion) Probabilities

Stratum \(h\) has \(N_h\) units
Take an SRS of \(n_h\) units
Thus, \(P(\)unit \(j\) in stratum \(h\) selected\() = \textcolor{red}{\pi_{hj} = \frac{\text{Sample size in the stratum}}{\text{Population size in the stratum}} = \frac{n_h}{N_h}}\)

Sampling Weights

Sample weight is the inverse of the selection (inclusion) probability
Thus, sample weight for unit \(j\) in stratum \(h\) = \(\textcolor{blue}{w_{hj} = \frac{1}{\pi_{hj}}=\frac{N_h}{n_h}}\)

Note also that the sum of the weights for sampled units = population size, \(N\): \[\sum_{i \in \mathcal{S}} w_i = \sum_{h=1}^H \sum_{j\in \mathcal{S}_h} w_{hj} = \sum_{h=1}^H \sum_{j\in \mathcal{S}_h} \frac{N_h}{n_h} = \sum_{h=1}^H n_h \frac{N_h}{n_h} = \sum_{h=1}^H N_h = N\]
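The selection probabilities and weights for the activity design can be computed directly. This minimal Python sketch uses the activity's stratum sizes and checks that the weights of the 15 sampled units add up to \(N = 35\).

```python
import math

N_h = {"A": 15, "B": 11, "C": 9}   # stratum population sizes (activity)
n_h = {"A": 5, "B": 5, "C": 5}     # stratum sample sizes (activity)

# Inclusion probability n_h / N_h and sampling weight N_h / n_h for each stratum
pi_h = {h: n_h[h] / N_h[h] for h in N_h}
w_h = {h: N_h[h] / n_h[h] for h in N_h}

# One weight per sampled unit: n_h copies of w_h in each stratum
unit_weights = [w_h[h] for h in N_h for _ in range(n_h[h])]

print(w_h)
print(sum(unit_weights))
```

The sum may differ from 35 in the last decimal place because of floating-point rounding, which is why a tolerance-based comparison such as `math.isclose` is the safer check.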

Horvitz-Thompson Estimators

We can re-write the estimate of the (overall) total using the weights: \[\begin{aligned} \hat{t}_{str}&= \sum_{h=1}^H N_h \textcolor{myGreen}{\bar{y}_h} = \sum_{h=1}^H N_h \textcolor{myGreen}{\frac{1}{n_h} \sum_{j\in \mathcal{S}_h} y_{hj}} = \sum_{h=1}^H \sum_{j\in \mathcal{S}_h} \textcolor{red}{\frac{N_h}{n_h}} y_{hj}= \sum_{h=1}^H \sum_{j\in \mathcal{S}_h} \textcolor{red}{w_{hj}} y_{hj}\\ \text{(re-index) } &= \sum_{i \in \mathcal{S}} \textcolor{red}{w_i} y_i \end{aligned}\]
This is the Horvitz-Thompson estimator (of the total)!
Similarly, the estimate of the (overall) mean written using the weights: \[\bar{y}_{str}= \sum_{h=1}^H \frac{N_h}{N} \bar{y}_h = \frac{\hat{t}_{str}}{N} = \frac{\sum_{i \in \mathcal{S}} w_i y_i}{\sum_{i \in \mathcal{S}} w_i}\]
This is the Horvitz-Thompson estimator of the mean

(not something new, just showing that the stratified sampling formulas to estimate the mean and total are the Horvitz-Thompson estimators)
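The weighted form can be checked on unit-level data. In the Python sketch below, the Stratum A values come from the worked example; the Stratum B and C values are hypothetical, chosen only so that their means equal the reported 7.8 and 11.8 (their variances are not matched to the activity).

```python
import math

# (stratum, y) pairs. Stratum A values are from the worked example; the B and C
# values are hypothetical, chosen only so their means equal 7.8 and 11.8.
sample = (
    [("A", y) for y in [9, 10, 1, 3, 5]]
    + [("B", y) for y in [5, 7, 8, 9, 10]]
    + [("C", y) for y in [9, 11, 12, 13, 14]]
)
N_h = {"A": 15, "B": 11, "C": 9}
n_h = {"A": 5, "B": 5, "C": 5}

# Attach the weight w_i = N_h / n_h to each sampled unit
weighted = [(N_h[h] / n_h[h], y) for h, y in sample]

t_hat = sum(w * y for w, y in weighted)          # sum of w_i * y_i
ybar_hat = t_hat / sum(w for w, _ in weighted)   # sum of w_i * y_i over sum of w_i

print(f"t_hat = {t_hat:.1f}, ybar_hat = {ybar_hat:.2f}")
```

Because only the stratum means enter the estimator, these weighted sums reproduce the total of 276 and the mean of about 7.89 obtained from the stratum-level formulas.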

Activity 6.1 (Part 4)

Construct a Stratified Sample Estimator (Part 4)

Can a Stratified Sample be EPSEM?

(Reminder) EPSEM = inclusion probability (and thus sample weights) same for all population units

Consider this scenario:

Stratum	\(N_h\)	\(n_h\)
Stratum A	1,000	10
Stratum B	4,000	?
Stratum C	500	?

Can you come up with sample sizes for stratum B and stratum C that make the sample EPSEM?

An EPSEM Stratified Sample

Stratified sample is EPSEM if sampling fraction is the same in all strata
- I.e., if \(\frac{n_h}{N_h}\) is the same for all \(h\) (for all strata)
In an EPSEM stratified sample, sample weights are identical but this is not an SRS!
Variance calculations must take into account the design (stratification)
What if the population and your sample look like this:

Two histograms, one on top of the other, showing tri-modal distributions from the full population (top) and a single stratified sample (bottom)

Your variance estimator for the sample mean would be way too large if you treated this like an SRS!
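Both points can be illustrated with a short Python sketch. It first applies the equal-sampling-fraction rule to the scenario above, then compares the stratified variance estimate with the one obtained by wrongly treating the pooled sample as an SRS. The sampled \(y\) values are hypothetical, built to be tightly clustered around three well-separated stratum means.

```python
from fractions import Fraction
from statistics import variance

N_h = {"A": 1000, "B": 4000, "C": 500}
f = Fraction(10, N_h["A"])   # sampling fraction fixed by stratum A: 10 / 1000

# Equal sampling fraction in every stratum (exact here; in general n_h needs rounding)
n_h = {h: int(f * N_h[h]) for h in N_h}
weights = {h: Fraction(N_h[h], n_h[h]) for h in N_h}

# Hypothetical samples, tightly clustered around 10, 50 and 90 (tri-modal overall)
base = [-2, -1, 0, 1, 2]
sample = {
    "A": [10 + d for d in base] * 2,
    "B": [50 + d for d in base] * 8,
    "C": [90 + d for d in base],
}
N = sum(N_h.values())
n = sum(n_h.values())

# Stratified variance estimate of the mean
v_str = sum(
    (N_h[h] / N) ** 2 * (1 - n_h[h] / N_h[h]) * variance(sample[h]) / n_h[h]
    for h in N_h
)

# Variance estimate that (incorrectly) treats the pooled sample as an SRS
pooled = [y for h in N_h for y in sample[h]]
v_srs = (1 - n / N) * variance(pooled) / n

print(n_h, {h: int(w) for h, w in weights.items()})
print(f"stratified V-hat = {v_str:.4f}, SRS-style V-hat = {v_srs:.4f}")
```

The weights come out identical across strata, yet the SRS-style estimate is far larger because it counts the between-stratum spread as sampling variability, which the stratified design has removed.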