Stratified Sampling

PUBHBIO 7225 Lecture 6

Outline

Topics

  • Stratified Sampling
  • Design-based Theory for Stratified Samples
  • Sampling Weights

Activities

  • 6.1 Construct a Stratified Sample Estimator


Assignments

  • Problem Set 2 due Thursday 9/18/2025 11:59pm via Carmen

What is Stratified Sampling?

The process:

STEP 1: Divide the population into \(H\) subpopulations, called strata

  • Strata do not overlap (disjoint/mutually exclusive)
  • Each sampling unit belongs to exactly one stratum (collectively exhaustive)
  • Strata are identified before sampling (i.e., available on the frame)

STEP 2: Draw an independent probability sample from each stratum (could be an SRS, or other design), then pool the information to obtain overall population estimates

  • Observations within strata tend to be more homogeneous than observations in the population as a whole

  • Reduced variance within strata often leads to a reduced variance for the population-level estimate

  • Thus we want the stratification variable(s) to be related to the \(y\) variable(s) of interest, so that the mean of \(y\) differs across strata

Throughout this lecture we will assume an SRS is taken in each stratum

Why Stratify?

Reasons to stratify include:

  • Protect against (small) chance of getting a really bad sample

    • Example: sampling \(n=200\) OSU undergraduates
    • If taking an SRS, you might (by chance) get 0 students from outside Ohio
    • Stratifying by in-state/out-of-state guarantees this doesn’t happen
  • Want to ensure specific precision for certain subgroups

    • Example: want to make sure precision is similar for in-state and out-of-state students, even though their numbers in the population differ
  • May be a cost benefit (easier to administer survey in some strata)

  • Lower variance for overall estimates (e.g., estimates for \(\bar{y}_U\))

    • Variance of \(y\) within each stratum is often lower than the variance of \(y\) in the whole population
    • Can lead to cost savings because you can take a smaller sample

Activity 6.1 (Part 1)

Construct a Stratified Sample Estimator (Part 1)

Constructing an Estimator from a Stratified Sample

  • You took a sample of size \(n=15\) from a population of size \(N=35\)

  • But each of the 15 sampled units doesn’t “represent” the same number of population units

| Stratum | Units in Population | Units in Sample | Each Sampled Unit “Represents” |
|---|---|---|---|
| Stratum A | 15 | 5 | 15/5 = 3 |
| Stratum B | 11 | 5 | 11/5 = 2.2 |
| Stratum C | 9 | 5 | 9/5 = 1.8 |

  • This sample is not EPSEM, so the simple unweighted average of the sampled \(y\) will generally be biased

  • Larger strata are “under-represented” in the sample

    • 43% of the population comes from Stratum A (15/35), but

    • 33% of the sample comes from Stratum A (5/15)!

  • We need to “upweight” the units that “represent” more units
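The counts in the table above can be reproduced with a few lines of Python. This is only a sketch: the dictionary names `N_h` and `n_h` are chosen here for illustration and mirror the notation introduced on the next slide.

```python
# Stratum sizes from the activity: population counts N_h and sample counts n_h.
N_h = {"A": 15, "B": 11, "C": 9}
n_h = {"A": 5, "B": 5, "C": 5}

N = sum(N_h.values())  # total population size
n = sum(n_h.values())  # total sample size

# Number of population units each sampled unit "represents" in its stratum.
represents = {h: N_h[h] / n_h[h] for h in N_h}

# Share of the population vs. share of the sample that falls in each stratum.
pop_share = {h: N_h[h] / N for h in N_h}
sample_share = {h: n_h[h] / n for h in N_h}

for h in N_h:
    print(f"Stratum {h}: each sampled unit represents {represents[h]:.1f} units; "
          f"population share {pop_share[h]:.0%}, sample share {sample_share[h]:.0%}")
```

Strata whose population share exceeds their sample share (here Stratum A) are the ones whose units need to be upweighted.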

Notation for Stratified Sampling

  • \(U\) = the finite population
  • \(N\) = size of the population
  • \(H\) = number of strata
  • \(N_h\) = number of units in stratum \(h\) in the population, with \(\displaystyle \sum_{h=1}^H N_h = N\)
  • \(\mathcal{S}_h\) = a particular sample in stratum \(h\), with \(\displaystyle \bigcup_{h=1}^H \mathcal{S}_h = \mathcal{S}\)
  • \(n_h\) = size of the sample in stratum \(h\), with \(\displaystyle \sum_{h=1}^H n_h = n\)
  • \(y_{hj}\) = characteristic of interest for the \(j\)th unit in the \(h\)th stratum

Example (from Activity)

  • \(U\) = population (collection of all the units)
  • \(N=35\) (total number of units in the population)
  • \(H=3\) strata (A, B, C)

| Stratum | Population Size | Sample Size |
|---|---|---|
| A | \(N_1 = 15\) | \(n_1=5\) |
| B | \(N_2 = 11\) | \(n_2=5\) |
| C | \(N_3 = 9\) | \(n_3=5\) |
| Total | \(N=N_1+N_2+N_3=35\) | \(n=n_1+n_2+n_3=15\) |

  • Examples of \(y_{hj}\):
    • value of \(y\) for unit \(j=2\) in stratum \(h=1\) (stratum A) → \(y_{12} = 9\)
    • value of \(y\) for unit \(j=6\) in stratum \(h=2\) (stratum B) → \(y_{26} = 6\)
    • value of \(y\) for unit \(j=1\) in stratum \(h=3\) (stratum C) → \(y_{31} = 11\)

Estimates (Step 1): Stratum-Specific Estimates

| Quantity | Estimand (Truth) | Estimate |
|---|---|---|
| Stratum Mean | \(\displaystyle \bar{y}_{hU}=\frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj}\) | \(\displaystyle \bar{y}_h = \frac{1}{n_h} \sum_{j\in \mathcal{S}_h} y_{hj}\) |
| Stratum Total | \(\displaystyle t_h = \sum_{j=1}^{N_h} y_{hj}= N_h \bar{y}_{hU}\) | \(\displaystyle \hat{t}_h = N_h \bar{y}_h = \frac{N_h}{n_h} \sum_{j\in \mathcal{S}_h} y_{hj}\) |
| Stratum Variance (of \(y\)) | \(\displaystyle S_h^2=\frac{1}{N_h-1} \sum_{j=1}^{N_h} (y_{hj}- \bar{y}_{hU})^2\) | \(\displaystyle s_h^2=\frac{1}{n_h-1} \sum_{j\in \mathcal{S}_h} (y_{hj}- \bar{y}_h)^2\) |
| Stratum Proportion | \(\displaystyle p_{hU}=\bar{y}_{hU}=\frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj}\) | \(\displaystyle \hat{p}_h = \frac{1}{n_h} \sum_{j\in \mathcal{S}_h} y_{hj}\) |
| Stratum Variance (of binary \(y\)) | \(\displaystyle S_h^2 = \frac{N_h}{N_h-1} p_{hU}(1-p_{hU})\) | \(\displaystyle s_h^2 = \frac{n_h}{n_h-1} \hat{p}_{h}(1-\hat{p}_{h})\) |

  • These are just the SRS formulas, with an extra \(h\) subscript (and \(j\) instead of \(i\) for the unit)

Example (from Activity)

Suppose my activity sample was:

A table with data from three strata (A, B, and C). Rows with yellow highlighting indicate a stratified random sample. Stratum A has a sample of 5 rows, Stratum B has a sample of 5, and Stratum C has a sample of 5.

We can then calculate/estimate:

Stratum A \((h=1)\): \(N_1 = 15, n_1=5\)
Stratum A population mean: \[\begin{aligned} \bar{y}_{1U} &=\frac{1}{N_1} \sum_{j=1}^{N_1} y_{1j} = \frac{1}{15} \left(6+9+3+9+\cdots+5\right) = \textcolor{red}{\bf 4.4} \end{aligned}\]

Stratum A sample mean: \[\begin{aligned} \bar{y}_1 &= \frac{1}{n_1} \sum_{j\in \mathcal{S}_1} y_{1j} = \frac{1}{5} \left(9+10+1+3+5\right) = \textcolor{blue}{\bf 5.6} \end{aligned}\]

My estimate of the mean from my sample is 5.6; close to the true mean of 4.4

Not a new formula, just the SRS formula

Example (from Activity)

Stratum A population variance of \(y\):

\[\begin{aligned} S_h^2&=\frac{1}{N_h-1} \sum_{j=1}^{N_h} (y_{hj}- \bar{y}_{hU})^2 \\ S_1^2&=\frac{1}{N_1-1} \sum_{j=1}^{N_1} (y_{1j} - \bar{y}_{1U})^2 \\ &= \frac{1}{15-1} \sum_{j=1}^{15} (y_{1j} - 4.4)^2 \\ & = \frac{1}{15-1} \left[ (6-4.4)^2 + \cdots + (5-4.4)^2 \right] \\ &= \textcolor{red}{\bf 10.83} \end{aligned}\]

Stratum A sample variance of \(y\):

\[\begin{aligned} s_h^2&=\frac{1}{n_h-1} \sum_{j\in \mathcal{S}_h} (y_{hj}- \bar{y}_h)^2 \\ s_1^2&=\frac{1}{n_1-1} \sum_{j\in \mathcal{S}_1} (y_{1j} - \bar{y}_1)^2 \\ &= \frac{1}{5-1} \sum_{j\in \mathcal{S}_1} (y_{1j} - 5.6)^2 \\ &= \frac{1}{5-1}\left[ (9-5.6)^2 + \cdots + (5-5.6)^2 \right] \\ &= \textcolor{blue}{\bf 14.8} \end{aligned}\]

My estimate of \(S_1^2\) from my sample is 14.8; close-ish to the true variance of 10.83

Again, not a new formula, just the SRS formula
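As a quick check of the Stratum A arithmetic, here is a small Python sketch using only the standard library. The variable names (`sample_A`, `ybar_1`, `s2_1`, `t_hat_1`) are illustrative choices, not part of the lecture; the five sampled values are the ones shown in the sample-mean calculation.

```python
from statistics import mean, variance

N_1 = 15                     # population size of stratum A
sample_A = [9, 10, 1, 3, 5]  # sampled y-values in stratum A
n_1 = len(sample_A)

ybar_1 = mean(sample_A)      # stratum sample mean
s2_1 = variance(sample_A)    # sample variance (divisor n_1 - 1)
t_hat_1 = N_1 * ybar_1       # estimated stratum total, N_h * ybar_h

print(f"n_1 = {n_1}, ybar_1 = {ybar_1:.1f}, s2_1 = {s2_1:.1f}, t_hat_1 = {t_hat_1:.0f}")
```

Note that `statistics.variance` uses the \(n-1\) divisor, matching \(s_h^2\); `statistics.pvariance` would use \(n\) instead.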

Example (from Activity)

Similar calculations for the other 2 strata yield:

Means:

| Stratum | Population | My Sample |
|---|---|---|
| Stratum A \((h=1)\) | \(\bar{y}_{1U} = 4.4\) | \(\bar{y}_1 = 5.6\) |
| Stratum B \((h=2)\) | \(\bar{y}_{2U} = 8.36\) | \(\bar{y}_2 = 7.8\) |
| Stratum C \((h=3)\) | \(\bar{y}_{3U} = 11.22\) | \(\bar{y}_3 = 11.8\) |

Variances: (variance of \(y\), not of the mean of \(y\))

| Stratum | Population | My Sample |
|---|---|---|
| Stratum A \((h=1)\) | \(S_1^2 = 10.83\) | \(s_1^2 = 14.8\) |
| Stratum B \((h=2)\) | \(S_2^2 = 6.45\) | \(s_2^2 = 6.7\) |
| Stratum C \((h=3)\) | \(S_3^2 = 7.44\) | \(s_3^2 = 6.7\) |

Next step will be to combine the stratum estimates to get overall estimate (of the mean)

Activity 6.1 (Part 2)

Construct a Stratified Sample Estimator (Part 2)

Estimates (Step 2): Combining Stratum Estimates

Then combine stratum-specific estimates to get overall estimates:

| Quantity | Estimand (Truth) | Estimate |
|---|---|---|
| Overall Total | \(\displaystyle t = \sum_{h=1}^H \sum_{j=1}^{N_h} y_{hj}= \sum_{h=1}^H t_h = \sum_{h=1}^H N_h \bar{y}_{hU}\) | \(\displaystyle \hat{t}_{str}= \sum_{h=1}^H \hat{t}_h= \sum_{h=1}^H N_h \bar{y}_h\) |
| Overall Mean | \(\displaystyle \bar{y}_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^H N_h \bar{y}_{hU} = \sum_{h=1}^H \frac{N_h}{N} \bar{y}_{hU}\) | \(\displaystyle \bar{y}_{str}= \sum_{h=1}^H \frac{N_h}{N} \bar{y}_h\) |
| Overall Proportion | \(\displaystyle p = \bar{y}_U = \sum_{h=1}^H \frac{N_h}{N} p_{hU}\) | \(\displaystyle \hat{p}_{str}= \sum_{h=1}^H \frac{N_h}{N} \hat{p}_h\) |

Intuition:

  • Total: sum up the stratum totals
  • Means/Proportions: weighted average of stratum quantities, weighting by the proportion of the population in each stratum

Example (from Activity)

| Stratum | \(N_h\) | \(n_h\) | Population Mean | Population Variance | My Sample Mean | My Sample Variance |
|---|---|---|---|---|---|---|
| Stratum A | \(N_1=15\) | \(n_1=5\) | \(\bar{y}_{1U} = 4.4\) | \(S_1^2 = 10.83\) | \(\bar{y}_1 = 5.6\) | \(s_1^2 = 14.8\) |
| Stratum B | \(N_2=11\) | \(n_2=5\) | \(\bar{y}_{2U} = 8.36\) | \(S_2^2 = 6.45\) | \(\bar{y}_2 = 7.8\) | \(s_2^2 = 6.7\) |
| Stratum C | \(N_3=9\) | \(n_3=5\) | \(\bar{y}_{3U} = 11.22\) | \(S_3^2 = 7.44\) | \(\bar{y}_3 = 11.8\) | \(s_3^2 = 6.7\) |

Overall population mean: \[\bar{y}_U = \sum_{h=1}^H \frac{N_h}{N} \bar{y}_{hU} = \frac{N_1}{N} \bar{y}_{1U} + \frac{N_2}{N} \bar{y}_{2U} + \frac{N_3}{N} \bar{y}_{3U} = \frac{15}{35} (4.4) + \frac{11}{35} (8.36) + \frac{9}{35} (11.22) = \textcolor{red}{\bf 7.4}\]

My sample estimate of overall mean: \[\begin{aligned} \bar{y}_{str}= \sum_{h=1}^H \frac{N_h}{N} \bar{y}_h &= \frac{N_1}{N} \bar{y}_1 + \frac{N_2}{N} \bar{y}_2 + \frac{N_3}{N} \bar{y}_3 = \frac{15}{35} (5.6) + \frac{11}{35} (7.8) + \frac{9}{35} (11.8) = \textcolor{blue}{\bf 7.89} \\ \end{aligned}\]
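The combination step is a weighted average with weights \(N_h/N\). A minimal Python sketch of it, using the stratum sample means reported above (the list names `N_h` and `ybar_h` are illustrative):

```python
# Stratum population sizes and stratum sample means from the activity (A, B, C).
N_h = [15, 11, 9]
ybar_h = [5.6, 7.8, 11.8]

N = sum(N_h)

# Estimated total: sum of the estimated stratum totals N_h * ybar_h.
t_hat_str = sum(Nh * yb for Nh, yb in zip(N_h, ybar_h))

# Estimated mean: weighted average of stratum means with weights N_h / N.
stratum_weights = [Nh / N for Nh in N_h]
ybar_str = sum(W * yb for W, yb in zip(stratum_weights, ybar_h))

print(f"t_hat_str = {t_hat_str:.1f}, ybar_str = {ybar_str:.2f}")
```

The two routes agree: `ybar_str` equals `t_hat_str / N`, and the stratum weights \(N_h/N\) sum to 1.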

Expectation and Variance of the Estimated Mean

  • Since we are taking an SRS within each stratum, stratum-specific estimates are unbiased: \[E(\bar{y}_h) = \bar{y}_{hU} \quad\quad\quad E(\hat{t}_h) = t_h \quad\quad\quad E(\hat{p}_h) = p_{hU}\]

  • This implies that: \[\begin{aligned} E(\bar{y}_{str}) &= E\left(\sum_{h=1}^H \frac{N_h}{N} \bar{y}_h \right) = \sum_{h=1}^H \frac{N_h}{N} E(\bar{y}_h) = \sum_{h=1}^H \frac{N_h}{N} \bar{y}_{hU} = \bar{y}_U \end{aligned}\]

  • Samples are taken independently in each stratum.

  • Remember that if \(X\) and \(Y\) are independent, \(V(X+Y) = V(X)+V(Y)\). Thus: \[\begin{aligned} V(\bar{y}_{str}) &= V\left(\sum_{h=1}^H \frac{N_h}{N} \bar{y}_h \right) = \sum_{h=1}^H V \left( \frac{N_h}{N} \bar{y}_h \right) \\ &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 V(\bar{y}_h) = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \end{aligned}\]

Estimators are Random Variables

| Estimate | Expected Value | Variance Formula |
|---|---|---|
| Mean: \(\bar{y}_{str}\) | \(E(\bar{y}_{str})= \bar{y}_U\) | Truth: \(\displaystyle V(\bar{y}_{str})= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\) |
| | | Estimate: \(\displaystyle \widehat{V}(\bar{y}_{str})= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\) |
| Total: \(\hat{t}_{str}\) | \(E(\hat{t}_{str})=t\) | Truth: \(\displaystyle V(\hat{t}_{str})=\sum_{h=1}^H N_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\) |
| | | Estimate: \(\displaystyle \widehat{V}(\hat{t}_{str})=\sum_{h=1}^H N_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\) |
| Proportion: \(\hat{p}_{str}\) | \(E(\hat{p}_{str})=p\) | Truth: \(\displaystyle V(\hat{p}_{str})= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(\frac{N_h-n_h}{N_h-1}\right) \frac{p_{hU}(1-p_{hU})}{n_h}\) |
| | | Estimate: \(\displaystyle \widehat{V}(\hat{p}_{str}) = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{\hat{p}_h(1-\hat{p}_h)}{n_h-1}\) |

Example (from Activity)

True sampling variance of \(\bar{y}_{str}\): \[\begin{aligned} V(\bar{y}_{str}) &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} \\ & = \left(\frac{15}{35}\right)^2 \left(1 - \frac{5}{15} \right) \frac{10.83}{5} + \left(\frac{11}{35}\right)^2 \left(1 - \frac{5}{11} \right) \frac{6.45}{5} + \left(\frac{9}{35}\right)^2 \left(1 - \frac{5}{9} \right) \frac{7.44}{5} = \textcolor{red}{0.378} \end{aligned}\]

Example (from Activity)

Estimate of sampling variance of \(\bar{y}_{str}\): \[\begin{aligned} \widehat{V}(\bar{y}_{str}) &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h} \\ & = \left(\frac{15}{35}\right)^2 \left(1 - \frac{5}{15} \right) \frac{14.8}{5} + \left(\frac{11}{35}\right)^2 \left(1 - \frac{5}{11} \right) \frac{6.7}{5} + \left(\frac{9}{35}\right)^2 \left(1 - \frac{5}{9} \right) \frac{6.7}{5} = \textcolor{blue}{0.474} \end{aligned}\]
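Both variance calculations use the same formula and differ only in whether the population variances \(S_h^2\) or the sample variances \(s_h^2\) are plugged in. The helper below is a sketch with a hypothetical name (`stratified_mean_variance`); it assumes an SRS within each stratum, as the lecture does.

```python
from math import sqrt

def stratified_mean_variance(N_h, n_h, var_h):
    """Variance of the stratified mean under SRS within strata:
    sum over strata of (N_h/N)^2 * (1 - n_h/N_h) * var_h / n_h."""
    N = sum(N_h)
    return sum((Nh / N) ** 2 * (1 - nh / Nh) * v / nh
               for Nh, nh, v in zip(N_h, n_h, var_h))

N_h = [15, 11, 9]
n_h = [5, 5, 5]
S2_h = [10.83, 6.45, 7.44]  # population variances (known only because this is an activity)
s2_h = [14.8, 6.7, 6.7]     # sample variances from "my sample"

v_true = stratified_mean_variance(N_h, n_h, S2_h)  # true sampling variance
v_hat = stratified_mean_variance(N_h, n_h, s2_h)   # estimated sampling variance

print(f"V = {v_true:.3f}, V-hat = {v_hat:.3f}, estimated SE = {sqrt(v_hat):.3f}")
```

In a real survey only `v_hat` is available, since the \(S_h^2\) are unknown.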

Activity 6.1 (Part 3)

Construct a Stratified Sample Estimator (Part 3)

Simulated Data from Activity

  • I repeated the activity 10,000 times (using R!), taking two types of samples:

    1. SRS of size \(n=15\)
       • \(\bar{y}=\) unweighted average of the 15 sampled units (SRS estimator)
    2. Stratified sample of \(n_h=5\) in each stratum (total of 15 units sampled)
       • \(\bar{y}_{str}=\) weighted average of the stratum means (stratified sample estimator)
  • Resulting estimates of \(\bar{y}_U\) (averaged over replicates):

Two histograms, one on top of the other, showing the normal-shaped distributions of the mean from an SRS (top) and the mean from a stratified sample (bottom)

| Method | Mean | Variance of Mean |
|---|---|---|
| SRS \((\bar{y})\) | 7.3991 | 0.6137 |
| Stratified \((\bar{y}_{str})\) | 7.4044 | 0.3805 |
| Truth \((\bar{y}_U)\) | 7.4 | |

Stratified estimator: unbiased and more precise!
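A simulation in the same spirit can be sketched in Python (the lecture used R). The population below is HYPOTHETICAL: it is randomly generated with three strata of sizes 15, 11, and 9 and different stratum means, and it is not the activity data, so the numbers it prints will not match the table above. The function names and the seed are likewise illustrative choices.

```python
import random
from statistics import mean, pvariance

rng = random.Random(7225)  # fixed seed so the sketch is reproducible

# HYPOTHETICAL population (not the activity data): three strata with different means.
population = {
    "A": [rng.gauss(4, 3) for _ in range(15)],
    "B": [rng.gauss(8, 2.5) for _ in range(11)],
    "C": [rng.gauss(11, 2.5) for _ in range(9)],
}
all_units = [y for ys in population.values() for y in ys]
N = len(all_units)
true_mean = mean(all_units)

def srs_estimate(n=15):
    """Unweighted mean of an SRS of n units drawn from the whole population."""
    return mean(rng.sample(all_units, n))

def stratified_estimate(n_h=5):
    """Weighted average of stratum sample means, with weights N_h / N."""
    return sum(len(ys) / N * mean(rng.sample(ys, n_h))
               for ys in population.values())

reps = 5000
srs = [srs_estimate() for _ in range(reps)]
strat = [stratified_estimate() for _ in range(reps)]

print(f"truth       {true_mean:.3f}")
print(f"SRS         mean {mean(srs):.3f}  variance {pvariance(srs):.3f}")
print(f"stratified  mean {mean(strat):.3f}  variance {pvariance(strat):.3f}")
```

Both estimators should average out close to the true mean, with the stratified estimator showing the smaller variance across replicates because the strata means differ.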

Confidence Intervals for Stratified Samples

  • If the sample sizes within the strata (\(n_h\)) are all large, or the number of strata (\(H\)) is large, then an approximate \(100(1-\alpha)\%\) confidence interval for the true mean \(\bar{y}_U\) is given by: \[\left( \bar{y}_{str}- z_{\alpha/2} \sqrt{\widehat{V}(\bar{y}_{str})}, \quad \bar{y}_{str}+ z_{\alpha/2} \sqrt{\widehat{V}(\bar{y}_{str})} \right)\] where \(z_{\alpha/2}\) is the \(100(1-\alpha/2)\)th percentile of the standard normal distribution

  • In practice, a critical value from the \({t}\) distribution with DF = \(n-H\) is often used in place of \(z_{\alpha/2}\).

    • \(\textcolor{blue}{DF = n-H}\) should remind you of ANOVA

      • One-way ANOVA: denominator DF = \(N-k\) = # observations \(-\) # groups

      • Stratified Sampling: DF = \(n-H\) = # observations \(-\) # strata

    • We lose a DF for each group/stratum mean we estimate

    • You may see this DF “rule” written as: DF = # of PSUs – # of strata

      • PSU = Primary Sampling Unit (which so far has been the observation unit)
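Applied to the activity sample, the two intervals can be sketched as follows. The inputs are the estimate \(\bar{y}_{str} = 276/35 \approx 7.89\) and \(\widehat{V}(\bar{y}_{str}) \approx 0.474\) from the earlier example. One assumption to flag: the \(t\) critical value 2.179 (97.5th percentile, 12 DF) is hard-coded from a standard \(t\) table so the sketch needs only the standard library. With \(n_h = 5\) per stratum and \(H = 3\), the large-sample conditions are not really met, so this is purely illustrative.

```python
from math import sqrt
from statistics import NormalDist

ybar_str = 276 / 35  # stratified estimate of the mean from the activity
v_hat = 0.474        # estimated variance of ybar_str from the activity
n, H = 15, 3
alpha = 0.05

se = sqrt(v_hat)

# Normal-based interval: z is the 100(1 - alpha/2)th percentile of N(0, 1).
z = NormalDist().inv_cdf(1 - alpha / 2)
ci_z = (ybar_str - z * se, ybar_str + z * se)

# t-based interval with DF = n - H.
# ASSUMPTION: 2.179 is the 97.5th percentile of t with 12 DF, taken from a t table.
df = n - H
t_crit = 2.179
ci_t = (ybar_str - t_crit * se, ybar_str + t_crit * se)

print(f"z interval: ({ci_z[0]:.2f}, {ci_z[1]:.2f})")
print(f"t interval (DF = {df}): ({ci_t[0]:.2f}, {ci_t[1]:.2f})")
```

The \(t\) interval is wider, reflecting the extra uncertainty from estimating the stratum variances with few observations.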

Selection Probabilities and Sampling Weights

When an SRS is taken in each stratum:

Selection (Inclusion) Probabilities

  • Stratum \(h\) has \(N_h\) units

  • Take an SRS of \(n_h\) units

  • Thus, \(P(\)unit \(j\) in stratum \(h\) selected\() = \textcolor{red}{\pi_{hj} = \frac{\text{Sample size in the stratum}}{\text{Population size in the stratum}} = \frac{n_h}{N_h}}\)

Sampling Weights

  • Sample weight is the inverse of the selection (inclusion) probability

  • Thus, sample weight for unit \(j\) in stratum \(h\) = \(\textcolor{blue}{w_{hj} = \frac{1}{\pi_{hj}}=\frac{N_h}{n_h}}\)

Note also that the sum of the weights for sampled units = population size, \(N\): \[\sum_{i \in \mathcal{S}} w_i = \sum_{h=1}^H \sum_{j\in \mathcal{S}_h} w_{hj} = \sum_{h=1}^H \sum_{j\in \mathcal{S}_h} \frac{N_h}{n_h} = \sum_{h=1}^H n_h \frac{N_h}{n_h} = \sum_{h=1}^H N_h = N\]
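For the activity design, the inclusion probabilities, weights, and the sum-to-\(N\) property can be checked with a short Python sketch (names such as `pi` and `weights` are illustrative):

```python
N_h = {"A": 15, "B": 11, "C": 9}
n_h = {"A": 5, "B": 5, "C": 5}

# Inclusion probability for every unit in stratum h: n_h / N_h.
pi = {h: n_h[h] / N_h[h] for h in N_h}

# Sampling weight: inverse of the inclusion probability, i.e. N_h / n_h.
w = {h: 1 / pi[h] for h in N_h}

# One weight per sampled unit (n_h copies of w_h in each stratum).
weights = [w[h] for h in N_h for _ in range(n_h[h])]

print({h: round(w[h], 2) for h in w})
print(f"sum of weights = {sum(weights):.1f}")
```

The weights match the "each sampled unit represents" column from the activity (3, 2.2, 1.8), and they add up to the population size.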

Horvitz-Thompson Estimators

  • We can re-write the estimate of the (overall) total using the weights: \[\begin{aligned} \hat{t}_{str}&= \sum_{h=1}^H N_h \textcolor{myGreen}{\bar{y}_h} = \sum_{h=1}^H N_h \textcolor{myGreen}{\frac{1}{n_h} \sum_{j\in \mathcal{S}_h} y_{hj}} = \sum_{h=1}^H \sum_{j\in \mathcal{S}_h} \textcolor{red}{\frac{N_h}{n_h}} y_{hj}= \sum_{h=1}^H \sum_{j\in \mathcal{S}_h} \textcolor{red}{w_{hj}} y_{hj}\\ \text{(re-index) } &= \sum_{i \in \mathcal{S}} \textcolor{red}{w_i} y_i \end{aligned}\]

  • This is the Horvitz-Thompson estimator (of the total)!

  • Similarly, the estimate of the (overall) mean written using the weights: \[\bar{y}_{str}= \sum_{h=1}^H \frac{N_h}{N} \bar{y}_h = \frac{\hat{t}_{str}}{N} = \frac{\sum_{i \in \mathcal{S}} w_i y_i}{\sum_{i \in \mathcal{S}} w_i}\]

  • This is the Horvitz-Thompson estimator of the mean

(not something new, just showing that the stratified sampling formulas to estimate the mean and total are the Horvitz-Thompson estimators)
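To see the equivalence numerically, the sketch below computes the total both ways: stratum by stratum, and as a single weighted sum over sampled units. The Stratum A values are the ones shown earlier. The Stratum B and C values are HYPOTHETICAL, chosen only so that their means equal the reported 7.8 and 11.8; the actual sampled values for those strata are not listed in these notes.

```python
N_h = {"A": 15, "B": 11, "C": 9}
sample = {
    "A": [9, 10, 1, 3, 5],     # sampled values shown in the Stratum A example
    "B": [5, 7, 8, 9, 10],     # HYPOTHETICAL values with mean 7.8
    "C": [9, 11, 12, 13, 14],  # HYPOTHETICAL values with mean 11.8
}

# Stratum-by-stratum form: sum over strata of N_h * ybar_h.
t_hat_str = sum(N_h[h] * sum(ys) / len(ys) for h, ys in sample.items())

# Weighted form: one (weight, y) record per sampled unit, weight = N_h / n_h.
records = [(N_h[h] / len(ys), y) for h, ys in sample.items() for y in ys]
t_hat_ht = sum(w * y for w, y in records)
ybar_ht = t_hat_ht / sum(w for w, _ in records)

print(f"t_hat_str = {t_hat_str:.1f}, t_hat_ht = {t_hat_ht:.1f}, ybar = {ybar_ht:.2f}")
```

Once each record carries its weight, the stratum labels are no longer needed for point estimates, although they are still needed for variance estimation.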

Activity 6.1 (Part 4)

Construct a Stratified Sample Estimator (Part 4)

Can a Stratified Sample be EPSEM?

(Reminder) EPSEM = inclusion probability (and thus sample weights) same for all population units

  • Consider this scenario:

| Stratum | \(N_h\) | \(n_h\) |
|---|---|---|
| Stratum A | 1,000 | 10 |
| Stratum B | 4,000 | ? |
| Stratum C | 500 | ? |

  • Can you come up with sample sizes for stratum B and stratum C that make the sample EPSEM?

An EPSEM Stratified Sample

  • Stratified sample is EPSEM if sampling fraction is the same in all strata

    • I.e., if \(\frac{n_h}{N_h}\) is the same for all \(h\) (for all strata)
  • In an EPSEM stratified sample, sample weights are identical but this is not an SRS!

  • Variance calculations must take into account the design (stratification)

  • What if the population and your sample look like this:

Two histograms, one on top of the other, showing tri-modal distributions from the full population (top) and a single stratified sample (bottom)
  • Your variance estimator for the sample mean would be way too large if you treated this like an SRS!
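Returning to the scenario with strata of 1,000, 4,000, and 500 units: applying the equal-sampling-fraction rule, with the fraction fixed by Stratum A at 10/1,000, gives the remaining sample sizes. A minimal sketch using exact fractions (the variable names are illustrative):

```python
from fractions import Fraction

N_h = {"A": 1000, "B": 4000, "C": 500}

# Sampling fraction fixed by stratum A: n_A / N_A = 10 / 1000.
f = Fraction(10, 1000)

# EPSEM requires n_h / N_h = f in every stratum, so n_h = f * N_h.
n_h = {h: f * N for h, N in N_h.items()}

# Sampling weights N_h / n_h are then identical across strata.
weights = {h: Fraction(N_h[h]) / n_h[h] for h in N_h}

for h in N_h:
    print(f"Stratum {h}: n_h = {int(n_h[h])}, weight = {int(weights[h])}")
```

Every sampled unit ends up with the same weight, yet the design is still stratified, so the stratified variance formulas remain the appropriate ones.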