PUBHBIO 7225 Lecture 10
To use a variable for stratification at the design stage, you must know the value of that variable for all units in the population before you conduct the survey
For some surveys, frame data may be quite limited (especially phone surveys)
Example: Ohio Medicaid Assessment Survey (OMAS) 2017
How can we reconcile this? It seems like we oversampled white people...
You are interested in student perception of campus culture at OSU (Columbus Campus)
There are \(N=\) 61443 students on the Columbus campus
You have access to a sampling frame that contains email addresses, but no additional information on the students (i.e., rank, gender, etc.)
Thus, you take a simple random sample of \(n=\) 100 students and send them a survey (cannot stratify on anything b/c your sampling frame doesn’t contain anything useful)
Probability of selection and sample weight for each of the \(n=\) 100 students sampled: \(\pi_i = n/N =\) 100/61443 \(\approx\) 0.0016, so \(w_i = 1/\pi_i = N/n =\) 61443/100 = 614.43
Miraculously, everyone replies (we will get to nonresponse later!)
On the survey you ask questions about:
Level: Undergraduate, Graduate, Professional
Residency: From Ohio, Not from Ohio
Your sample has 67 undergrads, 23 grad students, 10 professional students
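Since this is an SRS (every sampled student has the same weight), unweighted proportions are the same as weighted proportions
Resulting estimates: 67% undergraduate, 23% graduate, 10% professional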
But, you know from the OSU Statistical Summary that there are 46815 undergraduates, 11404 graduate students, and 3224 professional students (Statistical Summary 2024-2025)
Comparing the proportions:
| Level | Sample Count | Sample Proportion | Population Count | Population Proportion |
|---|---|---|---|---|
| Undergraduate | 67 | 0.67 | 46815 | 46815/61443 = 0.762 |
| Graduate | 23 | 0.23 | 11404 | 11404/61443 = 0.186 |
| Professional | 10 | 0.10 | 3224 | 3224/61443 = 0.052 |
| Total | 100 | 1.00 | 61443 | 1.000 |
It seems that grad students and professional students are over-represented in your sample, and undergrads are under-represented
What can you do about this?
Each student in your sample had a weight of \(w_i = N/n =\) 61443/100 = 614.43
Thus, our estimates of the population size in each group:
| Group | # in sample | Weight \((w_i)\) | Estimated Total \((\hat{t})\) | True Total \((t)\) |
|---|---|---|---|---|
| Undergrads | 67 | 614.43 | 67 \(\times\) 614.43 = 41166.8 | 46815 |
| Grads | 23 | 614.43 | 23 \(\times\) 614.43 = 14131.9 | 11404 |
| Profs | 10 | 614.43 | 10 \(\times\) 614.43 = 6144.3 | 3224 |
Our estimates are not equal to the known truth
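The comparison of estimated and true totals can be reproduced in a few lines. This is a minimal Python sketch using the numbers from the example; the variable names are illustrative and not part of the original lecture.

```python
# Estimated group totals under the original SRS weights (numbers from the example).
N = 61443   # students on the Columbus campus
n = 100     # SRS sample size

sample_counts = {"Undergrad": 67, "Grad": 23, "Professional": 10}
true_totals = {"Undergrad": 46815, "Grad": 11404, "Professional": 3224}

base_weight = N / n   # each sampled student represents N/n students

# Estimated total in a group = sum of the weights of the sampled units in that group
est_totals = {g: count * base_weight for g, count in sample_counts.items()}

for g in sample_counts:
    print(f"{g:<13} estimated {est_totals[g]:9.1f}   true {true_totals[g]:6d}")
```

The estimated totals add up to \(N\), but the split across groups does not match the known counts.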
But, we can do something about this!
We can adjust the weights within each group so that estimated totals equal true totals
This is called Poststratification
Poststratification = dividing the sample into subgroups (post-strata) based on known population characteristics and adjusting the survey weights within those subgroups to match the population totals
Steps for poststratification:
Estimate the population total in each post-stratum using the original weights: \(\widehat{N}_h = \sum_{j \in S_h} w_{hj}\)
Compare each estimated total \(\widehat{N}_h\) to the known true total \(N_h\)
Multiply the original weights in each post-stratum by the adjustment factor \(N_h / \widehat{N}_h\)
This should remind you of stratified sampling: there is a different weight in each (post)stratum. But now the \(n_h\) are not fixed in advance, so there is added variability
For unit \(j\) in post-stratum \(h\): \[\begin{aligned} \text{\textcolor{blue}{Poststratified weight}} &= \text{\textcolor{red}{original weight}} \times \text{\textcolor{purple}{adjustment factor}} & \\ \textcolor{blue}{w_{hj}^{PS}} &= \textcolor{red}{w_{hj}} \times \textcolor{purple}{\frac{\text{Population total in post-stratum}~h}{\text{Estimated total in post-stratum}~h}} \\ & = \textcolor{red}{w_{hj}} \times \textcolor{purple}{\frac{N_h}{\widehat{N}_h}} \end{aligned}\]
You can poststratify on multiple variables if you know the population totals in their crosstabulation
| Post-stratum | Sample Size (\(n_h\)) | Orig. Weight (\(w_{hj}\)) | Est. Total (\(\widehat{N}_h\)) | True Total (\(N_h\)) | Poststratified Weight \(w_{hj}^{PS} = w_{hj} \times \frac{N_h}{\widehat{N}_h}\) |
|---|---|---|---|---|---|
| Undergrad | 67 | 614.43 | 41166.81 | 46815 | 614.43 \(\times\) 46815 / 41166.81 = 698.7 |
| Grad | 23 | 614.43 | 14131.89 | 11404 | 614.43 \(\times\) 11404 / 14131.89 = 495.8 |
| Professional | 10 | 614.43 | 6144.3 | 3224 | 614.43 \(\times\) 3224 / 6144.3 = 322.4 |
Notice that the weights went up for undergrads, down for grad and professional
These new weights, \(w_{hj}^{PS}\), replace the original weights, \(w_{hj}\) for all calculations
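The weight adjustment can be sketched in Python as follows. This is an illustrative sketch with made-up variable names, assuming the SRS from the example, where every unit starts with the same weight \(N/n\).

```python
# Poststratified weights: original weight times N_h / N-hat_h (numbers from the example).
N = 61443
n = 100
sample_counts = {"Undergrad": 67, "Grad": 23, "Professional": 10}      # n_h
true_totals = {"Undergrad": 46815, "Grad": 11404, "Professional": 3224}  # N_h

base_weight = N / n   # original SRS weight, identical for every unit

ps_weights = {}
for g, n_h in sample_counts.items():
    est_total = n_h * base_weight             # N-hat_h under the original weights
    adjustment = true_totals[g] / est_total   # N_h / N-hat_h
    ps_weights[g] = base_weight * adjustment  # poststratified weight

for g, w in ps_weights.items():
    print(f"{g:<13} poststratified weight = {w:.1f}")
```

Because the original weights are equal within a post-stratum here, the poststratified weight simplifies to \(N_h / n_h\), which is the weight a design-stage stratified sample with these \(n_h\) would have had.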
This effectively creates a stratified sample → use stratified sampling formulae for estimates
| Post-stratum | Sample Proportion from Ohio (\(\hat{p}_h\)) | True Stratum Population Size (\(N_h\)) |
|---|---|---|
| Undergrad \((h=1)\) | 49/67 = 0.7313433 | 46815 |
| Grad \((h=2)\) | 14/23 = 0.6086957 | 11404 |
| Professional \((h=3)\) | 4/10 = 0.4 | 3224 |
| Total | 67/100 = 0.67 | 61443 |
Poststratified estimate of overall proportion from Ohio: \[\begin{flalign} \hat{p}_{PS} &= \sum_{h=1}^H \frac{N_h}{N} \hat{p}_h = \frac{N_1}{N} \hat{p}_1 + \frac{N_2}{N} \hat{p}_2 + \frac{N_3}{N} \hat{p}_3 & \\ &= \frac{\text{46815}}{\text{61443}} \times \text{0.731} + \frac{\text{11404}}{\text{61443}} \times \text{0.609} + \frac{\text{3224}}{\text{61443}} \times \text{0.4} = \textbf{0.691} \end{flalign}\]
Slightly higher than the original estimate of 67% – undergrads were underrepresented in the sample, and they are more likely to be from Ohio
Let \(y_i = \begin{cases} 1 & \text{person $i$ is from Ohio} \\ 0 & \text{person $i$ is NOT from Ohio} \end{cases}\)
Poststratified weights: \(w_i^{PS} = \begin{cases} \text{698.7} & \text{Undergrad (stratum 1)} \\ \text{495.8} & \text{Grad (stratum 2)} \\ \text{322.4} & \text{Professional (stratum 3)} \end{cases}\) \[\begin{aligned} \hat{t}_{PS} = \sum_{i \in S} w_i^{PS} y_i &= \sum_{h = 1}^{3} \sum_{j \in S_h} w_{hj}^{PS} y_{hj} = \underbrace{\sum_{j \in S_1} \text{698.7} \times y_{1j}}_{\text{stratum 1}} + \underbrace{\sum_{j \in S_2} \text{495.8} \times y_{2j}}_{\text{stratum 2}} + \underbrace{\sum_{j \in S_3} \text{322.4} \times y_{3j}}_{\text{stratum 3}} \\ &= \text{698.7} \times \underbrace{\sum_{j \in S_1} y_{1j}}_{\text{\# from OH}} + \text{495.8} \times \underbrace{\sum_{j \in S_2} y_{2j}}_{\text{\# from OH}} + \text{322.4} \times \underbrace{\sum_{j \in S_3} y_{3j}}_{\text{\# from OH}}\\ & = \text{698.7} \times \text{49} + \text{495.8} \times \text{14} + \text{322.4} \times \text{4} = \text{42469} \end{aligned}\]
Thus \(\hat{p}_{PS} = \frac{\hat{t}_{PS}}{N} = \frac{\text{42469}}{\text{61443}} = \text{0.691}\)
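Both routes to the poststratified proportion (weighted average of post-stratum proportions, and weighted total divided by \(N\)) can be checked numerically. A minimal Python sketch with illustrative names, using unrounded weights \(N_h/n_h\):

```python
# Two equivalent ways to compute the poststratified proportion from Ohio.
N_h = {"Undergrad": 46815, "Grad": 11404, "Professional": 3224}   # known population totals
n_h = {"Undergrad": 67, "Grad": 23, "Professional": 10}           # sample sizes
ohio = {"Undergrad": 49, "Grad": 14, "Professional": 4}           # sampled students from Ohio
N = sum(N_h.values())

# Route 1: stratified-style formula, sum over h of (N_h / N) * p-hat_h
p_from_strata = sum(N_h[g] / N * ohio[g] / n_h[g] for g in N_h)

# Route 2: weighted total with poststratified weights, then divide by N
w_ps = {g: N_h[g] / n_h[g] for g in N_h}        # poststratified weight under SRS
t_ps = sum(w_ps[g] * ohio[g] for g in N_h)      # estimated number of students from Ohio
p_from_weights = t_ps / N

print(f"route 1: {p_from_strata:.3f}   route 2: {p_from_weights:.3f}   total: {t_ps:.0f}")
```

Using the rounded weights (698.7, 495.8, 322.4) gives a total that differs slightly from 42469, which is why the sketch keeps full precision.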
Unlike stratification at the design stage, the \(n_h\) in poststrata are random variables
Take a different SRS and the number of sampled units in each poststratum will be different (unlike the fixed sizes of design strata)
But the poststratified mean is still an unbiased estimator of the true mean \[\begin{aligned}
E[\bar{y}_{PS}] &= E\left[\sum_{h=1}^H \frac{N_h}{N} \bar{y}_h \right] = E\left[ E\left(\sum_{h=1}^H \frac{N_h}{N} \bar{y}_h \bigg\vert n_1, n_2, \dots, n_H \right) \right] \qquad \text{\small (condition on fixed poststratum sizes)} \\
&= E\left[ \sum_{h=1}^H \frac{N_h}{N} E\left( \bar{y}_h \bigg\vert n_1, n_2, \dots, n_H \right) \right] = E\left[ \sum_{h=1}^H \frac{N_h}{N} \bar{y}_{hU} \right] = \bar{y}_U
\end{aligned}\]
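A small simulation can illustrate both points: the \(n_h\) change from sample to sample, yet the poststratified estimates average out to the true mean. This is a sketch only. The post-stratum sizes come from the example, but the number of students from Ohio in each post-stratum is hypothetical, chosen just to give the simulation a known truth.

```python
import random

N_h = [46815, 11404, 3224]      # post-stratum sizes from the example
ohio_h = [34000, 7000, 1300]    # HYPOTHETICAL counts from Ohio in each post-stratum
N = sum(N_h)
true_mean = sum(ohio_h) / N

# Units are indexed 0..N-1; each post-stratum is a contiguous index range, and the
# first ohio_h[h] units of each range have y = 1 (from Ohio), the rest y = 0.
starts = [0, N_h[0], N_h[0] + N_h[1]]

def stratum_and_y(i):
    """Return (post-stratum index, y value) for population unit i."""
    for h in (2, 1, 0):
        if i >= starts[h]:
            return h, 1 if i - starts[h] < ohio_h[h] else 0

def poststratified_mean(sample):
    """Sum over h of (N_h / N) * ybar_h, or None if a post-stratum is empty."""
    n_h = [0, 0, 0]
    y_sum = [0, 0, 0]
    for i in sample:
        h, y = stratum_and_y(i)
        n_h[h] += 1
        y_sum[h] += y
    if min(n_h) == 0:
        return None   # estimator is undefined when a post-stratum has no sampled units
    return sum(N_h[h] / N * y_sum[h] / n_h[h] for h in range(3))

rng = random.Random(7225)   # fixed seed so the run is reproducible
estimates = []
for _ in range(2000):
    est = poststratified_mean(rng.sample(range(N), 100))   # SRS without replacement
    if est is not None:
        estimates.append(est)

avg_estimate = sum(estimates) / len(estimates)
print(f"true mean = {true_mean:.4f}, average of estimates = {avg_estimate:.4f}")
```

The rare samples with an empty post-stratum are dropped here for simplicity; in practice one would collapse post-strata instead.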
So why would we ever stratify at the design stage?
Technically, the variance of \(\bar{y}_{PS}\) is larger than the variance of \(\bar{y}_{str}\) because the \(n_h\) are random
Also, stratifying at the design stage guards against a fluke “bad sample”
If the strata are fixed at the design stage (\(n_h\) are fixed), we have the usual formula: \[V(\bar{y}_{str}) = \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\]
With poststratification, the \(n_h\) are random variables, and we can use the law of total variance to get: \[\begin{aligned} V(\bar{y}_{PS}) &= E[ V(\bar{y}_{PS} | n_1, n_2, \dots, n_H)] + V[ E(\bar{y}_{PS} | n_1, n_2, \dots, n_H)] \\ &= E\left[ \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\right] + \underbrace{V(\bar{y}_U)}_{\text{\small (= 0)}} = E\left[\sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{n_h} - \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{N_h}\right]\\ &= \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 S_h^2 E\left(\frac{1}{n_h}\right) - \sum_{h=1}^H \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{N_h} \quad (\text{\small only random variable is }n_h)\\ &= \dots \text{using an approximation for } E(1/n_h) \text{ and some algebra}\dots \\ &\approx \underbrace{\sum_{h=1}^H \frac{N_h}{N}\left(1-\frac{n}{N}\right)\frac{S_h^2}{n}}_{\text{variance under proportional allocation!}} + \underbrace{\sum_{h=1}^H \left(1-\frac{N_h}{N} \right) \frac{S_h^2}{n^2}}_{\text{extra variability due to $n_h$ being random}} \end{aligned}\]
\(\displaystyle V(\bar{y}_{PS}) \approx \textcolor{red}{\sum_{h=1}^H \frac{N_h}{N}\left(1-\frac{n}{N}\right)\frac{S_h^2}{n}} + \textcolor{blue}{\sum_{h=1}^H \left(1-\frac{N_h}{N} \right) \frac{S_h^2}{n^2}}\)
Usually the 2nd term is really small compared to the first term
Thus, variance of the poststratified mean (of an SRS) is usually effectively the same as variance of the stratified mean under proportional allocation!
This holds as long as the total sample size \(n\) is “large” (and it is in most surveys)
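To get a feel for the relative size of the two terms, they can be evaluated for the campus example. This sketch makes an assumption that is not in the lecture: the unknown \(S_h^2\) are replaced by plug-in sample values \(s_h^2 = \frac{n_h}{n_h - 1}\hat{p}_h(1-\hat{p}_h)\) for the binary "from Ohio" outcome.

```python
import math

N_h = [46815, 11404, 3224]   # known post-stratum totals
n_h = [67, 23, 10]           # observed post-stratum sample sizes
ohio = [49, 14, 4]           # sampled students from Ohio in each post-stratum
N, n = sum(N_h), sum(n_h)

# ASSUMPTION: plug in sample variances of the 0/1 outcome for the unknown S_h^2
p_hat = [x / m for x, m in zip(ohio, n_h)]
s2 = [m / (m - 1) * p * (1 - p) for p, m in zip(p_hat, n_h)]

# First term: variance under proportional allocation
term1 = sum(Nh / N * (1 - n / N) * v / n for Nh, v in zip(N_h, s2))
# Second term: extra variability because the n_h are random
term2 = sum((1 - Nh / N) * v / n**2 for Nh, v in zip(N_h, s2))

se_ps = math.sqrt(term1 + term2)
print(f"term 1 = {term1:.6f}, term 2 = {term2:.6f}, ratio = {term2 / term1:.3f}")
print(f"approximate SE of the poststratified proportion = {se_ps:.4f}")
```

Even with \(n\) as small as 100, the second term is only a few percent of the first in this example.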
Poststratification (or a related weight adjustment method) is very commonly used in household surveys
Often don’t have demographic info on the frame, or it is limited
Do have access to population counts of people based on demographics such as age/race groups from the Census or other source
Thus, use age and race (cross-classified) to post-stratify
Importantly – poststratification can be used with complex survey designs (not just SRS)
Poststratification
Often we can get the population totals for a set of variables – age, race, etc. – but only the marginal counts
A method called raking can be used to adjust the weights to achieve the correct (true) marginal totals
Also called “iterative proportional fitting”
In raking you iteratively adjust the sample weights to match each set of marginal totals, going back and forth between margins (variables) until convergence (until all margins “match” the truth)
This is easiest seen in an example…
MARGINS (known population totals; the level-by-residency cell counts are unknown):
| Level | Residency: OH | Residency: Non-OH | Total |
|---|---|---|---|
| Undergrad | ? | ? | 46815 |
| Grad | ? | ? | 11404 |
| Professional | ? | ? | 3224 |
| Total | 42191 | 19252 | 61443 |

SAMPLE DATA (counts):
| Level | Residency: OH | Residency: Non-OH | Total |
|---|---|---|---|
| Undergrad | 49 | 18 | 67 |
| Grad | 14 | 9 | 23 |
| Professional | 4 | 6 | 10 |
| Total | 67 | 33 | 100 |
Start with the row variable (level), and adjust the sample weights via poststratification
Now totals for level are right – but totals for the other variable (residency) are not
So, adjust the weights via poststratification using the other variable (residency)
| Group | Sample Count | Original Weight | Raked Weight |
|---|---|---|---|
| Undergrad / Ohio | 49 | 614.4 | 694.615 |
| Grad / Ohio | 14 | 614.4 | 491.583 |
| Professional / Ohio | 4 | 614.4 | 318.189 |
| Undergrad / Not Ohio | 18 | 614.4 | 709.937 |
| Grad / Not Ohio | 9 | 614.4 | 502.427 |
| Professional / Not Ohio | 6 | 614.4 | 325.208 |
Notice that now we will get the correct totals for level and for residency!
Ex: Number of students from Ohio = 49 \(\times\) 694.615 + 14 \(\times\) 491.583 + 4 \(\times\) 318.189 = 42191 ✓
Ex: Number of undergrads = 49 \(\times\) 694.615 + 18 \(\times\) 709.937 = 46815 ✓
But, we are not guaranteed to get the correct level-by-residency totals
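The back-and-forth adjustment can be sketched in Python. This is a minimal illustration of iterative proportional fitting on the example data, not production code: the function and variable names are made up, and the stopping rule (level margins within 1e-6 after a full cycle) is an arbitrary choice.

```python
N, n = 61443, 100
base_weight = N / n   # original SRS weight

# Sample counts by (level, residency) and the known population margins
counts = {
    ("Undergrad", "OH"): 49, ("Undergrad", "Non-OH"): 18,
    ("Grad", "OH"): 14, ("Grad", "Non-OH"): 9,
    ("Professional", "OH"): 4, ("Professional", "Non-OH"): 6,
}
level_totals = {"Undergrad": 46815, "Grad": 11404, "Professional": 3224}
residency_totals = {"OH": 42191, "Non-OH": 19252}

weights = {cell: base_weight for cell in counts}   # one weight per cell (equal within a cell)

def margin(position, category):
    """Weighted total for one category of the variable at tuple index `position`."""
    return sum(weights[c] * counts[c] for c in counts if c[position] == category)

for iteration in range(1, 101):
    # One cycle: poststratify on level (index 0), then on residency (index 1)
    for position, targets in ((0, level_totals), (1, residency_totals)):
        factors = {cat: total / margin(position, cat) for cat, total in targets.items()}
        for cell in weights:
            weights[cell] *= factors[cell[position]]
    # Residency margins are now exact; stop once the level margins also match
    worst = max(abs(margin(0, lev) - tot) for lev, tot in level_totals.items())
    if worst < 1e-6:
        break

for cell, w in weights.items():
    print(f"{cell[0]:<12} / {cell[1]:<6} raked weight = {w:.3f}")
print(f"converged after {iteration} cycles")
```

With only two weakly associated variables the loop settles after a handful of cycles; the cap of 100 cycles is a guard for cases where raking converges slowly or not at all.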
Pros
Can be done with more than two variables
Used extensively in U.S. household surveys where Census or other source can provide high quality totals
Used even when totals are not technically “known” but rather estimated with high precision (and with higher precision than fully crossed cells needed for poststratification)
Cons
Does not always converge… diagnostics are often used to make sure adjustments are not too severe
Can cause inflation of variances if variables used for poststratification are not related to the survey outcomes of interest
Importantly – raking can be used with complex survey designs (not just SRS)
Theoretically, you could use a lot of variables for poststratification and raking, but in practice this can lead to very small (or empty) groups and extreme weight adjustments
Options:
Collapse categories in the variable(s) used for poststratification/raking (ad hoc solution)
Use a model-based method that can handle more poststrata/raking variables and ensure groupings are not too small or weight adjustments are not too extreme
Poststratification and raking are also commonly used for nonresponse adjustment and to adjust for undercoverage, so we will come back to them later in the course
Both poststratification and raking fall under an umbrella method called weight calibration – you may see this terminology used to describe your chosen group project surveys