PUBHBIO 7225 Lecture 15
Topics
Combining Design Features
Calculating Sample Weights
Activities
Assignments
Generally speaking, most surveys combine several of the design features we have covered:
Simple random sampling (without replacement)
Stratification
Cluster sampling
Probability Proportional to Size sampling
Many large-scale surveys are stratified multistage designs where:
PSUs are grouped into strata
PSUs are sampled (often via PPS) independently within each stratum
SSUs are sampled (often via SRS) from within each selected PSU
Tertiary sampling units (TSUs) are sampled from within each selected SSU… etc.
Technically, you could assemble the various formulae we have used for each of these different sampling techniques to (hand-)calculate point estimates and variance estimates for population-level quantities.
Goal: Estimate percent of beds using netting (Malaria prevention) in rural Gambia
(D’Alessandro et al. (1994). Nationwide survey of bednet use in rural Gambia. Bulletin of the World Health Organization, 72, 391-394.)
Stage 1:
Stratify districts by region: Western, Central, Eastern
Take a PPS sample of 5 districts within each stratum, with population size in each district as the size variable (from national census data)
Stage 2:
Stratify villages by whether or not they had a primary health care (PHC) facility
Take a PPS sample of 2 villages within each stratum, with population size in each village as the size variable (from national census data)
Stage 3:
Select 6 compounds per village via SRS
Record the number of beds and how many had nets within each compound
PSU = ? SSU = ? TSU = ?
Let’s draw this design:
To estimate the total number of beds in rural Gambia that use netting, start from the last stage and work backwards:
1. Record the total number of beds with nets for each compound (TSU)
2. Estimated total number of beds with nets for each village (SSU): (# compounds in village) \(\times\) (average # beds w/nets per compound)
3. Estimated total number of beds with nets for PHC villages: \[\sum_{\text{selected villages in PHC stratum}} \frac{\text{estimated total number of beds with nets for the village (step 2)}}{\text{probability of selection from PPS}}\] Repeat for non-PHC villages
4. Estimated total number of beds with nets in each district (PSU): \[\text{estimated total in PHC villages (step 3)} + \text{estimated total in non-PHC villages (step 3)}\]
5. Estimated total number of beds with nets in Eastern region: \[\sum_{\text{selected districts in Eastern stratum}} \frac{\text{estimated total number of beds with nets for the district (step 4)}}{\text{probability of selection from PPS}}\] Repeat for Central and Western regions
6. Estimated overall total: \(\text{Estimated total in Eastern} + \text{Estimated total in Central} + \text{Estimated total in Western}\) (each from step 5)
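The backwards calculation above can be sketched in code for a single district. This is only an illustration under stated assumptions: every number below is made up (none comes from the Gambia survey), and the helper names `village_total`, `pps_total`, and `stratum_total` are hypothetical.

```python
def village_total(n_compounds, nets_per_sampled_compound):
    """Step 2: (# compounds in village) x (average # beds with nets per sampled compound)."""
    mean_nets = sum(nets_per_sampled_compound) / len(nets_per_sampled_compound)
    return n_compounds * mean_nets


def pps_total(estimated_totals, selection_probs):
    """Steps 3 and 5: sum of (estimated unit total / probability of selection)."""
    return sum(t / p for t, p in zip(estimated_totals, selection_probs))


def stratum_total(villages):
    """Step 3 for one village stratum (PHC or non-PHC) within a district."""
    totals = [village_total(v["n_compounds"], v["nets"]) for v in villages]
    probs = [v["prob"] for v in villages]
    return pps_total(totals, probs)


# Made-up data for one district: 2 sampled villages per stratum,
# 6 sampled compounds per village, "prob" = assumed PPS selection probability.
phc_villages = [
    {"n_compounds": 40, "nets": [2, 3, 1, 0, 2, 4], "prob": 0.5},
    {"n_compounds": 25, "nets": [1, 1, 0, 2, 0, 2], "prob": 0.25},
]
non_phc_villages = [
    {"n_compounds": 30, "nets": [0, 1, 0, 0, 2, 0], "prob": 0.5},
    {"n_compounds": 20, "nets": [1, 0, 0, 1, 0, 1], "prob": 0.25},
]

# Step 4: the district total is the sum of its two village-stratum totals.
district_total = stratum_total(phc_villages) + stratum_total(non_phc_villages)
print(district_total)
```

Steps 5 and 6 would reuse `pps_total` on the district totals within each region and then add the three regional totals.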
We can calculate sampling weights to avoid the (ugly) process of working backwards through the stages to get to the ultimate total
Fortunately, the principles of conditional probability make this straightforward
Start with the probability of selection: \[\begin{aligned} &P(\text{TSU }k\text{ in SSU }j\text{ in PSU }i\text{ selected}) \\ &\quad= P(\text{PSU }i\text{ selected}) \times P(\text{SSU }j\text{ selected }|\text{ PSU }i\text{ selected}) \times P(\text{TSU }k\text{ selected }|\text{ PSU }i\text{ and SSU }j\text{ selected})\\ &\pi_{ijk} = \pi_{i} \times \pi_{j|i} \times \pi_{k|ij} \end{aligned}\]
The sampling weight is the inverse of the selection probability, \(w_{ijk} = 1/\pi_{ijk}\); these weights are often called the base weights
Often, post-survey adjustments to the weights are done to provide the final weights used for analysis
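As a rough sketch of the idea, the base weight can be computed from the per-stage conditional selection probabilities. The function name `base_weight` and the example probabilities are assumptions for illustration only.

```python
from math import prod


def base_weight(stage_probs):
    """Inverse of the overall selection probability.

    stage_probs holds one conditional selection probability per stage,
    e.g. [pi_i, pi_j_given_i, pi_k_given_ij].
    """
    if not all(0 < p <= 1 for p in stage_probs):
        raise ValueError("each selection probability must be in (0, 1]")
    return 1.0 / prod(stage_probs)


# Hypothetical three-stage probabilities (not from any real survey).
w = base_weight([0.5, 0.25, 0.5])
print(w)
```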
Point estimates can be obtained using just the weights
Variance estimates (with-replacement approximation) require full design information:
weights
stratification information
clustering information
Sampling weights only need to be calculated for the units that end up in the sample
Sampling weights for units not in the sample are 0
Suppose I have the following two-stage design:
Stage 1: select 3 geographic regions via PPS (based on population size)
Stage 2: stratify people in the regions into two strata, younger (19-64) and older (65+), and select 10 people per stratum via SRS
In this design, PSU = geographic region and SSU = person
Total sample size = 3 regions \(\times\) 2 strata \(\times\) 10 people per stratum = 60
Let’s calculate the sampling weights for the 60 sampled people
My 3 sampled regions (PSUs) are shown in the table below
Total population size (all regions) = 10,000
| Region | Population | # Young in population | # Older in population | Sample size | # Young sampled | # Older sampled |
|---|---|---|---|---|---|---|
| 1 | 1000 | 800 | 200 | 20 | 10 | 10 |
| 2 | 850 | 400 | 450 | 20 | 10 | 10 |
| 3 | 150 | 110 | 40 | 20 | 10 | 10 |
For a Young person in Region (PSU) 1
First stage (PPS): \(P(\text{PSU }1\text{ selected}) = \pi_1 = n \psi_1 = n \frac{M_1}{\sum_{i=1}^N M_i} = 3 \times \frac{1000}{10000} = 0.3\)
Second stage (SRS within stratum): \[P(\text{Young person }j\text{ selected }| \text{ PSU }1\text{ selected}) = \pi_{j|1}^{Y} = \frac{\text{\# of Younger people selected in PSU } 1}{\text{\# of Younger people in PSU }1} = \frac{10}{800} = 0.0125\]
Overall selection probability = product of first and second stage probabilities \[P(\text{Young person }j\text{ in PSU }1\text{ selected}) = \pi_{1j}^{Y} = \pi_1 \times \pi_{j|1}^{Y} = 0.3 \times 0.0125 = 0.00375\]
Sampling Weight: \(\displaystyle w_{1j}^{Y} = \frac{1}{\pi_{1j}^{Y}} = \frac{1}{0.00375} = 266.67\)
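The same two-stage calculation can be repeated for every region and age stratum in the table. The sketch below uses the counts from the example; the variable and function names are illustrative choices, not part of the lecture.

```python
TOTAL_POPULATION = 10_000   # all regions in the population
N_PSUS_SAMPLED = 3          # regions selected via PPS
N_PER_STRATUM = 10          # people selected via SRS in each age stratum

# Stratum population counts for the 3 sampled regions (from the table).
regions = {
    1: {"young": 800, "older": 200},
    2: {"young": 400, "older": 450},
    3: {"young": 110, "older": 40},
}


def sampling_weight(region, stratum):
    """Weight = 1 / (first-stage PPS probability x second-stage SRS probability)."""
    region_pop = sum(regions[region].values())
    pi_psu = N_PSUS_SAMPLED * region_pop / TOTAL_POPULATION
    pi_person = N_PER_STRATUM / regions[region][stratum]
    return 1.0 / (pi_psu * pi_person)


weights = {
    (region, stratum): sampling_weight(region, stratum)
    for region in regions
    for stratum in ("young", "older")
}
for key, w in weights.items():
    print(key, round(w, 2))
```

Every sampled person in the same region and stratum shares the same weight, so only six distinct weights are needed for the 60 sampled people.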
You can alternatively calculate the weights for each stage and multiply them, because (for a general case): \[\pi_{ijk} = \pi_{i} \times \pi_{j|i} \times \pi_{k|ij}\] and thus \[w_{ijk} = \frac{1}{\pi_{ijk} }= \frac{1}{\pi_{i} \times \pi_{j|i} \times \pi_{k|ij}} = \frac{1}{\pi_{i}} \times \frac{1}{\pi_{j|i}} \times \frac{1}{\pi_{k|ij}} = w_i \times w_{j|i} \times w_{k|ij}\]
For the example, Young person in Region (PSU) 1:
PSU 1: \(w_1 = \frac{1}{\pi_1} = \frac{1}{0.3} = 3.3333\)
Young person \(j\) within PSU 1: \(w_{j|1}^{Y} = \frac{1}{\pi_{j|1}^{Y}} = \frac{1}{0.0125} = 80\)
Overall weight: \(w_{1j}^{Y}=w_1\times w_{j|1}^{Y} = 3.3333 \times 80 = 266.67\) (same answer)
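A quick numerical check of the stage-by-stage approach, using the Young stratum in Region 1. The helper name `stage_weights` is hypothetical; the numbers come from the example.

```python
from math import prod


def stage_weights(stage_probs):
    """Return the per-stage weights (1/probability) and their product."""
    per_stage = [1.0 / p for p in stage_probs]
    return per_stage, prod(per_stage)


pi_psu = 3 * 1000 / 10_000   # first stage (PPS): 0.3
pi_person = 10 / 800         # second stage (SRS among Young): 0.0125

(w_psu, w_person), w_overall = stage_weights([pi_psu, pi_person])

# Multiplying stage weights matches inverting the overall probability.
w_direct = 1.0 / (pi_psu * pi_person)
print(round(w_psu, 4), round(w_person, 4), round(w_overall, 2), round(w_direct, 2))

# Sanity check: the 10 second-stage weights add up to the 800 Young people in PSU 1.
young_in_psu_1 = 10 * w_person
```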
Suppose we have a complex design involving:
Dividing the population into \(H\) strata: \(h=1,\dots,H\)
Sampling \(n_h\) PSUs within stratum \(h\): \(i=1,\dots,n_h\)
Sampling \(m_{hi}\) SSUs within PSU \(i\) within stratum \(h\): \(j=1,\dots, m_{hi}\)
\(w_{hij}\) = weight for SSU \(j\) in PSU \(i\) in stratum \(h\)
\(y_{hij}\) = outcome for SSU \(j\) in PSU \(i\) in stratum \(h\)
Estimated total = sum of the product of the sampling weight \((w_{hij})\) and the outcome \((y_{hij})\) over all the sampled SSUs within each sampled PSU within each stratum: \[\hat{t}_{HT} = \underbrace{\sum_{h=1}^H}_{\text{strata }} \underbrace{\sum_{i=1}^{n_h}}_{\text{ PSUs }} \underbrace{\sum_{j=1}^{m_{hi}}}_{\text{ SSUs }} w_{hij} y_{hij}\]
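The triple sum can be mirrored directly in code. This is a minimal sketch on made-up data; the nested-dictionary layout and the name `estimated_total` are assumptions chosen to match the indices \(h\), \(i\), \(j\).

```python
def estimated_total(sample):
    """Weighted total: sum of w_hij * y_hij over strata, PSUs, and SSUs.

    sample[h][i] is a list of (w_hij, y_hij) pairs for the SSUs
    sampled in PSU i of stratum h.
    """
    return sum(
        w * y
        for psus in sample.values()   # strata h
        for ssus in psus.values()     # PSUs i within stratum h
        for w, y in ssus              # SSUs j within PSU i
    )


# Made-up weights and binary outcomes, for illustration only.
sample = {
    "stratum_1": {"psu_1": [(10.0, 1), (10.0, 0)], "psu_2": [(20.0, 1)]},
    "stratum_2": {"psu_1": [(5.0, 1), (5.0, 1)]},
}
t_hat = estimated_total(sample)
print(t_hat)
```

Because only the weights and outcomes enter the sum, the stratum and PSU labels do not change the point estimate; they matter for variance estimation.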
Estimated mean would use the ratio estimator or the unbiased estimator – what are the denominators?
Weighting a Complex Survey