PUBHBIO 7225 Lecture 4
Topics
Activities
Assignments
Simple random sample without replacement (SRSWOR or just SRS)
Simple random sample with replacement (SRSWR)
By definition, SRS (and SRSWR) are EPSEM sampling designs
EPSEM = Equal Probability of SElection Method = every unit has equal probability of selection
But this isn’t the whole story; other sampling designs can have this property
Before defining SRS formally, we need some notation:
\(U\) = the finite population (i.e., the collection of all sampling units in the population)
\(N\) = size of the population (number of units)
\(\mathcal{S}\) = a particular sample
\(n\) = size of the sample
\(y_i\) = characteristic of interest for the \(i\)th unit (outcome you’re interested in)
\(Z_i\) = selection indicator, \(Z_i = \begin{cases} 1, & \text{if unit $i$ is in the sample}\\ 0, & \text{otherwise} \end{cases}\)
\(\pi_i\) = probability of selection/inclusion probability for population unit \(i\)
\(w_i\) = sampling weight for unit \(i\)
In design-based theory for sampling, we do not assign a distribution to the outcome being measured
Instead, the \(\mathbf{y_i}\)’s are considered fixed (but unknown) – hence lower case
Randomness instead comes from the selection indicators, \(\mathbf{Z_i}\)
Selection indicator: \(Z_i \sim \text{Bernoulli}(\pi_i)\)
\(\displaystyle \pi_i = P(Z_i=1) = P(\text{select unit } i \text{ into the sample}) = \frac{n}{N}\)
This probability comes from the definition of an SRS(WOR):
There are \({N \choose n}\) possible SRS(WOR) samples
Assume unit \(i\) is in the sample of size \(n\)
The other \((n-1)\) units in the sample must come from the remaining \(N-1\) population units
Number of possible samples of size \((n-1)\) that can come from the remaining \((N-1)\) units is \({N-1 \choose n-1}\)
Therefore,
\(\displaystyle \pi_i = P(Z_i=1) = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}\)
In Activity 3.1, we were choosing \(n=2\) cats out of \(N=3\)
We calculated: \(P(\)Leo in the sample\() = 2/3 = 0.667\)
Intuitively, this would be the same for all cats (chosen with equal probability)
Thus, \(\pi_i = 2/3\) for \(i = 1, 2, 3\) (for all cats)
This matches the formula: \(\displaystyle \pi_i = \frac{n}{N} = \frac{2}{3}\)
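A quick enumeration check of this claim (a sketch; only "Leo" is named in the activity, so the other cat names are made up):

```python
# Verify pi_i = n/N by listing all C(N, n) equally likely SRS(WOR) samples
# for the Activity 3.1 setup: n = 2 cats chosen from N = 3.
from itertools import combinations
from fractions import Fraction

cats = ["Leo", "CatB", "CatC"]   # hypothetical names for the other two cats
n = 2

samples = list(combinations(cats, n))   # all C(3, 2) = 3 possible samples
pi_leo = Fraction(sum("Leo" in s for s in samples), len(samples))
print(pi_leo)   # 2/3, matching pi_i = n/N
```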
Basic quantities we want to estimate:
| Quantity | Estimand (Truth) | Estimate |
|---|---|---|
| Mean | \(\displaystyle \bar{y}_U=\frac{1}{N} \sum_{i=1}^N y_i\) | \(\displaystyle \bar{y}= \frac{1}{n} \sum_{i\in \mathcal{S}} y_i\) |
| Total | \(\displaystyle t = \sum_{i=1}^N y_i\) | \(\hat{t}= N \bar{y}\) |
| Variance (of \(y\)) | \(\displaystyle S^2=\frac{1}{N-1} \sum_{i=1}^N (y_i - \bar{y}_U)^2\) | \(\displaystyle s^2=\frac{1}{n-1} \sum_{i\in \mathcal{S}} (y_i - \bar{y})^2\) |
| Variance (computational form) | \(\displaystyle S^2=\frac{1}{N-1} \left(\sum_{i=1}^N y_i^2 - N \bar{y}_U^2 \right)\) | \(\displaystyle s^2=\frac{1}{n-1} \left(\sum_{i\in \mathcal{S}} y_i^2 - n\bar{y}^2 \right)\) |
Formulas apply regardless of distribution of \(y\)
\(y\) is not a random variable!
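The table's formulas can be sketched directly in code (the data here are made up for illustration):

```python
# Population quantities (estimands) vs. sample-based estimates under SRS.
y_pop = [3, 7, 4, 9, 2, 5]                 # hypothetical finite population, N = 6
N = len(y_pop)
ybar_U = sum(y_pop) / N                    # population mean
t = sum(y_pop)                             # population total
S2 = sum((y - ybar_U) ** 2 for y in y_pop) / (N - 1)   # population variance

sample = [3, 9, 2]                         # one possible SRS of size n = 3
n = len(sample)
ybar = sum(sample) / n                     # sample mean
t_hat = N * ybar                           # estimated total
s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)    # sample variance

# Computational form agrees with the definitional form:
S2_alt = (sum(y**2 for y in y_pop) - N * ybar_U**2) / (N - 1)
```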
For binary \(y\), the mean of \(y\) is a proportion: \(\bar{y}_U = p\)
This is estimated with the sample proportion, \(\hat{p}\)
The variance of \(y\) can be rewritten in terms of \(p\): \[\begin{flalign*} \text{Binary }y: \quad S^2 &= \frac{1}{N-1} \sum_{i=1}^N (y_i-\bar{y}_U)^2 = \frac{1}{N-1} \sum_{i=1}^N (y_i-p)^2 &\\ &= \frac{1}{N-1} \left( \sum_{i=1}^N y_i^2 - 2p\sum_{i=1}^N y_i + Np^2 \right) &\\ &= \frac{Np - 2Np^2 + Np^2}{N-1} \qquad \text{b/c } \sum_{i=1}^N y_i^2 = \sum_{i=1}^N y_i = Np &\\ &=\frac{Np - Np^2}{N-1} = \frac{Np(1-p)}{N-1}= \frac{N}{N-1}p(1-p) \end{flalign*}\]
Basic quantities we want to estimate:
| Quantity | Estimand (Truth) | Estimate |
|---|---|---|
| Mean | \(\displaystyle \bar{y}_U=\frac{1}{N} \sum_{i=1}^N y_i\) | \(\displaystyle \bar{y}= \frac{1}{n} \sum_{i\in \mathcal{S}} y_i\) |
| Total | \(\displaystyle t = \sum_{i=1}^N y_i\) | \(\hat{t}= N \bar{y}\) |
| Variance (of \(y\)) | \(\displaystyle S^2=\frac{1}{N-1} \sum_{i=1}^N (y_i - \bar{y}_U)^2\) | \(\displaystyle s^2=\frac{1}{n-1} \sum_{i\in \mathcal{S}} (y_i - \bar{y})^2\) |
| Mean, binary \(y\) (proportion) | \(p\) | \(\displaystyle \hat{p}= \frac{1}{n} \sum_{i\in \mathcal{S}} y_i\) |
| Variance, binary \(y\) | \(\displaystyle S^2 = \frac{N}{N-1} p(1-p)\) | \(\displaystyle s^2 = \frac{n}{n-1} \hat{p}(1-\hat{p})\) |
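The binary-\(y\) shortcut can be verified on a small made-up 0/1 population:

```python
# Check S^2 = N/(N-1) * p(1-p) against the definitional variance formula.
y_pop = [1, 0, 1, 1, 0]        # hypothetical binary population, N = 5
N = len(y_pop)
p = sum(y_pop) / N             # population proportion (the mean of binary y)
S2_def = sum((y - p) ** 2 for y in y_pop) / (N - 1)   # definitional S^2
S2_binary = N / (N - 1) * p * (1 - p)                 # shortcut formula
```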
Remember: \(y\) is not a random variable!
Random selection of units into the sample implies that \((\bar{y}, \hat{t}, \hat{p})\) are random variables
| Estimate | Expected Value | Variance | Variance Estimate |
|---|---|---|---|
| Mean \((\bar{y})\) | \(E(\bar{y})= \bar{y}_U\) | \(\displaystyle V(\bar{y})= \left(1-\frac{n}{N}\right)\frac{S^2}{n}\) | \(\displaystyle \widehat{V}(\bar{y})= \left(1-\frac{n}{N}\right)\frac{s^2}{n}\) |
| Total \((\hat{t})\) | \(E(\hat{t})=t\) | \(\displaystyle V(\hat{t})=N^2 \left(1-\frac{n}{N}\right)\frac{S^2}{n}\) | \(\displaystyle \widehat{V}(\hat{t})=N^2 \left(1-\frac{n}{N}\right)\frac{s^2}{n}\) |
| Proportion \((\hat{p})\) | \(E(\hat{p})=p\) | \(\displaystyle V(\hat{p})= \left(\frac{N-n}{N-1}\right) \frac{p(1-p)}{n}\) | \(\displaystyle \widehat{V}(\hat{p}) = \left(1-\frac{n}{N}\right)\frac{\hat{p}(1-\hat{p})}{n-1}\) |
| Variance \((s^2)\) | \(E(s^2)=S^2\) | (we will not cover this) | |
These expectations and variances come from the \(Z_i\)
\(y_i\) is considered fixed
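These identities can be checked exactly by brute force: enumerate every equally likely SRS from a tiny made-up population and compute the sampling distribution of \(\bar{y}\) directly (a sketch, not part of the lecture):

```python
# E(ybar) = ybar_U and V(ybar) = (1 - n/N) S^2 / n, verified by enumeration.
from itertools import combinations

y_pop = [3, 7, 4, 9, 2]        # hypothetical population, N = 5
N, n = len(y_pop), 3
ybar_U = sum(y_pop) / N
S2 = sum((y - ybar_U) ** 2 for y in y_pop) / (N - 1)

means = [sum(s) / n for s in combinations(y_pop, n)]   # all C(5, 3) = 10 samples
E_ybar = sum(means) / len(means)                       # exact design expectation
V_ybar = sum((m - E_ybar) ** 2 for m in means) / len(means)  # exact design variance
V_formula = (1 - n / N) * S2 / n                       # table formula with fpc
```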
In your previous statistics classes: \(\displaystyle V(\bar{y}) = \frac{S^2}{n}\)
In this class: \(\displaystyle V(\bar{y})= \textcolor{blue}{\left(1-\frac{n}{N}\right)} \frac{S^2}{n}\)
Extra piece: \(\displaystyle \left(1-\frac{n}{N}\right)\) = finite population correction (fpc)
When you sample a large proportion of the population without replacement, you get a reduction in variance
If you sample all of the population (i.e., \(n = N\)), your information is “perfect”
In that case, \(V(\bar{y}) = 0\), since every “sample” of size \(n=N\) produces the same \(\bar{y}\)!
For samples that are small relative to the population size \((n \ll N)\), fpc \(\approx\) 1 (and can be ignored)
Many surveys have target populations that are very large
Even the U.S. government doesn’t have the resources to sample a large fraction of many populations of interest!
Example: National surveys conducted by U.S. government (e.g., NHANES, BRFSS)
Target population = all U.S. residents (sometimes exclude small subsets, e.g., incarcerated people)
US adult population \(\approx\) 250,000,000
Even if sample \(n =\) 100,000 people, that’s only 0.04% of the population!
fpc = \(1 - \frac{n}{N} = 1-\frac{100,000}{250,000,000} = 1 - 0.0004 = 0.9996 \approx 1\)
\(V(\bar{y})=\left(1-\frac{n}{N}\right)\frac{S^2}{n} \approx \frac{S^2}{n}\)
Not going to get much variance reduction due to fpc!
The Effect of the FPC
Re-write \(\bar{y}\) in terms of \(Z_i\) (and over the whole population): \[\bar{y}= \sum_{i \in \mathcal{S}} \frac{y_i}{n} = \sum_{i \in \mathcal{S}} \left(1 \times \frac{y_i}{n}\right) + \sum_{i \notin \mathcal{S}} \left(0 \times \frac{y_i}{n}\right) = \sum_{i=1}^N Z_i \frac{y_i}{n}\] Since \(Z_i \sim \text{Bernoulli}\left(\pi_i=\frac{n}{N}\right)\), we have \(E(Z_i) =\pi_i = \frac{n}{N}\)
So we can find the expected value and variance of the sample mean \(\bar{y}\) as:
\(\displaystyle E(\bar{y}) = E\left(\sum_{i=1}^N Z_i \frac{y_i}{n}\right) = \sum_{i=1}^N E(Z_i) \frac{y_i}{n} = \sum_{i=1}^N \frac{n}{N}\cdot\frac{y_i}{n} = \frac{1}{N}\sum_{i=1}^N y_i = \bar{y}_U\)
To find \(V(\bar{y})\) we will need both \(V(Z_i)\) and \(\text{Cov}(Z_i,Z_j)\)
Since \(Z_i^2=Z_i\) (why?), we have that: \(\displaystyle E(Z_i^2)=E(Z_i) = \frac{n}{N}\)
Thus, \[\begin{aligned}
V(Z_i) &= E(Z_i^2) - [E(Z_i)]^2 = \frac{n}{N} - \left(\frac{n}{N}\right)^2 = \frac{n}{N} \left(1-\frac{n}{N}\right)= \pi_i(1-\pi_i) &
\end{aligned}\] (which is just the usual Bernoulli variance)
Since \(\text{Cov}(Z_i,Z_j) = E(Z_iZ_j) - E(Z_i)E(Z_j)\) we just need to find \(E(Z_iZ_j)\):
Note that \(Z_i\) and \(Z_j\) are not independent – if we know that unit \(j\) is in the sample, we know something about whether unit \(i\) is in the sample
\(\displaystyle E(Z_iZ_j) = P(Z_i=1,\, Z_j=1) = P(Z_i=1)\,P(Z_j=1 \mid Z_i=1) = \frac{n}{N} \cdot \frac{n-1}{N-1}\)
\[\begin{flalign} \text{Cov}(Z_i,Z_j) &= E(Z_iZ_j) - E(Z_i)E(Z_j) =\frac{n-1}{N-1}\frac{n}{N} - \frac{n}{N} \frac{n}{N} =\frac{n}{N} \left[\frac{n-1}{N-1}-\frac{n}{N} \right] & \\ &=\frac{n}{N} \left[\frac{(n-1)N-n(N-1)}{N(N-1)} \right] = \frac{n}{N} \left[\frac{n-N}{N(N-1)} \right] = \frac{n}{N} \left[\frac{-(N-n)}{N(N-1)} \right] & \\ &= -\frac{n}{N} \left(\frac{1}{N-1}\right) \left(1-\frac{n}{N}\right) & \end{flalign}\]
This negative covariance between \(Z_i\) and \(Z_j\) is the source of the finite population correction
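Both \(V(Z_i)\) and this covariance can be confirmed by enumerating the indicators over all possible samples (a sketch with arbitrary small \(N\) and \(n\)):

```python
# Enumerate all SRS(WOR) samples and compute moments of the Z_i directly.
from itertools import combinations

N, n = 5, 2
pi = n / N
samples = list(combinations(range(N), n))   # all C(5, 2) = 10 samples
Z1 = [1 if 0 in s else 0 for s in samples]  # selection indicator for unit 0
Z2 = [1 if 1 in s else 0 for s in samples]  # selection indicator for unit 1

def E(x):
    return sum(x) / len(x)

V_Z1 = E([z * z for z in Z1]) - E(Z1) ** 2                # = pi(1 - pi)
cov = E([a * b for a, b in zip(Z1, Z2)]) - E(Z1) * E(Z2)  # negative!
cov_formula = -(n / N) * (1 / (N - 1)) * (1 - n / N)
```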
\[\begin{flalign} V(\bar{y}) &= V \left( \sum_{i=1}^N Z_i \frac{y_i}{n} \right) = \frac{1}{n^2} V \left( \sum_{i=1}^N Z_i y_i \right)=\frac{1}{n^2} \text{Cov}\left(\sum_{i=1}^N Z_i y_i, \sum_{j=1}^N Z_j y_j \right) & \\ &=\frac{1}{n^2} \sum_{i=1}^N \sum_{j=1}^N y_i y_j \text{Cov}(Z_i,Z_j) = \frac{1}{n^2} \left[\sum_{i=1}^N y_i^2 V(Z_i) + \sum_{i=1}^N \sum_{j \ne i}^N y_i y_j \text{Cov}(Z_i,Z_j) \right] & \\ &= \frac{1}{n^2} \left[\frac{n}{N} \left(1-\frac{n}{N}\right)\sum_{i=1}^N y_i^2 - \frac{n}{N} \left(\frac{1}{N-1}\right) \left(1-\frac{n}{N}\right) \sum_{i=1}^N \sum_{j \ne i}^N y_i y_j \right] & \\ &= \frac{1}{n}\frac{1}{N} \left(1-\frac{n}{N}\right) \frac{1}{N-1} \left[ (N-1) \sum_{i=1}^N y_i^2 - \sum_{i=1}^N \sum_{j \ne i}^N y_i y_j \right] & \\ &= \frac{1}{n}\frac{1}{N} \left(1-\frac{n}{N}\right) \frac{1}{N-1} \left[N\sum_{i=1}^N y_i^2 - \left(\sum_{i=1}^N y_i\right)^2 \right] \quad \text{b/c } \sum_{i=1}^N \sum_{j \ne i}^N y_i y_j = \left(\sum_{i=1}^N y_i\right)^2 - \sum_{i=1}^N y_i^2 & \\ &= \frac{1}{n} \left(1-\frac{n}{N}\right) \frac{1}{N-1} \left[\sum_{i=1}^N y_i^2 - N \bar{y}_U^2 \right] = \left(1-\frac{n}{N}\right)\frac{S^2}{n} \end{flalign}\]
If \(n\), \(N\), and \((N - n)\) are "sufficiently large", then for an unbiased estimator \(\hat{\theta}\), an approximate \((1-\alpha)\)-level confidence interval for the true value \(\theta\) is given by: \[\left( \hat{\theta} - z_{\alpha/2} \sqrt{\hat{V}(\hat{\theta})}, \quad \hat{\theta} + z_{\alpha/2} \sqrt{\hat{V}(\hat{\theta})} \right)\] where \(z_{\alpha/2}\) is the \((1-\alpha/2)\)th percentile of the standard normal distribution
For estimating the mean: \[\left( \bar{y}- z_{\alpha/2} \sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}, \quad \bar{y}+ z_{\alpha/2} \sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}} \right)\]
In practice, often use \(t_{\alpha/2,n-1}\), the \((1-\alpha/2)\)th percentile of a \(t\) distribution with \(n-1\) degrees of freedom, for \(z_{\alpha/2}\) (\(t\) sometimes has better properties than \(z\))
For large samples, \(t_{\alpha/2,n-1} \approx z_{\alpha/2}\)
In smaller samples, using \(t_{\alpha/2,n-1}\) produces a wider CI
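A sketch of this computation with the fpc (the data, \(n\), and \(N\) are made up; the Python standard library has no \(t\) quantile function, so this uses \(z\) — swapping in `scipy.stats.t.ppf(1 - alpha/2, n - 1)` would give the slightly wider \(t\)-based interval):

```python
# z-based (1 - alpha) CI for the population mean under SRS, with fpc.
import math
from statistics import mean, stdev, NormalDist

y = [12.1, 9.8, 11.4, 10.7, 13.0, 8.9, 10.2, 11.8]   # hypothetical SRS
n, N = len(y), 200
alpha = 0.05

ybar = mean(y)
s2 = stdev(y) ** 2
se = math.sqrt((1 - n / N) * s2 / n)        # SE of ybar, including the fpc

z = NormalDist().inv_cdf(1 - alpha / 2)     # ~1.96 for alpha = 0.05
ci = (ybar - z * se, ybar + z * se)
```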
For any sampling design, define:
Sampling weight = \(\displaystyle w_i = \frac{1}{\pi_i}\) = inverse probability of selection
Horvitz-Thompson (HT) estimator (of the total): \[\hat{t}_{HT} = \sum_{i\in \mathcal{S}} \frac{y_i}{\pi_i} = \sum_{i\in \mathcal{S}} w_i y_i\]
Inclusion probability \(\displaystyle \pi_i = \frac{n}{N}\) for all \(i\)
Sampling weight = \(\displaystyle w_i = \frac{1}{\pi_i} = \frac{N}{n}\)
Estimator for the total of \(y\): \[\hat{t}_{HT} = \sum_{i\in \mathcal{S}} \textcolor{red}{w_i} y_i = \sum_{i\in \mathcal{S}} \textcolor{red}{\frac{1}{\pi_i}} y_i = \sum_{i\in \mathcal{S}} \textcolor{red}{\frac{N}{n}} y_i = N \frac{1}{n} \sum_{i\in \mathcal{S}} y_i = N \bar{y} \]
Note that the sum of weights equals the population size: \(\displaystyle \sum_{i \in \mathcal{S}} w_i = \sum_{i \in \mathcal{S}} \frac{N}{n} = n \frac{N}{n} = N\)
Our estimator for the mean of \(y\) can also be written as a function of the weights: \[\bar{y}= \frac{1}{n} \sum_{i\in \mathcal{S}} y_i = \frac{1}{N} \sum_{i\in \mathcal{S}} \frac{N}{n} y_i = \frac{\sum_{i\in \mathcal{S}} \frac{N}{n} y_i}{N} = \frac{\sum_{i\in \mathcal{S}} w_i y_i}{\sum_{i \in \mathcal{S}} w_i}\]
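A sketch of the HT computations under SRS (the population size and data are made up):

```python
# Under SRS every unit has pi_i = n/N, so w_i = N/n for all sampled units.
N = 1000                          # assumed population size
y_sample = [4, 7, 5, 6, 3]        # hypothetical SRS, n = 5
n = len(y_sample)

w = N / n                                   # common sampling weight, 1/pi_i
t_hat = sum(w * y for y in y_sample)        # HT estimate of the total (= N * ybar)
ybar_hat = t_hat / (w * n)                  # weighted mean = ordinary sample mean
```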
For most surveys, sample size is based on being able to estimate a population parameter (mean, proportion, total) with a specified precision
This is in contrast to, say, a clinical trial, where we want to detect a certain effect size
Precision is specified via the Margin of Error (MOE):
Margin of error (MOE) = \(e\) = half the width of a CI
\(P(|\bar{y}-\bar{y}_U| \leq e) = 1 - \alpha\)
From the CI formula we have: \(\displaystyle e = z_{\alpha/2} \sqrt{\left(1-\frac{n}{N}\right)\frac{S^2}{n}}\)
Solving for \(n\) we get: \(\displaystyle n = \frac{z_{\alpha/2}^2 S^2}{e^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}}\)
If \(N\) is very large, this reduces to the SRSWR (infinite-population) formula: \(\displaystyle n = \frac{z_{\alpha/2}^2 S^2}{e^2}\)
Typical application of the sample size formula
Estimating a proportion
Assume the true proportion is the one that yields the largest variance, which is \(p = 0.5\)
Assume MOE of \(e=0.03\)
Assume very large population (so can ignore the FPC)
Resulting calculation yields:
\(\displaystyle n = \frac{z_{\alpha/2}^2 S^2}{e^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}} \approx \frac{z_{\alpha/2}^2\, p(1-p)}{e^2} = \frac{(1.96)^2 (0.5)(0.5)}{(0.03)^2} \approx 1067.1\), so round up to \(n = 1068\) (using 95% confidence, \(z_{0.025} = 1.96\))
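The formula is easy to wrap as a function; the 95% confidence level (\(z = 1.96\)) is an assumption here since the slide does not fix \(\alpha\), and `srs_sample_size` is an illustrative name:

```python
# n = z^2 S^2 / (e^2 + z^2 S^2 / N); passing N = inf drops the fpc term,
# giving the SRSWR / infinite-population formula n = z^2 S^2 / e^2.
import math

def srs_sample_size(S2, e, N=float("inf"), z=1.96):
    return math.ceil(z**2 * S2 / (e**2 + z**2 * S2 / N))

# Worst case p = 0.5 (S^2 ~ p(1-p) = 0.25), MOE e = 0.03, huge population:
n_needed = srs_sample_size(0.25, 0.03)   # ~1068
```

With a finite population, the fpc lowers the requirement: e.g., `srs_sample_size(0.25, 0.03, N=2000)` comes out near 700.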
Choosing a sample size is somewhat like deciding how much food to take on a picnic. You have a rough idea of how many people will attend, but do not know how much food you should have brought until after the picnic is over. You also need to bring extra food to allow for unexpected happenings, such as 2-year-old Freddie feeding a bowl of potato salad to the ducks or cousin Ted bringing along some extra guests. But you do not want to bring too much extra food, or it will spoil and you will have wasted money. Of course, the more picnics you have organized, and the better acquainted you are with the picnic guests, the better you become at bringing the right amount of food. It is comforting to know that the same is true of determining sample sizes — experience and knowledge about the population make you much better at designing surveys.
Quote from p.50 of Sampling: Design and Analysis, 2nd Edition by Sharon Lohr