Two-Stage Cluster Sampling Variance Derivation
PUBHBIO 7225
Derivation of \(V(\hat{t}_{clus})\) Under Two-Stage Cluster Sampling
For two-stage cluster sampling with SRS at both stages:
The variance follows from the iterated variance formula, \(V(X) = V[E(X \mid Y)] + E[V(X \mid Y)]\), together with properties of SRS estimators.
Conditioning is on a realized sample, i.e., on the set of sample selection indicators for the PSUs:
\(Z_i\) = selection indicator, \(Z_i = \begin{cases} 1, & \text{if PSU $i$ is in the sample}\\ 0, & \text{otherwise} \end{cases}\)
Because \(Z_i\) is binary, \(Z_i^2 = Z_i\), so \(E(Z_i) = E(Z_i^2) = n/N\). Then:
\[\begin{aligned} V(\hat{t}_{clus}) = V \left(\sum_{i \in \S} \frac{N}{n} \hat{t}_i \right) &= V \left(\sum_{i = 1}^N Z_i \frac{N}{n} \hat{t}_i \right) \\ &= V \left[ E\left(\sum_{i = 1}^N Z_i \frac{N}{n} \hat{t}_i \bigg| Z_1,\dots,Z_N \right) \right] + E \left[ V\left(\sum_{i = 1}^N Z_i \frac{N}{n} \hat{t}_i\bigg| Z_1,\dots,Z_N \right) \right] \\ &= V \left[ \sum_{i = 1}^N Z_i \frac{N}{n} E\left(\hat{t}_i \bigg| Z_1,\dots,Z_N \right) \right] + E \left[ \sum_{i = 1}^N Z_i^2 \frac{N^2}{n^2} V\left(\hat{t}_i\bigg| Z_1,\dots,Z_N \right) \right] \\ &= V \left[ \sum_{i = 1}^N Z_i \frac{N}{n} E\left(\hat{t}_i \right) \right] + E \left[ \sum_{i = 1}^N Z_i^2 \frac{N^2}{n^2} V\left(\hat{t}_i \right) \right] \quad \text{because } \hat{t}_i \perp (Z_1,\dots,Z_N)\\ &= V \left[ \sum_{i = 1}^N Z_i \frac{N}{n} t_i \right] + \sum_{i = 1}^N E(Z_i^2) \frac{N^2}{n^2} V\left(\hat{t}_i \right) \quad \text{because } E(\hat{t}_i) = t_i \\ &= N^2 V \left[ \sum_{i = 1}^N Z_i \frac{t_i}{n} \right] + \sum_{i = 1}^N \frac{n}{N} \frac{N^2}{n^2} V\left(\hat{t}_i \right) \\ &= N^2 V(\bar{t}) + \frac{N}{n} \sum_{i = 1}^N V\left(\hat{t}_i \right) \quad \text{where } \bar{t} = \frac{1}{n} \sum_{i = 1}^N Z_i t_i \\ &= \underbrace{N^2 \left(1-\frac{n}{N}\right)\frac{S^2_t}{n}}_{V_{stage1}} + \underbrace{\frac{N}{n} \sum_{i=1}^N M_i^2 \left(1-\frac{m_i}{M_i}\right) \frac{S_i^2}{m_i}}_{V_{stage2}} \end{aligned}\]
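As a sanity check on the two-term result, the following minimal sketch enumerates every possible two-stage SRS sample for a tiny made-up population and compares the exact variance of \(\hat{t}_{clus}\) with the closed-form expression. The population values, `n`, `m`, and all function names are illustrative assumptions, not part of the course materials. Exact rational arithmetic (`fractions.Fraction`) is used so the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations, product


def pop_var(values):
    """Finite-population variance S^2 (divisor: number of values minus 1)."""
    k = len(values)
    mean = Fraction(sum(values), k)
    return sum((v - mean) ** 2 for v in values) / (k - 1)


def formula_variance(clusters, n, m):
    """Return (V_stage1, V_stage2) from the closed-form expression."""
    N = len(clusters)
    totals = [sum(c) for c in clusters]  # true PSU totals t_i
    stage1 = N**2 * (1 - Fraction(n, N)) * pop_var(totals) / n
    stage2 = Fraction(N, n) * sum(
        len(c) ** 2 * (1 - Fraction(m_i, len(c))) * pop_var(c) / m_i
        for c, m_i in zip(clusters, m)
    )
    return stage1, stage2


def exact_moments(clusters, n, m):
    """Exact mean and variance of t_hat_clus by enumerating all samples."""
    N = len(clusters)
    psu_samples = list(combinations(range(N), n))
    outcomes = []  # pairs of (probability, t_hat_clus)
    for psus in psu_samples:
        # Every possible SRS subsample within each selected PSU.
        sub_lists = [list(combinations(clusters[i], m[i])) for i in psus]
        n_sub = 1
        for subs in sub_lists:
            n_sub *= len(subs)
        prob = Fraction(1, len(psu_samples) * n_sub)
        for subs in product(*sub_lists):
            # t_hat_i = M_i * (subsample mean); t_hat_clus = (N/n) * sum of t_hat_i
            t_hat = Fraction(N, n) * sum(
                Fraction(len(clusters[i]) * sum(sub), m[i])
                for i, sub in zip(psus, subs)
            )
            outcomes.append((prob, t_hat))
    mean = sum(p * t for p, t in outcomes)
    var = sum(p * (t - mean) ** 2 for p, t in outcomes)
    return mean, var


# Hypothetical population: N = 3 PSUs with M = (3, 2, 4) elements.
clusters = [[1, 2, 4], [3, 5], [2, 6, 7, 9]]
n, m = 2, [2, 1, 2]

stage1, stage2 = formula_variance(clusters, n, m)
mean, var = exact_moments(clusters, n, m)
print("true total:", sum(map(sum, clusters)), "E[t_hat_clus]:", mean)
print("formula:", stage1 + stage2, "enumeration:", var)
```

Enumeration is only feasible for toy populations like this one, but it checks the formula without any simulation noise.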
Expected Value of \(\widehat{V}_{stage1}\)
Estimate of \(V(\hat{t}_{clus})\): \[\widehat{V}(\hat{t}_{clus}) = \underbrace{N^2 \left(1-\frac{n}{N}\right)\frac{s^2_t}{n}}_{\widehat{V}_{stage1}} + \underbrace{\frac{N}{n} \sum_{i \in \S} \left(1-\frac{m_i}{M_i}\right)M_i^2 \frac{s_i^2}{m_i}}_{\widehat{V}_{stage2}}\]
To find \(E(\widehat{V}_{stage1})\), we use the fact that: \[E[s^2_t] = S^2_t + \frac{1}{N} \sum_{i=1}^N \left(1-\frac{m_i}{M_i}\right) M_i^2 \frac{S^2_i}{m_i}\]
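We can see that \(E[s^2_t] \ne S^2_t\), i.e., \(s_t^2\) is not an unbiased estimator of \(S^2_t\).
On average it overestimates \(S^2_t\), which makes sense: \(s^2_t\) is computed from the estimated PSU totals \(\hat{t}_i\), and each \(\hat{t}_i\) changes with a different subsample in PSU \(i\), adding within-PSU sampling variability on top of the variability between PSU totals.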
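The expression for \(E[s^2_t]\) can be checked numerically. The sketch below enumerates all two-stage SRS samples for a small hypothetical population, averages \(s^2_t\) (taken here as the sample variance of the \(\hat{t}_i\) with divisor \(n-1\)) over them, and compares the result with the right-hand side. The data, `n`, `m`, and function names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import combinations, product


def var_minus1(values):
    """Variance with divisor (count - 1); used for both S^2 and s^2."""
    k = len(values)
    mean = sum(values, Fraction(0)) / k
    return sum((v - mean) ** 2 for v in values) / (k - 1)


def expected_s2_t(clusters, n, m):
    """Exact E[s_t^2] by enumerating every two-stage SRS sample."""
    N = len(clusters)
    psu_samples = list(combinations(range(N), n))
    expectation = Fraction(0)
    for psus in psu_samples:
        sub_lists = [list(combinations(clusters[i], m[i])) for i in psus]
        n_sub = 1
        for subs in sub_lists:
            n_sub *= len(subs)
        prob = Fraction(1, len(psu_samples) * n_sub)
        for subs in product(*sub_lists):
            # Estimated PSU totals: t_hat_i = M_i * (subsample mean).
            t_hats = [
                Fraction(len(clusters[i]) * sum(sub), m[i])
                for i, sub in zip(psus, subs)
            ]
            expectation += prob * var_minus1(t_hats)
    return expectation


# Hypothetical population: N = 3 PSUs, n = 2 sampled, m_i elements subsampled.
clusters = [[1, 2, 4], [3, 5], [2, 6, 7, 9]]
n, m = 2, [2, 1, 2]
N = len(clusters)

S2_t = var_minus1([sum(c) for c in clusters])  # variance of true PSU totals
within = sum(
    (1 - Fraction(m_i, len(c))) * len(c) ** 2 * var_minus1(c) / m_i
    for c, m_i in zip(clusters, m)
)
lhs = expected_s2_t(clusters, n, m)
rhs = S2_t + within / N
print("E[s_t^2] by enumeration:", lhs)
print("S_t^2 + (1/N) * sum:   ", rhs)
print("S_t^2 alone:            ", S2_t)
```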
Thus we have: \[\begin{aligned} E[\widehat{V}_{stage1}] = E\left[N^2 \left(1-\frac{n}{N}\right)\frac{s^2_t}{n}\right] &= N^2 \left(1-\frac{n}{N}\right)\frac{1}{n} E[s^2_t]\\ &= N^2 \left(1-\frac{n}{N}\right)\frac{1}{n} \left[S^2_t + \frac{1}{N} \sum_{i=1}^N \left(1-\frac{m_i}{M_i}\right) M_i^2 \frac{S^2_i}{m_i} \right] \\ &= N^2 \left(1-\frac{n}{N}\right)\frac{S^2_t}{n} + \frac{N}{n} \textcolor{red}{\left(1-\frac{n}{N}\right)} \sum_{i=1}^N M_i^2\left(1-\frac{m_i}{M_i}\right) \frac{S^2_i}{m_i} \end{aligned}\]
Compare to the true variance of the estimated total, \(V(\hat{t}_{clus})\): \[V(\hat{t}_{clus}) = N^2 \left(1-\frac{n}{N}\right)\frac{S^2_t}{n} + \frac{N}{n} \sum_{i=1}^N M_i^2 \left(1-\frac{m_i}{M_i}\right) \frac{S_i^2}{m_i}\]
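The only difference between \(E(\widehat{V}_{stage1})\) and \(V(\hat{t}_{clus})\) is the extra FPC factor \(\left(1-\frac{n}{N}\right)\) on the second term, shown in red above. Because this factor is less than 1, \(\widehat{V}_{stage1}\) on its own underestimates \(V(\hat{t}_{clus})\) on average.
Thus, if \(N\) is large relative to \(n\) (i.e., we sample a small fraction of PSUs), the factor is close to 1 and \(\widehat{V}_{stage1}\) is approximately unbiased for \(V(\hat{t}_{clus})\).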
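To get a feel for how the first-stage sampling fraction drives this bias, here is a small sketch that evaluates the relative bias \(\left[E(\widehat{V}_{stage1}) - V(\hat{t}_{clus})\right] / V(\hat{t}_{clus})\) from the two expressions above. All inputs (`N`, `S2_t`, and the per-PSU within terms) are made-up numbers for illustration, and the function name is hypothetical.

```python
def stage1_relative_bias(N, n, S2_t, within_terms):
    """Relative bias of V_hat_stage1 as an estimator of V(t_hat_clus).

    within_terms holds, for each of the N PSUs, the quantity
    M_i^2 * (1 - m_i / M_i) * S_i^2 / m_i.
    """
    fpc = 1 - n / N
    v_stage1 = N**2 * fpc * S2_t / n
    v_stage2 = (N / n) * sum(within_terms)
    true_var = v_stage1 + v_stage2
    expected_stage1 = v_stage1 + fpc * v_stage2  # E[V_hat_stage1]
    return (expected_stage1 - true_var) / true_var


# Hypothetical population: 1000 PSUs, each contributing the same within term.
N = 1000
S2_t = 400.0
within_terms = [50.0] * N

sample_sizes = (10, 100, 500)
biases = [stage1_relative_bias(N, n, S2_t, within_terms) for n in sample_sizes]
for n, b in zip(sample_sizes, biases):
    print(f"n/N = {n / N:.2f}: relative bias = {b:.4f}")
```

Under these assumed inputs the relative bias is negative for every sampling fraction and shrinks toward zero as \(n/N\) decreases, matching the conclusion above.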