Two-Stage Cluster Sampling Variance Derivation

PUBHBIO 7225

Derivation of $V(\hat{t}_{clus})$ Under Two-Stage Cluster Sampling

For two-stage cluster sampling with SRS at both stages:

The variance follows from the iterated variance formula, $V(Y) = V[E(Y \mid X)] + E[V(Y \mid X)]$, together with the properties of SRS estimators.

We condition on the realized first-stage sample, i.e., on the set of sample selection indicators for the PSUs:

$Z_i$ = selection indicator, $Z_i = \begin{cases} 1, & \text{if PSU $i$ is in the sample}\\ 0, & \text{otherwise} \end{cases}$

Because $Z_i$ is binary, $Z_i^2 = Z_i$, so $E(Z_i) = E(Z_i^2) = n/N$.

\[\begin{aligned}
V(\hat{t}_{clus}) = V \left(\sum_{i \in \S} \frac{N}{n} \hat{t}_i \right) &= V \left(\sum_{i = 1}^N Z_i \frac{N}{n} \hat{t}_i \right) \\
&= V \left[ E\left(\sum_{i = 1}^N Z_i \frac{N}{n} \hat{t}_i \bigg| Z_1,\dots,Z_N \right) \right] + E \left[ V\left(\sum_{i = 1}^N Z_i \frac{N}{n} \hat{t}_i\bigg| Z_1,\dots,Z_N \right) \right] \\
&= V \left[ \sum_{i = 1}^N Z_i \frac{N}{n} E\left(\hat{t}_i \bigg| Z_1,\dots,Z_N \right) \right] + E \left[ \sum_{i = 1}^N Z_i^2 \frac{N^2}{n^2} V\left(\hat{t}_i \bigg| Z_1,\dots,Z_N \right) \right] \\
&= V \left[ \sum_{i = 1}^N Z_i \frac{N}{n} E\left(\hat{t}_i \right) \right] + E \left[ \sum_{i = 1}^N Z_i^2 \frac{N^2}{n^2} V\left(\hat{t}_i \right) \right] \quad \text{because } \hat{t}_i \perp (Z_1,\dots,Z_N)^*\\
&= V \left[ \sum_{i = 1}^N Z_i \frac{N}{n} t_i \right] + \sum_{i = 1}^N E(Z_i^2) \frac{N^2}{n^2} V\left(\hat{t}_i \right) \quad \text{because } E(\hat{t}_i) = t_i \\
&= N^2 V \left[ \sum_{i = 1}^N Z_i \frac{t_i}{n} \right] + \sum_{i = 1}^N \frac{n}{N} \frac{N^2}{n^2} V\left(\hat{t}_i \right) \\
&= N^2 V(\bar{t}) + \frac{N}{n} \sum_{i = 1}^N V\left(\hat{t}_i \right) \quad \text{where } \bar{t} = \frac{1}{n} \sum_{i = 1}^N Z_i t_i \\
&= \underbrace{N^2 \left(1-\frac{n}{N}\right)\frac{S^2_t}{n}}_{V_{stage1}} + \underbrace{\frac{N}{n} \sum_{i=1}^N M_i^2 \left(1-\frac{m_i}{M_i}\right) \frac{S_i^2}{m_i}}_{V_{stage2}}
\end{aligned}\]
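As a sanity check on the final formula, the sketch below evaluates the two components for a tiny made-up population and compares their sum with the exact design variance obtained by listing every possible two-stage sample. The function names, the toy data, and the choice of subsample sizes are all illustrative assumptions, not part of the course material. It assumes $S^2_t$ and $S^2_i$ use the usual $N-1$ and $M_i-1$ divisors and that every PSU has at least two SSUs.

```python
from itertools import combinations, product
from statistics import mean, variance


def two_stage_variance(psus, n, m):
    """Return (V_stage1, V_stage2) from the closed-form expression.

    psus: list of lists, the y-values of every SSU in every PSU (hypothetical data)
    n:    number of PSUs sampled by SRS
    m:    list of subsample sizes m_i, one per PSU in the population
    """
    N = len(psus)
    totals = [sum(y) for y in psus]        # t_i
    S2_t = variance(totals)                # divisor N - 1
    stage1 = N**2 * (1 - n / N) * S2_t / n
    stage2 = (N / n) * sum(
        len(y) ** 2 * (1 - m_i / len(y)) * variance(y) / m_i   # divisor M_i - 1
        for y, m_i in zip(psus, m)
    )
    return stage1, stage2


def enumerated_variance(psus, n, m):
    """Exact design variance of t_hat_clus by enumerating all two-stage samples."""
    N = len(psus)
    # Every possible t_hat_i = M_i * ybar_i; each is equally likely under SRS.
    t_hats = [
        [len(y) * mean(sub) for sub in combinations(y, m_i)]
        for y, m_i in zip(psus, m)
    ]
    psu_samples = list(combinations(range(N), n))
    values, weights = [], []
    for sample in psu_samples:
        combos = list(product(*(t_hats[i] for i in sample)))
        for combo in combos:
            values.append(N / n * sum(combo))
            weights.append(1 / (len(psu_samples) * len(combos)))
    mu = sum(w * v for v, w in zip(values, weights))
    return sum(w * (v - mu) ** 2 for v, w in zip(values, weights))


# Toy population: N = 3 PSUs with M = (3, 2, 4); sample n = 2 PSUs.
psus = [[1, 2, 3], [4, 6], [5, 5, 5, 5]]
stage1, stage2 = two_stage_variance(psus, n=2, m=[2, 1, 2])
print(stage1, stage2, enumerated_variance(psus, n=2, m=[2, 1, 2]))
```

Working the toy example by hand gives a stage-1 component of 78 and a stage-2 component of 8.25, and the enumeration should agree with their sum up to floating-point rounding.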

Expected Value of $\widehat{V}_{stage1}$

Estimate of $V(\hat{t}_{clus})$: \[\widehat{V}(\hat{t}_{clus}) = \underbrace{N^2 \left(1-\frac{n}{N}\right)\frac{s^2_t}{n}}_{\widehat{V}_{stage1}} + \underbrace{\frac{N}{n} \sum_{i \in \S} \left(1-\frac{m_i}{M_i}\right)M_i^2 \frac{s_i^2}{m_i}}_{\widehat{V}_{stage2}}\]
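A minimal sketch of how this estimate could be computed from sample data follows. The function name and input layout are hypothetical. It assumes $s^2_t$ is the sample variance (divisor $n-1$) of the estimated PSU totals $\hat{t}_i = M_i \bar{y}_i$, that $s^2_i$ is the within-PSU sample variance (divisor $m_i-1$), and that $n \ge 2$ and every $m_i \ge 2$.

```python
from statistics import mean, variance


def estimate_two_stage_variance(sample, N):
    """Return (t_hat_clus, V_hat_stage1, V_hat_stage2).

    sample: list of (M_i, y_values) pairs, one per sampled PSU, where y_values
            are the observed SSU values in that PSU's subsample (hypothetical layout)
    N:      number of PSUs in the population
    """
    n = len(sample)
    t_hat = [M_i * mean(y) for M_i, y in sample]   # estimated PSU totals
    s2_t = variance(t_hat)                         # divisor n - 1
    v_stage1 = N**2 * (1 - n / N) * s2_t / n
    v_stage2 = (N / n) * sum(
        (1 - len(y) / M_i) * M_i**2 * variance(y) / len(y)   # divisor m_i - 1
        for M_i, y in sample
    )
    t_hat_clus = (N / n) * sum(t_hat)
    return t_hat_clus, v_stage1, v_stage2


# Made-up sample: n = 2 of N = 3 PSUs, with M = (3, 4) and m = (2, 2).
t_hat_clus, v1, v2 = estimate_two_stage_variance([(3, [1, 3]), (4, [5, 5])], N=3)
print(t_hat_clus, v1 + v2)
```

For this made-up sample the hand calculation gives $\hat{t}_{clus} = 39$, $\widehat{V}_{stage1} = 147$, and $\widehat{V}_{stage2} = 4.5$.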

To find $E(\widehat{V}_{stage1})$, we use the fact that: \[E[s^2_t] = S^2_t + \frac{1}{N} \sum_{i=1}^N \left(1-\frac{m_i}{M_i}\right) M_i^2 \frac{S^2_i}{m_i}\]

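We can see that $E[s^2_t] \ne S^2_t$, i.e., $s_t^2$ is not an unbiased estimator of $S^2_t$.

It overestimates $S^2_t$ on average, which makes sense: $s^2_t$ is computed from the estimated totals $\hat{t}_i$, and each $\hat{t}_i$ varies with the subsample drawn in PSU $i$, adding second-stage variability on top of the between-PSU variability.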

Thus we have: \[\begin{aligned} E[\widehat{V}_{stage1}] = E\left[N^2 \left(1-\frac{n}{N}\right)\frac{s^2_t}{n}\right] &= N^2 \left(1-\frac{n}{N}\right)\frac{1}{n} E[s^2_t]\\ &= N^2 \left(1-\frac{n}{N}\right)\frac{1}{n} \left[S^2_t + \frac{1}{N} \sum_{i=1}^N \left(1-\frac{m_i}{M_i}\right) M_i^2 \frac{S^2_i}{m_i} \right] \\ &= N^2 \left(1-\frac{n}{N}\right)\frac{S^2_t}{n} + \frac{N}{n} \textcolor{red}{\left(1-\frac{n}{N}\right)} \sum_{i=1}^N M_i^2\left(1-\frac{m_i}{M_i}\right) \frac{S^2_i}{m_i} \end{aligned}\]

Compare to the true variance of the estimated total, $V(\hat{t}_{clus})$: \[V(\hat{t}_{clus}) = N^2 \left(1-\frac{n}{N}\right)\frac{S^2_t}{n} + \frac{N}{n} \sum_{i=1}^N M_i^2 \left(1-\frac{m_i}{M_i}\right) \frac{S_i^2}{m_i}\]

The only difference between $E(\widehat{V}_{stage1})$ and $V(\hat{t}_{clus})$ is the FPC factor $\left(1-\frac{n}{N}\right)$ shown in red above, which multiplies the stage-2 term.

Thus, if $n/N$ is small (i.e., $N$ is large and we sample a small fraction of PSUs), then $\widehat{V}_{stage1}$ alone is approximately unbiased for $V(\hat{t}_{clus})$.
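To see the size of the gap concretely, the sketch below computes $E(\widehat{V}_{stage1})$ exactly for a tiny made-up population by averaging $\widehat{V}_{stage1}$ over every possible two-stage sample, then compares it with the closed-form expressions. The function names and toy data are illustrative assumptions. It assumes $n \ge 2$ and the usual $N-1$, $M_i-1$, and $n-1$ divisors.

```python
from itertools import combinations, product
from statistics import mean, variance


def expected_v_stage1(psus, n, m):
    """Exact E[V_hat_stage1], enumerating every two-stage SRS sample."""
    N = len(psus)
    # All equally likely values of t_hat_i = M_i * ybar_i for each PSU.
    t_hats = [
        [len(y) * mean(sub) for sub in combinations(y, m_i)]
        for y, m_i in zip(psus, m)
    ]
    per_psu_sample = []
    for sample in combinations(range(N), n):
        v_hats = [
            N**2 * (1 - n / N) * variance(combo) / n   # s_t^2 with divisor n - 1
            for combo in product(*(t_hats[i] for i in sample))
        ]
        per_psu_sample.append(mean(v_hats))   # subsamples equally likely given the PSUs
    return mean(per_psu_sample)               # PSU samples equally likely


def formula_terms(psus, n, m):
    """Return (V_stage1, W) with W = sum_i M_i^2 (1 - m_i/M_i) S_i^2 / m_i."""
    N = len(psus)
    v_stage1 = N**2 * (1 - n / N) * variance([sum(y) for y in psus]) / n
    w = sum(
        len(y) ** 2 * (1 - m_i / len(y)) * variance(y) / m_i
        for y, m_i in zip(psus, m)
    )
    return v_stage1, w


# Toy population: N = 3 PSUs, n = 2 sampled, m_i = 2 in every PSU.
psus, n, m = [[1, 2, 3], [4, 6], [5, 5, 5, 5]], 2, [2, 2, 2]
N = len(psus)
v_stage1, w = formula_terms(psus, n, m)
true_var = v_stage1 + (N / n) * w
print(expected_v_stage1(psus, n, m), v_stage1 + (N / n) * (1 - n / N) * w, true_var)
```

Here the sampling fraction is large ($n/N = 2/3$), so the gap is visible: by hand, $E(\widehat{V}_{stage1}) = 78.75$ against $V(\hat{t}_{clus}) = 80.25$. Algebraically the difference is $-\sum_{i=1}^N M_i^2 \left(1-\frac{m_i}{M_i}\right) \frac{S^2_i}{m_i}$, which becomes small relative to the total variance as $n/N$ shrinks.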
