PUBHBIO 7225 Lecture 16
Topics
Domain Estimation
Chi-Square Tests
Discussion of Individual Project (time-permitting)
Activities
Readings
Assignments
Domain = a subgroup in the population that is not part of the sampling design.
Also called subpopulations or subgroups or subclasses.
Not a stratum, not a cluster/group of clusters
Domains can cross strata (and often/usually do)
Domains can include units from many clusters (but maybe not all clusters)
Like strata and clusters, each unit \(i\) either belongs to a certain domain or does not belong to the domain
Unlike strata and clusters, we don’t know whether unit \(i\) belongs to the domain until it is sampled (i.e., domain membership is recorded on the survey)
Examples:
SRS of OSU students on Columbus campus, domain = students who live off-campus
Stratified (by county) random sample of women who recently gave birth in Ohio, domain = women who did not have a prenatal visit
Two-stage cluster sample of students within schools, domain = students with a disclosed disability
Key Point: The number of units in the domain in a sample is a random variable
Take another sample, get a different number of units in the domain
Formula for the variance of a domain total incorporates this sample-to-sample variability
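A minimal simulation sketch of the key point above, assuming a hypothetical population of 10,000 units of which 3,000 belong to the domain (all names and sizes here are made up for illustration). It draws repeated SRSs and counts the domain members in each:

```python
import random

def domain_counts(pop_size, domain_size, n, n_samples, seed=0):
    """Count domain members in repeated SRSs (without replacement) of size n.

    Hypothetical setup: units 0 .. domain_size-1 are the domain members.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_samples):
        sample = rng.sample(range(pop_size), n)
        counts.append(sum(1 for unit in sample if unit < domain_size))
    return counts

# Each sample has the same total size n = 200, but the domain sample size
# varies from draw to draw around n * 3000/10000 = 60.
counts = domain_counts(pop_size=10_000, domain_size=3_000, n=200, n_samples=5)
print(counts)
```

The total sample size is fixed by design, but the domain count is not, which is the variability the domain variance formula has to account for.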
Statistical software must recognize (be told about / have info about) all the original design strata and PSUs – even if some don’t include any domain members
Thus, you must never subset your data for analysis
Subsetting will definitely cause problems when:
A stratum does not contain any members of the domain
A PSU does not contain any members of a domain
Indicator variable, \(x_i\), for domain membership: \[x_i = \begin{cases} 1 & \text{unit $i$ is in domain $d$}\\ 0 & \text{unit $i$ not in domain $d$} \end{cases}\]
Estimates:
Domain total: \(\displaystyle \hat{t}^{(d)} = \sum_{i \in \mathcal{S}} w_i x_i y_i\)
Domain mean: \(\displaystyle \bar{y}^{(d)} = \frac{\sum_{i \in \mathcal{S}} w_i x_i y_i}{\sum_{i \in \mathcal{S}} w_i x_i}\)
Same point estimate as if you dropped the non-domain units (\(x_i=0\) for units not in the domain)
But variance estimation would be incorrect
You must use the whole sample for estimation
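The point estimates above can be sketched in a few lines, using a small made-up sample (the weights, indicators, and outcomes are hypothetical). The sketch only illustrates that the weighted domain mean equals the mean from the subsetted data; it does not attempt variance estimation, which is where subsetting goes wrong:

```python
def domain_total(w, x, y):
    """Estimated domain total: sum of w_i * x_i * y_i over the whole sample."""
    return sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))

def domain_mean(w, x, y):
    """Estimated domain mean: domain total divided by sum of w_i * x_i."""
    return domain_total(w, x, y) / sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical toy sample: weights, domain indicators, outcomes
w = [10, 10, 20, 20, 5]
x = [1, 0, 1, 0, 1]
y = [3, 7, 4, 1, 6]

total = domain_total(w, x, y)   # 10*3 + 20*4 + 5*6 = 140
mean = domain_mean(w, x, y)     # 140 / (10 + 20 + 5) = 4.0

# Dropping the non-domain units gives the same point estimate
subset = [(wi, yi) for wi, xi, yi in zip(w, x, y) if xi == 1]
subset_mean = sum(wi * yi for wi, yi in subset) / sum(wi for wi, _ in subset)
print(total, mean, subset_mean)
```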
Analysis of data from the 2004 National Hospital Ambulatory Medical Care Survey (paper on Carmen)
NHAMCS collects data on the use and provision of ambulatory care services in hospital emergency departments (EDs), outpatient departments, and ambulatory surgery locations
4-stage design with stratification:
PSUs = geographic areas (selection via PPS) (stratified by SES variables)
hospitals within PSUs (selection via PPS)
clinics within hospitals (complex selection procedure involving PPS)
patient visits within clinics (systematic sampling)
2004 sample had 8 strata and 294 PSUs (between 6 and 86 PSUs per stratum)
Domain of interest: visits to EDs by elderly (age \(\geq\) 60) African-American males
Goal: estimate the percentage of visits in this domain with dizziness or vertigo as a reason for the visit
Two analysis approaches taken:
Conditional Approach: Subsetting the data to domain members only and performing estimation (bad!)
Unconditional Approach: Appropriately including all sampled units in calculations (good!)
| | Unconditional Approach (Correct) | Conditional Approach (Incorrect) | When subsetting, we have… |
|---|---|---|---|
| Sample Size | 68,372 | 397 | |
| Subpopulation Sample Size | 397 | 397 | |
| Design Strata | 8 | 8 | Correct # strata |
| Design Clusters | 294 | 114 | Too few PSUs |
| Design DF (# PSUs − # Strata) | 286 | 106 | DF too small |
| Estimated Percentage | 4.8201 | 4.8201 | Correct point estimate |
| Standard Error | 1.5904 | 1.5761 | SE too small |
| 95% CI | (1.6897, 7.9504) | (1.6954, 7.9448) | CI too narrow |
Adapted from Table 1 in West et al. (2008)
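As a rough check, the two confidence intervals in the table can be reproduced as estimate \(\pm\ t \times\) SE using the design degrees of freedom. The critical values below are approximate \(t\) quantiles supplied here for illustration; they are not quoted in the source.

\[\text{Unconditional: } 4.8201 \pm t_{0.975,\,286}\,(1.5904) \approx 4.8201 \pm (1.9683)(1.5904) \approx (1.690,\ 7.950)\]

\[\text{Conditional: } 4.8201 \pm t_{0.975,\,106}\,(1.5761) \approx 4.8201 \pm (1.9826)(1.5761) \approx (1.695,\ 7.945)\]

Having fewer degrees of freedom raises the critical value slightly, but the understated SE dominates, so the conditional interval still comes out too narrow.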
Stata:
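Stata user warning: do not use the "if" qualifier in Stata for domain estimates! Use the subpop() option of the svy prefix instead, so that all design strata and PSUs are retained.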
R:
Domain Estimation and Comparison (Part 1)
So far in this course we have done a lot of estimating of means and totals and proportions, either for the whole population or (today) for domains
We have yet to do any formal comparisons – i.e., any hypothesis testing
If we want to compare two (or more) domains, often we will be using chi-square tests – since so much of survey data is categorical
I am going to assume you are familiar with (and comfortable with) chi-square tests in the context of infinite populations (i.e., what you’ve seen before this class), but stop me if you have questions!
We will start with an example that illustrates why taking into account the survey design is important when doing tests like a chi-square test
Suppose we have an SRS of \(n\) = 500 co-habiting couples from a very large population (i.e., \(N\) large enough to be able to ignore the FPC)
We want to compare two domains: People who have Netflix and People who do not have Netflix
Question: Are people who have Netflix less likely to subscribe to cable TV?
(We recognize that no causal link can be made with a cross-sectional survey!)
| netflix | Cable TV | No Cable TV | Total |
|---|---|---|---|
| Has Netflix | 57.49% (119) | 42.51% (88) | 100.00% (207) |
| No Netflix | 64.16% (188) | 35.84% (105) | 100.00% (293) |
Resulting estimates:
119/207 = 57.5% of couples with Netflix have cable TV
188/293 = 64.2% of couples without Netflix have cable TV
Notice that the number of couples with (and without) Netflix would be different if we took another sample – random domain sample size!
Pearson's Chi-squared test
data: dat$netflix and dat$cable
X-squared = 2.281, df = 1, p-value = 0.131
Conclusion: There is no evidence of an association between having Netflix and having cable
(i.e., no evidence that the proportion of people with cable differs for people with/without Netflix)
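The test statistic above can be reproduced by hand. The following is a minimal sketch of the ordinary Pearson chi-square computation with no continuity correction (that choice is inferred from the reported value, not stated on the slide); the function names are illustrative only:

```python
import math

def pearson_chi2(table):
    """Pearson X^2 for an r x c table of counts (no continuity correction)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            x2 += (observed - expected) ** 2 / expected
    return x2

def p_value_df1(x2):
    """Upper-tail p-value for a chi-square with 1 df: P(|Z| > sqrt(x2))."""
    return math.erfc(math.sqrt(x2 / 2))

# Rows: Has Netflix, No Netflix; columns: Cable TV, No Cable TV
couples = [[119, 88], [188, 105]]
x2 = pearson_chi2(couples)
p = p_value_df1(x2)
print(round(x2, 3), round(p, 3))
```

This calculation treats the 500 couples as an SRS from an effectively infinite population, which is the assumption the design-based corrections later relax.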
Summarize observed association with an odds ratio (OR) or prevalence ratio (PR):
Surveys are (usually) cross-sectional → prefer PR to RR (risk ratio)
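For the table above, the two summaries work out as follows (a worked illustration using the standard definitions, comparing couples with Netflix to couples without):

\[\text{PR} = \frac{119/207}{188/293} = \frac{0.5749}{0.6416} \approx 0.90, \qquad \text{OR} = \frac{119 \times 105}{88 \times 188} = \frac{12495}{16544} \approx 0.76\]

Both are below 1, in line with the lower observed cable prevalence among couples with Netflix.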
Now, suppose we asked both partners about Netflix and cable
| netflix | Cable TV | No Cable TV | Total |
|---|---|---|---|
| Has Netflix | 57.49% (238) | 42.51% (176) | 100.00% (414) |
| No Netflix | 64.16% (376) | 35.84% (210) | 100.00% (586) |
Pearson's Chi-squared test
data: dat_both$netflix and dat_both$cable
X-squared = 4.5621, df = 1, p-value = 0.03269
Ignoring the clustering has inflated the test statistic value and lowered the p-value
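ICC for this design is very high; in fact it is 1, because both partners in a couple give identical answers (every count in the table is exactly doubled)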
What is the DEFF for this design?
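One way to work through this question, assuming the usual equal-cluster-size approximation \(\text{DEFF} = 1 + (M-1)\,\text{ICC}\) (a formula not stated on this slide), with \(M = 2\) partners per couple and \(\text{ICC} = 1\) because both partners report identical household-level answers:

\[\text{DEFF} = 1 + (2-1)(1) = 2, \qquad \frac{X^2}{\text{DEFF}} = \frac{4.5621}{2} \approx 2.281\]

Dividing the inflated statistic by the DEFF recovers the value obtained from the 500 couples.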
Ignoring clustering can inflate test statistics and deflate p-values
Remember, DEFF > 1 for many outcomes in a cluster sample
A cluster sample of size \(n\) provides the same precision as an SRS of size \(n\)/DEFF
Thus you are overstating your significance (p-value too small)
Ignoring stratification can deflate test statistics and inflate p-values
Remember, DEFF < 1 for many outcomes in a stratified sample
A stratified sample of size \(n\) provides the same precision as an SRS of size \(n\)/DEFF
Thus you are being too conservative (p-value too high)
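The \(n\)/DEFF rule in both cases can be sketched as below. The DEFF of 2 corresponds to the two-partners-per-couple example; the DEFF of 0.8 for a stratified design is a made-up value for illustration:

```python
def effective_sample_size(n, deff):
    """SRS sample size with the same precision as a complex sample of size n."""
    return n / deff

# Cluster sample: 1,000 people, DEFF = 2 -> worth only 500 SRS observations
n_eff_cluster = effective_sample_size(1000, 2.0)

# Stratified sample: hypothetical DEFF = 0.8 -> worth roughly 1,250 SRS observations
n_eff_stratified = effective_sample_size(1000, 0.8)

print(n_eff_cluster, n_eff_stratified)
```

An unadjusted test behaves as if all \(n\) observations were an SRS, which is why it is too liberal when DEFF > 1 and too conservative when DEFF < 1.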
However, we can adjust the test statistic \(X^2\) in order to make a rescaled test statistic have an approximately \(\chi^2\) distribution
Two methods used most often (not very creatively named):
First Order Correction (also called Rao-Scott first order correction)
Second Order Correction (also called Rao-Scott second order correction)
Idea: Match the mean of the test statistic \(X^2\) (under the complex design) to the mean of a \(\chi^2_{(r-1)(c-1)}\) distribution, thus “rescaling” the test statistic
“First Order” because we’re rescaling based on the mean
Calculate \(E[X^2]\) = the expected value of \(X^2\) for the complex design under \(H_0\)
Calculate first order corrected test statistic as \(X^2_F = \frac{(r-1)(c-1) X^2}{E[X^2]}\)
Compare \(X^2_F\) to a \(\chi^2_{(r-1)(c-1)}\) distribution
Slight problem: we are matching the mean of the distribution – but p-values often come from the tail of the distribution (at least the p-values we care about most!)
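The first order steps above can be sketched as follows. The function name is illustrative, and \(E[X^2]\) is passed in rather than estimated. For the two-partner example it is taken to be 2, on the assumption that duplicating every observation doubles a statistic whose null mean is about 1:

```python
def first_order_correction(x2, n_rows, n_cols, expected_x2):
    """Rescale X^2 so its mean under H0 matches that of a chi-square
    with (r-1)(c-1) degrees of freedom.

    expected_x2 is E[X^2] under H0 for the complex design (supplied by
    the caller; estimating it from the design is not shown here).
    """
    df = (n_rows - 1) * (n_cols - 1)
    x2_f = df * x2 / expected_x2
    return x2_f, df

# Two-partner example: unadjusted X^2 = 4.5621, assumed E[X^2] = 2
x2_f, df = first_order_correction(4.5621, n_rows=2, n_cols=2, expected_x2=2.0)
print(round(x2_f, 3), df)
```

Under that assumption the corrected statistic lands back at about 2.281, the value obtained from the 500 couples.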
Idea: Match the mean and variance of the test statistic \(X^2\) (under the complex design) to the mean and variance of a \(\chi^2_{(r-1)(c-1)}\) distribution, thus “rescaling” the test statistic
“Second Order” because we’re rescaling based on the mean and variance
Calculate \(E[X^2]\) = the expected value of \(X^2\) for the complex design under \(H_0\)
Calculate \(V[X^2]\) = the variance of \(X^2\) under \(H_0\)
Calculate second order corrected test statistic as \(X^2_S = \frac{\nu X^2_F}{(r-1)(c-1)}\) where \(\nu = 2\frac{(E[X^2])^2}{V[X^2]}\)
Compare \(X^2_S\) to a \(\chi^2_{\nu}\) distribution
Calculation of \(V[X^2]\) is even more involved than that of \(E[X^2]\); it requires the covariance matrix of the estimated cell proportions
For many software packages (Stata, R), the second order correction is the default
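A minimal sketch of the second order formulas above, with \(E[X^2]\) and \(V[X^2]\) passed in rather than estimated (real software estimates them from the design; the function name is illustrative). For the two-partner example the inputs are assumed to be \(E[X^2] = 2\) and \(V[X^2] = 8\), i.e. the mean and variance of twice a \(\chi^2_1\) variable:

```python
def second_order_correction(x2, n_rows, n_cols, expected_x2, var_x2):
    """Return (X^2_S, nu): the statistic rescaled to match both the mean
    and the variance of a chi-square distribution with nu df.

    expected_x2 and var_x2 are E[X^2] and V[X^2] under H0 for the complex
    design (supplied by the caller; their estimation is not shown here).
    """
    df = (n_rows - 1) * (n_cols - 1)
    x2_f = df * x2 / expected_x2          # first order corrected statistic
    nu = 2 * expected_x2 ** 2 / var_x2    # matched degrees of freedom
    x2_s = nu * x2_f / df
    return x2_s, nu

# Two-partner example under the stated assumptions
x2_s, nu = second_order_correction(4.5621, 2, 2, expected_x2=2.0, var_x2=8.0)
print(round(x2_s, 3), nu)
```

In this special case \(\nu = 1\), so the first and second order corrections coincide; in general \(\nu\) is not an integer and the two corrections differ.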
Domain Estimation and Comparison (Part 2)