Domain Estimation

PUBHBIO 7225 Lecture 16

Outline

Topics

Domain Estimation
Chi-Square Tests
Discussion of Individual Project (time permitting)

Activities

16.1 Domain Estimation/Comparison

Readings

West BT, Berglund P, Heeringa SG (2008). A closer examination of subpopulation analysis of complex-sample survey data. The Stata Journal, 8(4): 520-531. (PDF on Carmen)

Assignments

Problem Set 4 due Thursday 10/23/2025 11:59pm via Carmen

Estimation in Domains

Domain = a subgroup in the population that is not part of the sampling design.
Also called subpopulations or subgroups or subclasses.

Not a stratum, not a cluster/group of clusters
- Domains can cross strata (and often/usually do)
- Domains can include units from many clusters (but maybe not all clusters)

Like strata and clusters, each unit $i$ either belongs to a certain domain or does not belong to the domain
Unlike strata and clusters, we don’t know what domain unit $i$ belongs to until it is sampled (i.e., record domain membership on the survey)

Examples:
- SRS of OSU students on Columbus campus, domain = students who live off-campus
- Stratified (by county) random sample of women who recently gave birth in Ohio, domain = women who did not have a prenatal visit
- Two-stage cluster sample of students within schools, domain = students with a disclosed disability

Domain Sample Sizes are Random Variables

Key Point: The number of units in the domain in a sample is a random variable
- Take another sample, get a different number of units in the domain
  - Contrast with a stratum – take another sample, get the same number of units in the stratum!

Formula for the variance of a domain total incorporates this sample-to-sample variability
- Statistical software must recognize (be told about / have info about) all the original design strata and PSUs – even if some don’t include any domain members
- Thus, do not subset your data to the domain before analysis (with one exception, noted below)

Subsetting will definitely cause problems when:
- A stratum does not contain any members of the domain
  - Usually unlikely to occur – but possible in highly stratified designs
- A PSU does not contain any members of a domain
  - Quite possible – imagine sampling households, and the domain is males aged 50+

One exception: It’s okay to subset an SRS – the domain members in an SRS of the population form an SRS of the domain

Domain Indicators

Indicator variable, $x_i$, for domain membership: \[x_i = \begin{cases} 1 & \text{unit $i$ is in domain $d$}\\ 0 & \text{unit $i$ not in domain $d$} \end{cases}\]
Estimates:
- Domain total: $\displaystyle \hat{t}^{(d)} = \sum_{i \in \mathcal{S}} w_i x_i y_i$
- Domain mean: $\displaystyle \bar{y}^{(d)} = \frac{\sum_{i \in \mathcal{S}} w_i x_i y_i}{\sum_{i \in \mathcal{S}} w_i x_i}$

Same point estimate as if you dropped the non-domain units ($x_i=0$ for units not in the domain)
But variance estimation would be incorrect
You must use the whole sample for estimation
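A minimal sketch of the indicator-variable estimates, using a small made-up sample (the weights `w`, outcomes `y`, and indicator `x` below are illustrative values, not course data). It only shows that the weighted point estimates match what subsetting would give; it says nothing about variance estimation, which is where subsetting goes wrong.

```r
# Hypothetical sample of 6 units: weight w, outcome y, domain indicator x
w <- c(10, 10, 20, 20, 15, 15)
y <- c(3, 5, 2, 8, 6, 4)
x <- c(1, 0, 1, 1, 0, 0)

# Domain total and domain mean using the full sample and the indicator
t_hat_d <- sum(w * x * y)
ybar_d  <- sum(w * x * y) / sum(w * x)

# Same point estimates after dropping the non-domain units
in_d      <- x == 1
t_hat_sub <- sum(w[in_d] * y[in_d])
ybar_sub  <- sum(w[in_d] * y[in_d]) / sum(w[in_d])

c(total = t_hat_d, mean = ybar_d)
c(total = t_hat_sub, mean = ybar_sub)
```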

Example: Domain Estimation in NHAMCS

Analysis of data from the 2004 National Hospital Ambulatory Medical Care Survey (paper on Carmen)
NHAMCS collects data on the use and provision of ambulatory care services in hospital emergency departments (EDs), outpatient departments, and ambulatory surgery locations
4-stage design with stratification:
1. PSUs = geographic areas (selection via PPS) (stratified by SES variables)
2. hospitals within PSUs (selection via PPS)
3. clinics within hospitals (complex selection procedure involving PPS)
4. patient visits within clinics (systematic sampling)
2004 sample had 8 strata and 294 PSUs (between 6 and 86 PSUs per stratum)

Domain of interest: visits to EDs by elderly (age $\geq$ 60) African-American males
- Domain members could theoretically appear across the strata and PSUs – but there are likely PSUs with no domain members
Goal: estimate the percentage of visits in this domain with dizziness or vertigo as a reason for the visit

Example (cont’d)

Two analysis approaches taken:
1. Conditional Approach: Subsetting the data to domain members only and performing estimation (bad!)
2. Unconditional Approach: Appropriately including all sampled units in calculations (good!)

| | Unconditional Approach (Correct) | Conditional Approach (Incorrect) | When subsetting, we have… |
|---|---|---|---|
| Sample Size | 68,372 | 397 | |
| Subpopulation Sample Size | 397 | 397 | |
| Design Strata | 8 | 8 | Correct # strata |
| Design Clusters | 294 | 114 | Too few PSUs |
| Design DF (# PSUs − # Strata) | 286 | 106 | DF too small |
| Estimated Percentage | 4.8201 | 4.8201 | Correct point estimate |
| Standard Error | 1.5904 | 1.5761 | SE too small |
| 95% CI | (1.6897, 7.9504) | (1.6954, 7.9448) | CI too narrow |

Adapted from Table 1 in West et al. (2008)
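The intervals in the table can be checked from the printed estimate, standard errors, and design DF. This sketch assumes the intervals are t-based with the design degrees of freedom (an assumption about how the table was computed, although it reproduces the printed limits to rounding). The variable names are illustrative.

```r
# Values copied from the table above (adapted from West et al., 2008)
est <- 4.8201
se  <- c(unconditional = 1.5904, conditional = 1.5761)
ddf <- c(unconditional = 294 - 8, conditional = 114 - 8)  # PSUs - strata

# t-based 95% CI: estimate +/- t_{0.975, ddf} * SE
t_crit <- qt(0.975, ddf)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 4)
```

The conditional approach has the larger critical value (fewer DF), but its understated SE still produces the narrower interval.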

Domain Estimation in Software

Stata:

* estimates just for the domain of interest -- units with DOMAIN not equal to 0
svy, subpop(DOMAIN): mean y
svy, subpop(DOMAIN): proportion y

* estimates in all domains defined by the unique values of DOMAIN
svy: mean y, over(DOMAIN)
svy: proportion y, over(DOMAIN)

Stata user warning: do not use the if qualifier (e.g., svy: mean y if DOMAIN==1) for domain estimates!
- This subsets the data before estimation (even if you ran svyset on the full dataset)

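R (survey package):

# estimates just for the domain of interest -- units with DOMAIN = 1/TRUE
# (subset() is applied to the design object, not the raw data, so the full design is retained)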
svymean(~y, subset(DESIGN_OBJECT, DOMAIN==1))

# estimates in all domains defined by the unique values of DOMAIN
svyby(~y, ~DOMAIN, DESIGN_OBJECT, svymean)
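A runnable sketch of the R commands above, using the `api` example data that ships with the survey package rather than course data. Treating elementary schools (`stype == "E"`) as the domain is purely illustrative. It contrasts subsetting the design object (correct) with subsetting the data frame before declaring the design (incorrect).

```r
library(survey)
data(api)  # example data shipped with the survey package

# One-stage cluster sample of school districts (dnum = PSU)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Correct: subset the design object, so all PSUs stay in the design
est_ok <- svymean(~api00, subset(dclus1, stype == "E"))

# Incorrect: subset the data frame first, then declare the design
d_bad   <- svydesign(id = ~dnum, weights = ~pw,
                     data = apiclus1[apiclus1$stype == "E", ], fpc = ~fpc)
est_bad <- svymean(~api00, d_bad)

# Point estimates agree; standard errors need not
c(ok = coef(est_ok), bad = coef(est_bad))
c(ok = SE(est_ok), bad = SE(est_bad))

# Estimates in all domains at once
svyby(~api00, ~stype, dclus1, svymean)
```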

Activity 16.1 (Part 1)

Domain Estimation and Comparison (Part 1)

Comparing Domains

So far in this course we have done a lot of estimating of means and totals and proportions, either for the whole population or (today) for domains
We have yet to do any formal comparisons – i.e., any hypothesis testing
If we want to compare two (or more) domains, often we will be using chi-square tests – since so much of survey data is categorical

I am going to assume you are familiar with (and comfortable with) chi-square tests in the context of infinite populations (i.e., what you’ve seen before this class), but stop me if you have questions!
We will start with an example that illustrates why taking into account the survey design is important when doing tests like a chi-square test

Example: Households With and Without Netflix

Suppose we have an SRS of $n$ = 500 co-habiting couples from a very large population (i.e., $N$ large enough to be able to ignore the FPC)
We want to compare two domains: People who have Netflix and People who do not have Netflix
Question: Are people who have Netflix less likely to subscribe to cable TV?
(We recognize that no causal link can be made with a cross-sectional survey!)

Data from our SRS of 500 co-habiting couples:

     netflix     Cable TV  No Cable TV         Total
 Has Netflix 57.49% (119) 42.51%  (88) 100.00% (207)
  No Netflix 64.16% (188) 35.84% (105) 100.00% (293)

Resulting estimates:
- 119/207 = 57.5% of couples with Netflix have cable TV
- 188/293 = 64.2% of couples without Netflix have cable TV
Notice that the number of people with (and without) Netflix would be different if we took another sample – random domain sample size!

Example (cont’d)

A chi-square test gives us:

chisq.test(dat$netflix, dat$cable, correct = FALSE)


    Pearson's Chi-squared test

data:  dat$netflix and dat$cable
X-squared = 2.281, df = 1, p-value = 0.131

Conclusion: There is no evidence of an association between having Netflix and having cable
(i.e., no evidence that the proportion of people with cable differs for people with/without Netflix)
Summarize the observed association with an odds ratio (OR) or prevalence ratio (PR)

Surveys are (usually) cross-sectional, so they measure prevalence rather than incident risk → prefer PR to RR (risk ratio)
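Worked from the counts in the SRS table, comparing couples with Netflix to couples without (point estimates only, no interval):

\[
\widehat{\text{PR}} = \frac{119/207}{188/293} = \frac{0.5749}{0.6416} \approx 0.90,
\qquad
\widehat{\text{OR}} = \frac{119 \times 105}{88 \times 188} = \frac{12495}{16544} \approx 0.76
\]

Both are below 1, in the direction of less cable among couples with Netflix, but the chi-square test above gives no evidence that they differ from 1.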

Example (cont’d)

Now, suppose we asked both partners about Netflix and cable
- We’ll have twice as many people (1000 responses, instead of 500), with the same proportions (assuming both partners answered the same), but a different test result:

     netflix     Cable TV  No Cable TV         Total
 Has Netflix 57.49% (238) 42.51% (176) 100.00% (414)
  No Netflix 64.16% (376) 35.84% (210) 100.00% (586)

chisq.test(dat_both$netflix, dat_both$cable, correct = FALSE)


    Pearson's Chi-squared test

data:  dat_both$netflix and dat_both$cable
X-squared = 4.5621, df = 1, p-value = 0.03269
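Both tests can be reproduced from the printed cell counts. The sketch below rebuilds `dat` and `dat_both` from those counts (the construction is an assumption; the original data objects are not shown) and confirms that duplicating every row doubles the Pearson statistic.

```r
# Rebuild the couple-level data from the printed cell counts
dat <- data.frame(
  netflix = rep(c("Has Netflix", "No Netflix"), times = c(207, 293)),
  cable   = c(rep(c("Cable TV", "No Cable TV"), times = c(119, 88)),
              rep(c("Cable TV", "No Cable TV"), times = c(188, 105)))
)

# Both partners give identical answers, so each row appears twice
dat_both <- rbind(dat, dat)

x2_couples <- chisq.test(dat$netflix, dat$cable, correct = FALSE)
x2_people  <- chisq.test(dat_both$netflix, dat_both$cable, correct = FALSE)

# Doubling every cell count doubles the Pearson statistic
unname(x2_people$statistic / x2_couples$statistic)
```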

Ignoring the clustering has inflated the test statistic value and lowered the p-value
ICC for this design is very high, in fact it is:
What is the DEFF for this design?
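A worked answer, under the stated assumption that both partners always give the same answer (so the ICC is $\rho = 1$) and using the usual approximation $\text{DEFF} = 1 + (M-1)\rho$ for clusters of equal size $M$ (here $M = 2$):

\[
\text{DEFF} = 1 + (M-1)\rho = 1 + (2-1)(1) = 2,
\qquad
n_{\text{eff}} = \frac{n}{\text{DEFF}} = \frac{1000}{2} = 500,
\qquad
\frac{X^2}{\text{DEFF}} = \frac{4.5621}{2} \approx 2.281
\]

Dividing the inflated statistic by the DEFF recovers the couple-level value: the 1000 responses carry only as much information as the original 500 couples.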

Chi-Square Tests for Complex Survey Data

Ignoring clustering can inflate test statistics and deflate p-values
- Remember, DEFF > 1 for many outcomes in a cluster sample
- A cluster sample of size $n$ provides the same precision as an SRS of size $n$/DEFF
  - Effective sample size is smaller than $n$
- Thus you are overstating your significance (p-value too small)

Ignoring stratification can deflate test statistics and inflate p-values
- Remember, DEFF < 1 for many outcomes in a stratified sample
- A stratified sample of size $n$ provides the same precision as an SRS of size $n$/DEFF
  - Effective sample size is larger than $n$
- Thus you are being too conservative (p-value too high)

The “worse crime” (usually) is ignoring clustering and “overstating” results (inflate Type 1 error)

Corrections to Chi-Square Tests for Complex Survey Data

For an SRS in an infinite population:
- $X^2$ from a chi-square test has a $\chi^2_{(r-1)(c-1)}$ distribution under $H_0$
- Thus, $E[X^2] = (r-1)(c-1)$
  - (the expected value of a chi-square random variable is equal to its DF)
For complex survey designs:
- $X^2$ from a chi-square test does not follow a $\chi^2_{(r-1)(c-1)}$ distribution under $H_0$

However, we can adjust the test statistic $X^2$ in order to make a rescaled test statistic have an approximately $\chi^2$ distribution
Two methods used most often (not very creatively named):
- First Order Correction (also called Rao-Scott first order correction)
- Second Order Correction (also called Rao-Scott second order correction)

Option 1: (Rao-Scott) First Order Correction

Idea: Match the mean of the test statistic $X^2$ (under the complex design) to the mean of a $\chi^2_{(r-1)(c-1)}$ distribution, thus “rescaling” the test statistic
“First Order” because we’re rescaling based on the mean
1. Calculate $E[X^2]$ = the expected value of $X^2$ for the complex design under $H_0$
2. Calculate first order corrected test statistic as $X^2_F = \frac{(r-1)(c-1) X^2}{E[X^2]}$
3. Compare $X^2_F$ to a $\chi^2_{(r-1)(c-1)}$ distribution
  - In practice, works better to compare $X^2_F/[(r-1)(c-1)]$ to an $F$ distribution

Calculation of $E[X^2]$ depends on the cell and marginal probabilities and the DEFF for estimating each of these probabilities (there’s the survey design part!)

Slight problem: we are matching the mean of the distribution – but p-values often come from the tail of the distribution (at least the p-values we care about most!)
- Variance of $X^2_F$ is actually larger than variance of $\chi^2_{(r-1)(c-1)} \rightarrow$ p-values from $X^2_F$ are slightly smaller than they should be
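As an illustration, in the special case where every cell and margin estimate has the same design effect $d$, $E[X^2] \approx d\,(r-1)(c-1)$ and the first order correction reduces to dividing $X^2$ by $d$. For the two-partner Netflix data, taking $d = 2$ (an assumption that follows from identical answers within couples) and $(r-1)(c-1) = 1$:

\[
X^2_F = \frac{(r-1)(c-1)\,X^2}{E[X^2]} \approx \frac{(1)(4.5621)}{(2)(1)} \approx 2.281
\]

This matches the statistic from the couple-level SRS. In real designs the design effects differ across cells, and $E[X^2]$ has to be estimated from the data.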

Option 2: (Rao-Scott) Second Order Correction

Idea: Match the mean and variance of the test statistic $X^2$ (under the complex design) to the mean and variance of a $\chi^2_{(r-1)(c-1)}$ distribution, thus “rescaling” the test statistic
“Second Order” because we’re rescaling based on the mean and variance
1. Calculate $E[X^2]$ = the expected value of $X^2$ for the complex design under $H_0$
2. Calculate $V[X^2]$ = the variance of $X^2$ under $H_0$
3. Calculate second order corrected test statistic as $X^2_S = \frac{\nu X^2_F}{(r-1)(c-1)}$ where $\nu = 2\frac{(E[X^2])^2}{V[X^2]}$
4. Compare $X^2_S$ to a $\chi^2_{\nu}$ distribution
  - In practice, works better to compare $X^2_S/\nu$ to an $F$ distribution
- Similar to the Satterthwaite approximation for ANOVA

Calculation of $V[X^2]$ even more involved than $E[X^2]$, requires covariance matrix of estimated cell proportions
For many software packages (Stata, R), the second order correction is the default
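A short sketch of requesting each correction in R with `svychisq()` from the survey package, again using the package's `api` example data rather than course data (the choice of `sch.wide` and `stype` as table variables is illustrative). The `statistic` argument selects the correction: `"F"` is the default (second order), and `"Chisq"` gives the first order version.

```r
library(survey)
data(api)  # example data shipped with the survey package

# One-stage cluster sample of school districts (dnum = PSU)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Second order Rao-Scott correction, referred to an F distribution (default)
rs2 <- svychisq(~sch.wide + stype, dclus1, statistic = "F")

# First order Rao-Scott correction, referred to a chi-square distribution
rs1 <- svychisq(~sch.wide + stype, dclus1, statistic = "Chisq")

c(second_order = unname(rs2$p.value), first_order = unname(rs1$p.value))
```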

Activity 16.1 (Part 2)

Domain Estimation and Comparison (Part 2)