Linear Regression
with Survey Data
PUBHBIO 7225 Lecture 20
Until now, we’ve estimated totals, means, and proportions for a population or subpopulation
For these types of estimates, it is (hopefully by now) obvious why we need to use weights to account for the survey design
Broad generalization: weighting “matters” more for estimating marginal quantities (means/proportions/totals) than for estimating relationships between/among variables (e.g., via regression)
Thought experiment:
I want to learn about student loan balances among OSU students
I sample students via a design that oversamples students from out of state (who pay higher tuition)
If I ignore the design and estimate the following quantities, under what situations might they be biased?
Mean loan balance, population of all OSU students
Mean loan balance, population of OSU students from outside Ohio (out-of-state)
Difference in mean loan balance between in-state and out-of-state students
Unweighted regression estimator is unbiased for true linear regression parameters if:
Mean model is correctly specified (correct \(X\)s), and
Sampling is ignorable – meaning that sample inclusion is independent of the outcome given the covariates: \(Z_i \perp\!\!\!\perp y_i \mid x_i\)
Weighted regression estimator protects us if one of these conditions is not met
If model is not correctly specified, weighted estimates give us an unbiased estimate of the mis-specified relationship (whereas unweighted estimates would be biased)
If sampling is non-ignorable, unweighted estimates will generally be biased, but weighted estimates will be unbiased
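The non-ignorable case can be illustrated with a small simulation (a sketch in Python, not from the lecture): a finite population with a known linear relationship is sampled with inclusion probabilities that depend on \(y\) itself, and the unweighted slope is biased while the inverse-probability-weighted slope is not.

```python
import numpy as np

# Hypothetical finite population with a known linear relationship
rng = np.random.default_rng(0)
N = 100_000
x = rng.uniform(0, 10, N)
y = 2 + 3 * x + rng.normal(0, 5, N)          # true slope is 3

# Non-ignorable design: units with large y are oversampled (5x the
# inclusion probability), even after conditioning on x
p = np.where(y > np.median(y), 0.10, 0.02)
s = rng.random(N) < p
xs, ys, w = x[s], y[s], 1.0 / p[s]           # w = inverse inclusion probability

def slope(x, y, w):
    """Weighted simple-linear-regression slope; w = 1 gives ordinary OLS."""
    xb = np.average(x, weights=w)
    yb = np.average(y, weights=w)
    return np.sum(w * (x - xb) * (y - yb)) / np.sum(w * (x - xb) ** 2)

b_unw = slope(xs, ys, np.ones_like(xs))      # ignores the design: biased
b_wtd = slope(xs, ys, w)                     # design-weighted: near 3
print(b_unw, b_wtd)
```

With this seed the weighted estimate lands much closer to the true slope of 3 than the unweighted one, which is attenuated by the oversampling of high-\(y\) units.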
Should we always use weighted regression?
Classic example: Gestational age (GA) and birthweight (BW) in the 1988 National Maternal and Infant Health Survey (NMIHS)
NMIHS = Probability sample of babies whose mothers were aged 15 and older in the U.S.
Oversampled African-American and low-birthweight babies
Sample size roughly 10,000 babies
Comparing weighted and unweighted analysis using BW (g) to predict GA (weeks):
| Parameter | Weighted Estimate (SE) | Unweighted Estimate (SE) |
|---|---|---|
| Intercept | 32.08 (.17) | 26.36 (.14) |
| Birthweight (g) | .00215 (.00005) | .00383 (.00005) |
Weighted: 100 gram reduction in BW associated with a 0.215 week (1.5 day) decrease in GA
Unweighted: 100 gram reduction in BW associated with a 0.383 week (2.7 day) decrease in GA
(Figure: GA vs. BW scatterplot with unweighted and weighted fitted regression lines)
True relationship appears to be quadratic – but model misspecified as linear
Weighted model estimates give the best estimate of the linear approximation
Alternative: include a quadratic term in the model (thus weighted \(\approx\) unweighted)
One approach: include all design variables as predictors, along with any variables involved in creating the weights, and use unweighted regression
This by definition makes the sampling ignorable
This would potentially mean including in the model:
Stratum indicators
Cluster indicators (via random effects, perhaps)
Variables involved in nonresponse adjustments, poststratification/raking, etc.
A daunting task, especially for surveys with complex designs and post-survey weight adjustments
If it can be done, regression coefficient estimates and SEs will be unbiased
Sometimes (often), however, true design variables are not provided to end users, or weights are otherwise manipulated
“Pseudo-strata” instead of the true strata
“Masked variance units (MVUs)” = “Pseudo-PSUs” instead of the true PSUs
Additional manipulations to weights also common
So this approach isn’t actually possible most of the time with publicly available data
Sometimes we might be interested in predictors of a variable that was used in the sampling design
Example: Birthweight (BW) and smoking in the 1988 NMIHS
Comparing weighted and unweighted analyses using maternal smoking status to predict BW (g), via ANCOVA:
| Mother Smoked? | Weighted Mean (SE) | Unweighted Mean (SE) |
|---|---|---|
| No | 3409 (8.0) | 2923 (12) |
| Yes | 3232 (12) | 2664 (18) |
| Difference | 177 (15) | 260 (21) |
Difference between weighted and unweighted is due to the oversampling of low birthweight babies
Can’t “adjust this away” with predictors – because the outcome (birthweight) is the design variable!
General consensus is you should use weights in the regression when:
The weight is a function of the dependent (outcome) variable
You don’t know enough about how the weights were constructed
(may be the case for some publicly available datasets)
Large sample size – even if weights are not strictly necessary, the loss of power from using weights won’t be as much of a concern
Possible ways to check whether weighted regression is necessary:
Perform both weighted and unweighted regressions, and compare coefficient estimates
Add the weight, and interactions of the weight with each predictor, to the model, and test whether those added terms are significant
From a finite population perspective, linear regression is a fitting of a line to a finite population of \(N\) units
Thinking specifically about a simple linear regression model, we have the best fitting least squares regression line: \[y = B_0 + B_1 x\]
There is no assumption of normality and no assumption about (equal) variance
\[\begin{aligned} \text{Slope: }B_1 &= \frac{\sum_{i=1}^N x_i y_i - \frac{1}{N} \left(\sum_{i=1}^N x_i \right) \left(\sum_{i=1}^N y_i\right)}{\sum_{i=1}^N x_i^2 - \frac{1}{N} \left(\sum_{i=1}^N x_i \right)^2} &= \frac{t_{xy} - \frac{t_x t_y}{N}}{t_{x^2} - \frac{(t_x)^2}{N}}\\ \text{Intercept: }B_0 &= \frac{1}{N} \sum_{i=1}^N y_i - B_1 \frac{1}{N} \sum_{i=1}^N x_i &= \frac{t_y-B_1 t_x}{N} \end{aligned}\]
\[\begin{aligned} \hat{B}_1 &= \frac{\hat{t}_{xy} - \frac{\hat{t}_x \hat{t}_y}{\hat{N}}}{\hat{t}_{x^2} - \frac{(\hat{t}_x)^2}{\hat{N}}} = \frac{\sum_{i \in \mathcal{S}} w_i x_i y_i - \frac{\left(\sum_{i \in \mathcal{S}} w_i x_i \right) \left(\sum_{i \in \mathcal{S}} w_i y_i\right)}{\sum_{i \in \mathcal{S}} w_i} }{\sum_{i \in \mathcal{S}} w_i x_i^2 - \frac{\left(\sum_{i \in \mathcal{S}} w_i x_i \right)^2}{\sum_{i \in \mathcal{S}} w_i} } \\ \hat{B}_0 &= \frac{\hat{t}_y-\hat{B}_1 \hat{t}_x}{\hat{N}} = \frac{\sum_{i \in \mathcal{S}} w_i y_i - \hat{B}_1\sum_{i \in \mathcal{S}} w_i x_i}{\sum_{i \in \mathcal{S}} w_i} \end{aligned}\]
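As a numerical check (a sketch in Python with made-up data and hypothetical weights, not the lecture's R workflow), the plug-in totals formulas above give exactly the same point estimates as solving the weighted normal equations directly:

```python
import numpy as np

# Made-up sample with hypothetical survey weights
rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 1, n)
y = 1 + 2 * x + rng.normal(0, 0.3, n)
w = rng.uniform(1, 10, n)

# Estimated totals, as in the formulas above
Nhat = w.sum()
t_x  = (w * x).sum()
t_y  = (w * y).sum()
t_xy = (w * x * y).sum()
t_x2 = (w * x ** 2).sum()

B1 = (t_xy - t_x * t_y / Nhat) / (t_x2 - t_x ** 2 / Nhat)
B0 = (t_y - B1 * t_x) / Nhat

# The same point estimates solve the weighted normal equations
# (X'WX) beta = X'Wy  with X = [1, x]
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print([B0, B1], beta)
```

The two computations agree to floating-point precision, which is also why the survey-weighted *point* estimates look like WLS output even though the variance estimation differs.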
Once again we can use linearization to estimate standard errors
\(\hat{B}_0, \hat{B}_1\) are non-linear functions of totals
Use linearization to approximate these quantities as linear functions of totals
Then, estimate the variance of this linear quantity – since we know how to estimate the variance for a total from any complex sample design
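As a sketch of the standard linearization argument (with \(e_i = y_i - B_0 - B_1 x_i\) denoting the finite-population residuals and \(\bar{x}_U\) the population mean of \(x\)), a first-order Taylor expansion gives

\[\hat{B}_1 - B_1 \;\approx\; \frac{\hat{t}_q}{\sum_{i=1}^N (x_i - \bar{x}_U)^2},
\qquad \hat{t}_q = \sum_{i \in \mathcal{S}} w_i q_i, \quad q_i = (x_i - \bar{x}_U)\,e_i,\]

since the population total \(t_q = \sum_{i=1}^N q_i = 0\) by the least-squares normal equations. The right-hand side is a constant times an estimated total, so \(\widehat{V}(\hat{B}_1)\) is the design-based variance estimate for \(\hat{t}_q\) divided by the squared denominator – with \(\bar{x}_U\), the denominator, and \(e_i\) all replaced by their sample-weighted estimates in practice.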
Note #1: Subtle difference in CI interpretation from infinite population inference
Note #2: Survey-Weighted Regression is not Weighted Least Squares (WLS) – the point estimates coincide with WLS using the survey weights, but the standard errors are computed very differently
In the infinite population setting:
Estimating a mean: DF = \(n-1\)
Estimating a regression coefficient: DF = \(n-1-p\) (\(p\) = # coefficients excluding the intercept)
For finite population sampling:
Estimating a mean: DF = # PSUs \(-\) # Strata
Estimating a regression coefficient:
Option 1: DF = # PSUs \(-\) # Strata \(-\) \(p\)
Option 2: DF = # PSUs \(-\) # Strata (no extra subtraction of \(p\))
Option 1 is more conservative (the smaller DF gives wider confidence intervals and larger p-values)
Different software does different things by default…
Sometimes (for designs with small DF) these differences in DF can produce noticeably different p-values
Data from Lecture 13
Two-stage cluster sample (no stratification)
Sampled 38 PSUs (districts)
Sampled 1-2 SSUs (schools) per PSU (district)
Total of 65 SSUs (schools) sampled
Regression of interest is predicting API (performance index) from percent English language learners and school type (Elementary/Middle/High School)
Option 1: DF = # PSUs \(-\) # Strata \(-\) # predictors excluding intercept = \(38 - 1 - 3 = 34\)
Option 2: DF = # PSUs \(-\) # Strata = \(38 - 1 = 37\)
DF = # PSUs \(-\) # Strata = 37
```
Call:
svyglm(formula = api00 ~ ell + stype, design = des)

Survey design:
svydesign(id = ~dnum + snum, weight = ~wt, fpc = ~N + Mi, data = samp)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  797.7698    27.6790  28.822  < 2e-16 ***
ell           -4.5490     0.4874  -9.333 2.91e-11 ***
stypeM       -92.7395    29.5132  -3.142  0.00329 ** 
stypeH       -42.5118    26.6213  -1.597  0.11879    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 4304.603)

Number of Fisher Scoring iterations: 2
```
DF = # PSUs \(-\) # Strata \(-\) \(p\) = 34
```
Call:
svyglm(formula = api00 ~ ell + stype, design = des)

Survey design:
svydesign(id = ~dnum + snum, weight = ~wt, fpc = ~N + Mi, data = samp)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  797.7698    27.6790  28.822  < 2e-16 ***
ell           -4.5490     0.4874  -9.333 6.63e-11 ***
stypeM       -92.7395    29.5132  -3.142  0.00346 ** 
stypeH       -42.5118    26.6213  -1.597  0.11954    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 4304.603)

Number of Fisher Scoring iterations: 2
```
The \(t\)-distributions with 34 and 37 degrees of freedom are quite similar, so it doesn’t make much difference which DF you use here
But in a design with fewer PSUs, it might matter more
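To see how much the DF choice moves the p-values, they can be recomputed from the \(t\) statistics reported in the output above (a quick check in Python with scipy, rather than part of the lecture's R workflow):

```python
from scipy import stats

# t statistics reported by svyglm above
t_stats = {"ell": -9.333, "stypeM": -3.142, "stypeH": -1.597}

# Two-sided p-values under each candidate DF
for df in (37, 34):
    pvals = {k: 2 * stats.t.sf(abs(tv), df) for k, tv in t_stats.items()}
    print(df, pvals)
```

The df = 34 p-values are uniformly a bit larger than the df = 37 ones, matching the two summaries above; for `stypeH` the difference (0.11879 vs. 0.11954) is tiny here but would grow in designs with fewer PSUs.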
Interested in more perspectives on whether or not to use weights in analyzing survey data?
Bollen KA, Biemer PP, Karr AF, Tueller S, and Berzofsky ME (2016). Are survey weights needed? A review of diagnostic tests in regression analysis. Annual Review of Statistics and Its Application, 3:375-392.
Solon G, Haider SJ, and Wooldridge JM (2015). What are we weighting for? Journal of Human Resources, 50:301-316.
Gelman A (2007). Struggles with survey weighting and regression modeling. Statistical Science, 22(2):153-164.
Winship C and Radbill L (1994). Sampling weights and regression analysis. Sociological Methods and Research, 23:230-257.
(PDFs on Carmen)
Linear Regression with NHANES
PUBHBIO 7225