Logistic/Log-Binomial Regression
with Survey Data
PUBHBIO 7225 Lecture 21
In linear regression, the outcome (response variable) is at least approximately continuous
In surveys, many (if not most) outcomes of interest are categorical
Natural approaches are generalized linear models that extend linear regression to allow the outcome to be binary or ordinal
We will consider two such regression models, in the survey context:
Logistic regression
Log-binomial regression
I assume you have seen logistic regression before
You may not have seen log-binomial regression; that’s okay
Outcome = binary \(Y\)
Predictor = continuous or discrete or binary \(X\) (or multiple \(X\)s)
Logistic regression model: \[\log \underbrace{\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right)}_{\text{odds}} = \mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\]
The logit link guarantees the predicted probability satisfies \(0 < P(Y=1|X) < 1\)
Interpretation of unknown parameters \((\beta_0, \beta_1)\) comes via exponentiation (to get rid of the log)
Model: \(\mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\)
Interpreting the “slope”: \(\beta_1\)
\(\beta_1\) = log-odds ratio for a 1 unit increase in \(X\)
\(e^{\beta_1}\) = odds ratio for a 1 unit increase in \(X\)
\(X\) binary:
The odds of \(Y=1\) for the group with \(X=1\) are \(e^{\beta_1}\) times the odds for the group with \(X=0\)
Similar interpretation if \(X\) is an indicator variable for one category of a multi-category categorical variable – compares odds for that category to the reference category
\(X\) continuous/discrete:
Can sound awkward to talk about “odds”
But… do not substitute “risk” or “more likely”!
Model: \(\mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\)
Interpreting the intercept: \(\beta_0\)
\(\beta_0\) = intercept = log-odds that \(Y=1\) when \(X=0\)
\(e^{\beta_0}\) = odds that \(Y=1\) when \(X=0\)
Not super-useful interpretations!
Easier to re-write as: \(\displaystyle P(Y=1|X=0) = \frac{e^{\beta_0}}{1+e^{\beta_0}}\)
The probability of \(Y=1\) when \(X=0\) is \(\frac{e^{\beta_0}}{1+e^{\beta_0}}\)
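As a quick numeric check, here is this back-transformation in R; the value \(\hat\beta_0 = -1.6\) is a made-up log-odds for illustration (close to the intercept fitted later in this lecture):

```r
# Convert an intercept on the log-odds scale to a probability
b0 <- -1.6                  # hypothetical log-odds of Y=1 when X = 0
exp(b0) / (1 + exp(b0))     # P(Y=1 | X=0), about 0.17
plogis(b0)                  # same inverse-logit, via the built-in function
```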
With survey data (from probability sample), similar scenario to linear regression
Complex sampling design will affect point estimates and standard errors
Remember: logistic regression with one binary predictor is essentially equivalent to a chi-square test
If you’ve taken epidemiology, this may sound familiar… case-control studies!
Example: Divide population into 2 strata, people with lung cancer (cases) and people without lung cancer (controls), and select sample from each stratum
Lung cancer is rare → probability of selection for cases is much larger than for controls → sampling weights are smaller for cases than controls
If primary interest is in estimating effects of age, smoking history, etc., on presence of cancer in a logistic regression, the unequal sampling does not matter
If the model is correctly specified, only difference between an unweighted and weighted analysis will be in the intercept
Would you ever use \(\hat{\beta}_0\) to estimate the prevalence of lung cancer in such a study? No!
(But if you had the sampling weights you could!)
How do we actually get the survey-weighted (“design-based”) estimates?
Unlike linear regression, no “closed form” equation for \(\hat{\beta}\)s
In the infinite population (model-based) world, use maximum likelihood (ML) estimation
In the finite population world, the true \(\beta\)s – denoted with Roman letters, \(B\) – are what you’d get if you calculated ML estimates using the full population
Specifically, solving this system of equations for \(B\): \[\sum_{i=1}^N x_{ij} \left[y_i - \frac{\exp(\mathbf x_i^T \mathbf B)}{1+\exp(\mathbf x_i^T \mathbf B)} \right] = 0 \quad \text{for } j=1,\dots,p\] produces the true population quantities (\(\mathbf B\))
To obtain estimates we solve the system of equations using the sample and the sampling weights: \[\sum_{i \in \mathcal{S}} w_i x_{ij} \left[y_i - \frac{\exp(\mathbf x_i^T \hat{\mathbf B})}{1+\exp(\mathbf x_i^T \hat{\mathbf B})} \right] = 0 \quad \text{for } j=1,\dots,p\]
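In R, `svyglm()` from the `survey` package solves these weighted score equations for us. A sketch, using the design variables (`samp`, `dnum`, `wt`, `N`) from this lecture's school example:

```r
# Design-based logistic regression with the survey package.
# svyglm() solves the weighted estimating equations shown above.
library(survey)
des <- svydesign(id = ~dnum, weight = ~wt, fpc = ~N, data = samp)
fit <- svyglm(highapi ~ I(1 - highell), design = des,
              family = quasibinomial())  # quasibinomial() avoids the
                                         # "non-integer #successes" warning
summary(fit)                             # SEs come from linearization
```

Using `family = binomial` (as in the output later in this lecture) gives the same point estimates; `quasibinomial()` just suppresses a harmless warning about weighted binary outcomes.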
Variance estimation can be done via linearization
Logistic regression gives us odds ratios
Most surveys are cross-sectional, and we can alternatively calculate prevalence ratios
Example: 2-stage cluster sample of schools (SSUs) within school districts (PSUs) from Lecture 13
Stage 1: \(n=38\) PSUs out of \(N=757\) (districts) – roughly 5% of total
Stage 2: \(m_i=2\) SSUs (schools) per PSU (district)
Result is a total of 65 SSUs
Interested in two binary measures:
           mean      SE
highapi  0.4248   0.1171
highell  0.47764  0.1124
Based on the sample, we estimate that:
42.5% of schools have a high API score (API score >700)
47.8% of schools have a high percent of English language learners (>25% ELL)
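These estimates come from `svymean()`; a sketch, assuming the design object `des` defined by the `svydesign()` call shown later in this lecture:

```r
# Design-based estimates of the two binary measures
svymean(~highapi, des)   # about 0.425 (SE 0.117)
svymean(~highell, des)   # about 0.478 (SE 0.112)
```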
Let’s see if having high API is associated with the percentage of ELL
Crosstabulation
  highell   highapi        se
0       0 0.6614786 0.1115408
1       1 0.1659574 0.1196093
Chi-square test
Pearson's X^2: Rao & Scott adjustment
data: svychisq(~highapi + highell, design = des)
F = 7.9016, ndf = 1, ddf = 37, p-value = 0.007852
Comparing schools without high percent ELL to schools with high percent ELL:
Interpretations:
The odds of having high API for schools with a low percent of ELL are 9.8 times the odds for schools with a high percent of ELL.
Schools with a low percent of ELL are 4 times as likely to have high API as schools with a high percent of ELL.
The odds ratio can feel “exaggerated,” and “odds” are hard to interpret
The prevalence ratio is easier (more intuitive) to interpret
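Both quantities can be recovered directly from the estimated proportions in the crosstabulation above:

```r
# Proportions of schools with high API, by ELL group (from the crosstab)
p_low  <- 0.6614786   # P(high API | low percent ELL)
p_high <- 0.1659574   # P(high API | high percent ELL)

(p_low / (1 - p_low)) / (p_high / (1 - p_high))   # OR, about 9.8
p_low / p_high                                    # PR, about 4.0
```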
1 - level Cluster Sampling design
With (38) clusters.
svydesign(id = ~dnum, weight = ~wt, fpc = ~N, data = samp)
Call: svyglm(formula = highapi ~ I(1 - highell), design = des, family = binomial)
Coefficients:
(Intercept) I(1 - highell)
-1.615 2.284
Degrees of Freedom: 64 Total (i.e. Null); 36 Residual
Null Deviance: 88.63
Residual Deviance: 71.37 AIC: 66.96
OR = \(e^{2.284}\) = 9.82
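In R, the same exponentiation (with a design-based confidence interval) can be done from the fitted model object; `fit` here assumes the `svyglm()` call above was stored, e.g. `fit <- svyglm(highapi ~ I(1 - highell), design = des, family = binomial)`:

```r
# Odds ratio and design-based CI from the fitted svyglm object
exp(coef(fit)["I(1 - highell)"])   # about 9.82
exp(confint(fit))                  # CIs back-transformed to the OR scale
```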
We can also perform a regression analysis that gives us prevalence ratios (PRs) instead of odds ratios
In the logistic regression model, we have: \[\log \underbrace{\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right)}_{\text{odds}} = \mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\]
By using a different transformation of \(P(Y=1|X)\), we have log-binomial regression: \[\log \underbrace{\left(P(Y=1|X)\right)}_{\text{prevalence or risk}} = \beta_0 + \beta_1 X\]
Note that this different transformation (different link function) does not have such nice theoretical properties: \(E(Y|X)\) is no longer bounded between 0 and 1 (what are the bounds?)
However, unless \(P(Y=1|X)\) is close to 0 or 1 it is generally well-behaved
Interpretation: \(e^{\beta_1}\) = prevalence ratio (PR) for a 1 unit increase in \(X\)
Same ideas apply for log-binomial regression as for logistic when you have survey data
The trick for estimation is that you can use Poisson regression to obtain the estimates (even though the model is log-binomial)
Note that in the infinite population world, you must use robust standard errors (also called “sandwich estimators”) for proper variance estimation if you use Poisson regression to fit a log-binomial model (otherwise too small)
But in the finite population world (survey data), we are not relying on a distributional assumption, and variances derived from linearization are fine (using Poisson)
1 - level Cluster Sampling design
With (38) clusters.
svydesign(id = ~dnum, weight = ~wt, fpc = ~N, data = samp)
Call: svyglm(formula = highapi ~ I(1 - highell), design = des, family = poisson)
Coefficients:
(Intercept) I(1 - highell)
-1.796 1.383
Degrees of Freedom: 64 Total (i.e. Null); 36 Residual
Null Deviance: 47.28
Residual Deviance: 37.07 AIC: 96.3
PR = \(e^{1.383}\) = 3.99
Logistic Regression and Log-Binomial Regression with NHANES