Logistic/Log-Binomial Regression
with Survey Data
PUBHBIO 7225 Lecture 21
In linear regression, the outcome (response variable) is at least approximately continuous
In surveys, many (if not most) outcomes of interest are categorical
Natural approaches are generalized linear models that extend linear regression to allow the outcome to be binary or ordinal
We will consider two such regression models, in the survey context:
Logistic regression
Log-binomial regression
I assume you have seen logistic regression before
You may not have seen log-binomial regression; that’s okay
Outcome = binary \(Y\)
Predictor = continuous or discrete or binary \(X\) (or multiple \(X\)s)
Logistic regression model: \[\log \underbrace{\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right)}_{\text{odds}} = \mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\]
The logit link guarantees the predicted probability satisfies \(0 < P(Y=1|X) < 1\)
Interpretation of unknown parameters \((\beta_0, \beta_1)\) comes via exponentiation (to get rid of the log)
Model: \(\mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\)
Interpreting the “slope”: \(\beta_1\)
\(\beta_1\) = log-odds ratio for a 1 unit increase in \(X\)
\(e^{\beta_1}\) = odds ratio for a 1 unit increase in \(X\)
\(X\) binary:
The odds of \(Y=1\) for the group with \(X=1\) are \(e^{\beta_1}\) times the odds for the group with \(X=0\)
Similar interpretation if \(X\) is an indicator variable for one category of a multi-category categorical variable – compares odds for that category to the reference category
\(X\) continuous/discrete:
Can sound awkward to talk about “odds”
But… do not substitute “risk” or “more likely”!
Model: \(\mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\)
Interpreting the intercept: \(\beta_0\)
\(\beta_0\) = intercept = log-odds that \(Y=1\) when \(X=0\)
\(e^{\beta_0}\) = odds that \(Y=1\) when \(X=0\)
Not super-useful interpretations!
Easier to re-write as: \(\displaystyle P(Y=1|X=0) = \frac{e^{\beta_0}}{1+e^{\beta_0}}\)
The probability of \(Y=1\) when \(X=0\) is \(\frac{e^{\beta_0}}{1+e^{\beta_0}}\)
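As a quick numeric check, here is this back-transformation in R; the value \(\hat\beta_0 = -1.6\) is a made-up log-odds for illustration (close to the intercept fitted later in this lecture):

```r
# Convert an intercept on the log-odds scale to a probability
b0 <- -1.6                  # hypothetical log-odds of Y=1 when X = 0
exp(b0) / (1 + exp(b0))     # P(Y=1 | X=0), about 0.17
plogis(b0)                  # same inverse-logit, via the built-in function
```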
With survey data (from probability sample), similar scenario to linear regression
Complex sampling design will affect point estimates and standard errors
Remember: logistic regression with one binary predictor is essentially equivalent to a chi-square test
If you’ve taken epidemiology, this may sound familiar… case-control studies!
Example: Divide population into 2 strata, people with lung cancer (cases) and people without lung cancer (controls), and select sample from each stratum
Lung cancer is rare → probability of selection for cases is much larger than for controls → sampling weights are smaller for cases than controls
If primary interest is in estimating effects of age, smoking history, etc., on presence of cancer in a logistic regression, the unequal sampling does not matter
If the model is correctly specified, only difference between an unweighted and weighted analysis will be in the intercept
Would you ever use \(\hat{\beta}_0\) to estimate the prevalence of lung cancer in such a study? No!
(But if you had the sampling weights you could!)
How do we actually get the survey-weighted (“design-based”) estimates?
Unlike linear regression, no “closed form” equation for \(\hat{\beta}\)s
In the infinite population (model-based) world, use maximum likelihood (ML) estimation
In the finite population world, the true \(\beta\)s – denoted with Roman letters, \(B\) – are what you’d get if you calculated ML estimates using the full population
Specifically, solving this system of equations for \(B\): \[\sum_{i=1}^N x_{ij} \left[y_i - \frac{\exp(\mathbf x_i^T \mathbf B)}{1+\exp(\mathbf x_i^T \mathbf B)} \right] = 0 \quad \text{for } j=1,\dots,p\] produces the true population quantities (\(\mathbf B\))
To obtain estimates we solve the system of equations using the sample and the sampling weights: \[\sum_{i \in \mathcal{S}} w_i x_{ij} \left[y_i - \frac{\exp(\mathbf x_i^T \hat{\mathbf B})}{1+\exp(\mathbf x_i^T \hat{\mathbf B})} \right] = 0 \quad \text{for } j=1,\dots,p\]
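In R, `svyglm()` from the `survey` package solves these weighted score equations for us. A sketch, using the design variables (`samp`, `dnum`, `wt`, `N`) from this lecture's school example:

```r
# Design-based logistic regression with the survey package.
# svyglm() solves the weighted estimating equations shown above.
library(survey)
des <- svydesign(id = ~dnum, weight = ~wt, fpc = ~N, data = samp)
fit <- svyglm(highapi ~ I(1 - highell), design = des,
              family = quasibinomial())  # quasibinomial() avoids the
                                         # "non-integer #successes" warning
summary(fit)                             # SEs come from linearization
```

Using `family = binomial` (as in the output later in this lecture) gives the same point estimates; `quasibinomial()` just suppresses a harmless warning about weighted binary outcomes.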
Variance estimation can be done via linearization
Logistic regression gives us odds ratios
Most surveys are cross-sectional, and we can alternatively calculate prevalence ratios
Example: 2-stage cluster sample of schools (SSUs) within school districts (PSUs) from Lecture 13
Stage 1: \(n=38\) PSUs out of \(N=757\) (districts) – roughly 5% of total
Stage 2: \(m_i=2\) SSUs (schools) per PSU (district)
Result is a total of 65 SSUs
Interested in two binary measures:
           mean      SE
highapi  0.4248   0.1171
highell  0.47764  0.1124
Based on the sample, we estimate that:
42.5% of schools have a high API score (API score >700)
47.8% of schools have a high percent of English language learners (>25% ELL)
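These estimates come from `svymean()`; a sketch, assuming the design object `des` defined by the `svydesign()` call shown later in this lecture:

```r
# Design-based estimates of the two binary measures
svymean(~highapi, des)   # about 0.425 (SE 0.117)
svymean(~highell, des)   # about 0.478 (SE 0.112)
```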
Let’s see if having high API is associated with the percentage of ELL
Crosstabulation
  highell   highapi        se
0       0 0.6614786 0.1115408
1       1 0.1659574 0.1196093
Chi-square test
Pearson's X^2: Rao & Scott adjustment
data: svychisq(~highapi + highell, design = des)
F = 7.9016, ndf = 1, ddf = 37, p-value = 0.007852
Comparing schools without high percent ELL to schools with high percent ELL:
Interpretations:
The odds of having high API for schools with a low percent of ELL are 9.8 times the odds for schools with a high percent of ELL.
Schools with a low percent of ELL are 4 times as likely to have high API as schools with a high percent of ELL.
The odds ratio can feel “exaggerated,” and “odds” are hard to interpret
The prevalence ratio is easier (more intuitive) to interpret
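Both quantities can be recovered directly from the estimated proportions in the crosstabulation above:

```r
# Proportions of schools with high API, by ELL group (from the crosstab)
p_low  <- 0.6614786   # P(high API | low percent ELL)
p_high <- 0.1659574   # P(high API | high percent ELL)

(p_low / (1 - p_low)) / (p_high / (1 - p_high))   # OR, about 9.8
p_low / p_high                                    # PR, about 4.0
```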
1 - level Cluster Sampling design
With (38) clusters.
svydesign(id = ~dnum, weight = ~wt, fpc = ~N, data = samp)
Call: svyglm(formula = highapi ~ I(1 - highell), design = des, family = binomial)
Coefficients:
(Intercept) I(1 - highell)
-1.615 2.284
Degrees of Freedom: 64 Total (i.e. Null); 36 Residual
Null Deviance: 88.63
Residual Deviance: 71.37 AIC: 66.96
OR = \(e^{2.284}\) = 9.82
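In R, the same exponentiation (with a design-based confidence interval) can be done from the fitted model object; `fit` here assumes the `svyglm()` call above was stored, e.g. `fit <- svyglm(highapi ~ I(1 - highell), design = des, family = binomial)`:

```r
# Odds ratio and design-based CI from the fitted svyglm object
exp(coef(fit)["I(1 - highell)"])   # about 9.82
exp(confint(fit))                  # CIs back-transformed to the OR scale
```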
We can also perform a regression analysis that gives us prevalence ratios (PRs) instead of odds ratios
In the logistic regression model, we have: \[\log \underbrace{\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right)}_{\text{odds}} = \mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\]
By using a different transformation of \(P(Y=1|X)\), we have log-binomial regression: \[\log \underbrace{\left(P(Y=1|X)\right)}_{\text{prevalence or risk}} = \beta_0 + \beta_1 X\]
Note that this different transformation (different link function) does not have such nice theoretical properties: \(E(Y|X)\) is no longer bounded between 0 and 1 (what are the bounds?)
However, unless \(P(Y=1|X)\) is close to 0 or 1 it is generally well-behaved
Interpretation: \(e^{\beta_1}\) = prevalence ratio (PR) for a 1 unit increase in \(X\)
Same ideas apply for log-binomial regression as for logistic when you have survey data
The trick for estimation is that you can use Poisson regression to obtain the estimates (even though the model is log-binomial)
Note that in the infinite population world, you must use robust standard errors (also called “sandwich estimators”) for proper variance estimation if you use Poisson regression to fit a log-binomial model (otherwise too small)
But in the finite population world (survey data), we are not relying on a distributional assumption, and variances derived from linearization are fine (using Poisson)
1 - level Cluster Sampling design
With (38) clusters.
svydesign(id = ~dnum, weight = ~wt, fpc = ~N, data = samp)
Call: svyglm(formula = highapi ~ I(1 - highell), design = des, family = poisson)
Coefficients:
(Intercept) I(1 - highell)
-1.796 1.383
Degrees of Freedom: 64 Total (i.e. Null); 36 Residual
Null Deviance: 47.28
Residual Deviance: 37.07 AIC: 96.3
PR = \(e^{1.383}\) = 3.99
Logistic Regression and Log-Binomial Regression with NHANES