Logistic/Log-Binomial Regression
with Survey Data

PUBHBIO 7225 Lecture 21

Outline

Topics

  • Logistic regression with survey data
    (from a probability sample)
  • Log-Binomial regression with survey data
    (from a probability sample)

Activities

  • 21.1 Logistic Regression and Log-Binomial Regression with NHANES

Assignments

  • Problem Set 5 (last one!) due Thursday 11/13/2025 11:59pm via Carmen
    (this will be graded by Dr. Andridge, not peer reviewed)
  • Quiz 5 (last one!) due Thursday 11/20/2025 11:59pm via Carmen

Regression with Categorical Outcomes

  • In linear regression, the outcome (response variable) is at least approximately continuous

  • On surveys, many (if not most) outcomes of interest are categorical

    • Multiple choice questions abound!
  • Natural approaches are generalized linear models that extend linear regression to allow the outcome to be binary or ordinal

  • We will consider two such regression models, in the survey context:

    • Logistic regression

    • Log-binomial regression

  • I assume you have seen logistic regression before

  • You may not have seen log-binomial regression; that’s okay

Logistic Regression: Review

  • Outcome = binary \(Y\)

  • Predictor = continuous or discrete or binary \(X\) (or multiple \(X\)s)

Logistic regression model: \[\log \underbrace{\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right)}_{\text{odds}} = \mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\]

  • Logit link bounds the predicted probability \(0 \le P(Y=1|X) \le 1\)

    • \(P(Y=1|X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{\exp(\beta_0 + \beta_1 X)}{1+\exp(\beta_0 + \beta_1 X)}\)
  • Interpretation of unknown parameters \((\beta_0, \beta_1)\) comes via exponentiation (to get rid of the log)
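As a quick numeric check of the back-transformation, the inverse logit can be computed directly. The coefficients below are hypothetical, chosen only for illustration (Python used here just for the arithmetic):

```python
import math

# Hypothetical coefficients for illustration (not from the lecture example)
b0, b1 = -1.0, 0.5

def inv_logit(eta):
    """Back-transform a linear predictor to a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Predicted probability at x = 1; always strictly between 0 and 1
p = inv_logit(b0 + b1 * 1.0)
```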

Logistic Regression: Review (con’t)

  • Model: \(\mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\)

  • Interpreting the “slope”: \(\beta_1\)

    • \(\beta_1\) = log-odds ratio for a 1 unit increase in \(X\)

    • \(e^{\beta_1}\) = odds ratio for a 1 unit increase in \(X\)

    • \(X\) binary:

      • The odds of \(Y=1\) for the group with \(X=1\) are \(e^{\beta_1}\) times the odds for the group with \(X=0\)

      • Similar interpretation if \(X\) is an indicator variable for one category of a multi-category categorical variable – compares odds for that category to the reference category

    • \(X\) continuous/discrete:

      • The odds of \(Y=1\) multiply by \(e^{\beta_1}\) per one-unit increase in \(X\)
  • Can sound awkward to talk about “odds”

  • But… do not substitute “risk” or “more likely”!

    • Odds and risk are different things!
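To make the slope interpretation concrete: with a hypothetical \(\beta_1\), the odds at \(x+1\) divided by the odds at \(x\) equal \(e^{\beta_1}\) no matter which \(x\) you start from (Python used just for the arithmetic):

```python
import math

# Hypothetical coefficients for illustration
b0, b1 = -1.0, 0.7

def odds(x):
    """Model odds of Y=1 at x: exp(b0 + b1*x)."""
    return math.exp(b0 + b1 * x)

# Any one-unit increase gives the same multiplicative change in the odds
ratio = odds(2.3) / odds(1.3)
```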

Logistic Regression: Review (con’t)

  • Model: \(\mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\)

  • Interpreting the intercept: \(\beta_0\)

    • \(\beta_0\) = intercept = log-odds that \(Y=1\) when \(X=0\)

    • \(e^{\beta_0}\) = odds that \(Y=1\) when \(X=0\)

    • Not super-useful interpretations!

  • Easier to re-write as: \(\displaystyle P(Y=1|X=0) = \frac{e^{\beta_0}}{1+e^{\beta_0}}\)

  • The probability of \(Y=1\) when \(X=0\) is \(\frac{e^{\beta_0}}{1+e^{\beta_0}}\)

  • These interpretations extend to models with more than one predictor… interpret as usual, “controlling for” or “adjusting for” or “holding constant” the other predictors

Logistic Regression with Survey Data

  • With survey data (from probability sample), similar scenario to linear regression

    • Basically everything we discussed with linear regression also applies to logistic regression
  • Complex sampling design will affect point estimates and standard errors

    • Especially if the design is non-ignorable (a.k.a., informative) with respect to the outcome \(y\)
  • Remember: logistic regression with one binary predictor is essentially equivalent to a chi-square test

    • Same ideas apply about needing an adjusted chi-square test (same arguments for needing survey-weighted logistic regression)

One Important Point About Logistic Regression

  • If logistic regression model is valid and contains an intercept, then the intercept is the only parameter estimate affected by a sampling design that depends on \(y\)
  • If you’ve taken epidemiology, this may sound familiar… case-control studies!

    • Example: Divide population into 2 strata, people with lung cancer (cases) and people without lung cancer (controls), and select sample from each stratum

      • Lung cancer is rare → probability of selection for cases is much larger than for controls → sampling weights are smaller for cases than controls

      • If primary interest is in estimating effects of age, smoking history, etc., on presence of cancer in a logistic regression, the unequal sampling does not matter

        • If the model is correctly specified, only difference between an unweighted and weighted analysis will be in the intercept

        • Would you ever use \(\hat{\beta}_0\) to estimate the prevalence of lung cancer in such a study? No!
          (But if you had the sampling weights you could!)

  • Of course, if there is clustering, it will impact standard errors, so it can’t be ignored

Details: Logistic Regression with Survey Data

  • How do we actually get the survey-weighted (“design-based”) estimates?

  • Unlike linear regression, no “closed form” equation for \(\hat{\beta}\)s

  • In the infinite population (model-based) world, use maximum likelihood (ML) estimation

    • Find the estimates (\(\hat{\beta}\)s) that maximize the probability (“likelihood”) that the observed data \((y_i,x_i)\) came from the model
  • In the finite population world, the true \(\beta\)s – denoted with Roman letters, \(B\) – are what you’d get if you calculated ML estimates using the full population

  • Specifically, solving this system of equations for \(B\): \[\sum_{i=1}^N x_{ij} \left[y_i - \frac{\exp(\mathbf x_i^T \mathbf B)}{1+\exp(\mathbf x_i^T \mathbf B)} \right] = 0 \quad \text{for } j=1,\dots,p\] produces the true population quantities (\(\mathbf B\))

Details (con’t)

  • To obtain estimates we solve the system of equations using the sample and the sampling weights: \[\sum_{i \in \mathcal{S}} w_i x_{ij} \left[y_i - \frac{\exp(\mathbf x_i^T \hat{\mathbf B})}{1+\exp(\mathbf x_i^T \hat{\mathbf B})} \right] = 0 \quad \text{for } j=1,\dots,p\]

  • Variance estimation can be done via linearization


  • Logistic regression gives us odds ratios

  • Most surveys are cross-sectional, and we can alternatively calculate prevalence ratios
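The weighted estimating equations above have no closed form, so software solves them iteratively. Below is a minimal sketch of the idea in Python, using hypothetical toy data and Newton-Raphson for the two-parameter case; this illustrates the estimating equations, not how svyglm is implemented internally:

```python
import math

# Hypothetical toy sample: predictor x, binary outcome y, sampling weights w
x = [0.0, 1.0, 2.0, 3.0, 0.5, 1.5, 2.5, 3.5]
y = [0,   0,   0,   1,   0,   1,   1,   1]
w = [1.0, 2.0, 1.5, 1.0, 2.0, 1.0, 1.5, 1.0]

def p_hat(b0, b1, xi):
    """Inverse logit: model probability that Y = 1 at xi."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

def score(b0, b1):
    """Weighted estimating equations: sum_i w_i x_ij (y_i - p_i), j = 0, 1."""
    s0 = sum(wi * (yi - p_hat(b0, b1, xi)) for xi, yi, wi in zip(x, y, w))
    s1 = sum(wi * xi * (yi - p_hat(b0, b1, xi)) for xi, yi, wi in zip(x, y, w))
    return s0, s1

def newton(b0=0.0, b1=0.0, iters=25):
    """Newton-Raphson: repeatedly solve H * delta = score for the update."""
    for _ in range(iters):
        s0, s1 = score(b0, b1)
        h00 = h01 = h11 = 0.0
        for xi, wi in zip(x, w):
            v = wi * p_hat(b0, b1, xi) * (1.0 - p_hat(b0, b1, xi))
            h00 += v           # weighted information matrix entries
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * s0 - h01 * s1) / det   # 2x2 linear solve by hand
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1

b0_hat, b1_hat = newton()
```

At convergence both weighted score equations are (numerically) zero, which is exactly the defining property of the design-based estimates.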

Example: Sampling Schools

  • Example: 2-stage cluster sample of schools (SSUs) within school districts (PSUs) from Lecture 13

    • Stage 1: \(n=38\) PSUs out of \(N=757\) (districts) – roughly 5% of total

      • Total of 246 SSUs (schools) across these 38 sampled PSUs (range: 1 to 36 per PSU)
    • Stage 2: \(m_i=2\) SSUs (schools) per PSU (district)

      • For districts with only 1 school, select the 1 school (i.e., \(m_i=M_i=1\))
    • Result is a total of 65 SSUs

  • Interested in two binary measures:

    • High percent of English language learners (>25% ELL)
    • High API (performance) score (>700)

Example (con’t)

# Setting design ignoring second stage sampling ("ultimate PSU analysis")
des <- svydesign(id = ~dnum, weight = ~wt, fpc = ~N, data=samp)
# Estimating the population percentage of schools with high API
svymean(~highapi, design=des)
          mean     SE
highapi 0.4248 0.1171
# Estimating the population percentage of schools with high ELL
svymean(~highell, design=des)
           mean     SE
highell 0.47764 0.1124
  • Based on the sample, we estimate that:

    • 42.5% of schools have a high API score (API score >700)

    • 47.8% of schools have a high percent of English language learners (>25% ELL)

  • Let’s see if having high API is associated with the percentage of ELL

Example (con’t)

Crosstabulation

svyby(~highapi, ~highell, design=des, svymean)
  highell   highapi        se
0       0 0.6614786 0.1115408
1       1 0.1659574 0.1196093
  • 66.1% of schools without a high percentage of ELL have high API
  • 16.6% of schools with a high percentage of ELL have high API

Chi-square test

svychisq(~ highapi + highell, design=des)

    Pearson's X^2: Rao & Scott adjustment

data:  svychisq(~highapi + highell, design = des)
F = 7.9016, ndf = 1, ddf = 37, p-value = 0.007852
  • p-value = 0.0079

Example (con’t)

  • Comparing schools without high percent ELL to schools with high percent ELL:

    • Odds ratio = OR = \(\displaystyle \frac{.6615/(1-.6615)}{.1660/(1-.1660)} = \frac{1.954}{.1990} = 9.82\)
    • Prevalence ratio = PR = \(\displaystyle \frac{.6615}{.1660} = 3.98\)
  • Interpretations:

    • The odds of having high API for schools with a low percent of ELL are 9.8 times the odds for schools with a high percent of ELL.

    • Schools with a low percent of ELL are 4 times as likely to have high API as schools with a high percent of ELL.

  • Odds ratio feels “exaggerated” – hard to interpret “odds”

  • Prevalence ratio “easier” (more intuitive)
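Both summary measures can be checked directly from the weighted cell estimates in the crosstabulation (Python used just for the arithmetic):

```python
# Weighted cell estimates from the svyby output
p_low_ell  = 0.6615   # P(high API | not high ELL)
p_high_ell = 0.1660   # P(high API | high ELL)

odds_ratio = (p_low_ell / (1 - p_low_ell)) / (p_high_ell / (1 - p_high_ell))
prev_ratio = p_low_ell / p_high_ell
```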

Example (con’t)

  • A survey-weighted logistic regression with high ELL as the only predictor produces same odds ratio:
svyglm(highapi ~ I(1-highell), family = binomial, design=des)
1 - level Cluster Sampling design
With (38) clusters.
svydesign(id = ~dnum, weight = ~wt, fpc = ~N, data = samp)

Call:  svyglm(formula = highapi ~ I(1 - highell), design = des, family = binomial)

Coefficients:
   (Intercept)  I(1 - highell)  
        -1.615           2.284  

Degrees of Freedom: 64 Total (i.e. Null);  36 Residual
Null Deviance:      88.63 
Residual Deviance: 71.37    AIC: 66.96
  • OR = \(e^{2.284}\) = 9.82

  • We can also perform a regression analysis that gives us prevalence ratios (PRs) instead of odds ratios
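As a check, back-transforming the fitted coefficients reproduces both the odds ratio and the weighted cell proportions from the crosstabulation (Python used just for the arithmetic):

```python
import math

# Fitted coefficients from the svyglm output
b0, b1 = -1.615, 2.284

# Odds ratio for I(1 - highell)
or_hat = math.exp(b1)

# Back-transformed fitted probabilities match the weighted cell means
p_ref  = math.exp(b0) / (1 + math.exp(b0))            # high-ELL group (x = 0)
p_comp = math.exp(b0 + b1) / (1 + math.exp(b0 + b1))  # low-ELL group (x = 1)
```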

Log-Binomial Regression

  • In the logistic regression model, we have: \[\log \underbrace{\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right)}_{\text{odds}} = \mathrm{logit}\left(P(Y=1|X)\right) = \beta_0 + \beta_1 X\]

  • By using a different transformation of \(P(Y=1|X)\), we have log-binomial regression: \[\log \underbrace{\left(P(Y=1|X)\right)}_{\text{prevalence or risk}} = \beta_0 + \beta_1 X\]

  • Note that this different transformation (different link function) does not have as nice theoretical properties, in the sense that \(E(Y|X)\) is not bounded between 0 and 1 (what are the bounds?)

  • However, unless \(P(Y=1|X)\) is close to 0 or 1 it is generally well-behaved

  • Interpretation:

    • \(e^{\beta_1}\) = prevalence (of \(Y=1\)) ratio for a 1 unit increase in X
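A small numeric illustration of the log link, with hypothetical coefficients: the prevalence ratio per one-unit increase in \(X\) is exactly \(e^{\beta_1}\), but the back-transformed “probability” is only bounded below by 0 and can exceed 1 when the linear predictor is positive:

```python
import math

# Hypothetical coefficients for illustration
b0, b1 = -2.0, 0.5

def prevalence(x):
    """Log link back-transform: P(Y=1|x) = exp(b0 + b1*x)."""
    return math.exp(b0 + b1 * x)

pr_per_unit = prevalence(1.0) / prevalence(0.0)   # always exp(b1)
p_large_x = prevalence(5.0)                       # exceeds 1 here
```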

Log-Binomial Regression with Survey Data

  • Same ideas apply for log-binomial regression as for logistic when you have survey data

  • Trick for estimation is that you can use Poisson regression to obtain estimates (even though it is a log-binomial model)

    • Note that in the infinite population world, you must use robust standard errors (also called “sandwich estimators”) for proper variance estimation if you use Poisson regression to fit a log-binomial model (otherwise too small)

    • But in the finite population world (survey data), we are not relying on a distributional assumption, and variances derived from linearization are fine (using Poisson)

  • Note: You may have used log-binomial regression before to estimate risk ratios – this is the same method
    • Since survey data most often cross-sectional, we prefer to not use “risk” and instead use “prevalence”

Example (con’t)

  • Survey-weighted log-binomial regression (fit using Poisson “trick”)
svyglm(highapi ~ I(1-highell), family = poisson, design=des)
1 - level Cluster Sampling design
With (38) clusters.
svydesign(id = ~dnum, weight = ~wt, fpc = ~N, data = samp)

Call:  svyglm(formula = highapi ~ I(1 - highell), design = des, family = poisson)

Coefficients:
   (Intercept)  I(1 - highell)  
        -1.796           1.383  

Degrees of Freedom: 64 Total (i.e. Null);  36 Residual
Null Deviance:      47.28 
Residual Deviance: 37.07    AIC: 96.3
  • PR = \(e^{1.383}\) = 3.99
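As with the logistic fit, back-transforming these coefficients recovers the prevalence ratio and the weighted cell proportions from the crosstabulation (Python used just for the arithmetic):

```python
import math

# Fitted coefficients from the Poisson-trick log-binomial fit
b0, b1 = -1.796, 1.383

pr_hat = math.exp(b1)        # prevalence ratio for I(1 - highell)
p_ref  = math.exp(b0)        # fitted prevalence, high-ELL group (x = 0)
p_comp = math.exp(b0 + b1)   # fitted prevalence, low-ELL group (x = 1)
```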

Activity 21.1

Logistic Regression and Log-Binomial Regression with NHANES