Nonresponse: Intro to Missing Data

PUBHBIO 7225 Lecture 17

Generative AI acknowledgment: Google Gemini was used to generate alt text for some images

Outline

Topics

Missing Data
- Nonresponse Bias
- Preventing Nonresponse
- Unit and Item Nonresponse
- Taxonomy of Missing Data

Activities

17.1 The Impact of Missing Data

Readings

Andridge RR and Lemeshow S (2017). Nonresponse in Sample Surveys, In Wiley StatsRef: Statistics Reference Online (PDF on Carmen)

Assignments

Peer Evaluation of Problem Set 4 due Tuesday 10/28/25 11:59pm via Carmen
Quiz 4 due Thursday 10/30/2025 11:59pm via Carmen

The Problem of Nonresponse

To this point we have assumed every sampled unit responded to the survey
When we sample “things” that aren’t people, this might be reality
- E.g., sampling to determine percentage of beds using netting in rural Gambia
- Presumably the surveyors were able to (correctly) count beds and beds with netting for all sampled compounds (within villages, within districts)
But, when sampling people there is usually (always?) some fraction of the sample that fails to respond

Two main problems caused by nonresponse:
- Reduction in power / loss of efficiency
- Potential for bias
For surveys, in general bias is the major concern

Total Survey Error Diagram

Nonresponse is only one part of the TSE framework…

A flowchart diagram of the Total Survey Error (TSE) framework. The diagram is split into two main vertical columns, "Measurement" on the left and "Representation" on the right. In the Measurement column, the flow goes from "Construct" to "Measurement" to "Response" to "Edited Data." Red arrows pointing to the right of each step label the sources of error: "Specification Error (Validity)" for the first step, "Measurement Error" for the second, and "Processing Error" for the third. The Representation column flows from "Target Population" to "Sampling Frame" to "Sample" to "Respondents" to "Postsurvey Adjustments." Similarly, red arrows label the errors at each step: "Coverage Error," "Sampling Error," "Nonresponse Error," and "Adjustment Error." Both columns converge at the bottom, with arrows from "Edited Data" and "Postsurvey Adjustments" pointing to a final box labeled "Survey Statistic."

Adapted from Groves et al. (2009), Survey Methodology, 2nd Edition

…but it is arguably the biggest source of error, or at least one of the biggest

How Nonresponse Causes Bias

Imagine that the target population can be partitioned into two groups: respondents and nonrespondents

Group	Population Size	Population Mean	Population Variance
Respondents	\(N_R\)	\(\bar{y}_{RU}\)	\(S^2_R\)
Non-Respondents	\(N_{NR}\)	\(\bar{y}_{NRU}\)	\(S^2_{NR}\)
Total	\(N\)	\(\bar{y}_U\)	\(S^2\)

The overall population mean is a weighted average of the two group means: \[\bar{y}_U = \frac{N_R}{N} \bar{y}_{RU} + \frac{N_{NR}}{N} \bar{y}_{NRU}\]

Now suppose we take an SRS of size \(n\) (from the full population of \(N\) units), and \(n_R\) units respond
- Respondents are (by definition) all from the “Respondents” group
- Nonrespondents are from the “Nonrespondents” group

How Nonresponse Causes Bias (cont’d)

Group	Population Size	Population Mean	Population Variance
Respondents	\(N_R\)	\(\bar{y}_{RU}\)	\(S^2_R\)
Non-Respondents	\(N_{NR}\)	\(\bar{y}_{NRU}\)	\(S^2_{NR}\)
Total	\(N\)	\(\bar{y}_U\)	\(S^2\)

Sample size: \(n\)
Number of respondents: \(n_R\)
Sample mean based on respondents only: \(\bar{y}_R\)

\(E[\bar{y}_R] = \bar{y}_{RU}\)
- Respondent sample mean is an unbiased estimate of the respondent mean

\(E[\bar{y}_R] \ne \bar{y}_U\)
- Respondent sample mean is not an unbiased estimate of the overall mean

…unless the mean is the same for respondents and nonrespondents (i.e., unless \(\bar{y}_{RU} = \bar{y}_{NRU} = \bar{y}_U\))

How Nonresponse Causes Bias (cont’d)

The bias of the respondent mean is: \[\begin{aligned} \text{Bias}(\bar{y}_R) &= E(\bar{y}_R) - \bar{y}_U\\ &= \bar{y}_{RU} - \left[ \frac{N_R}{N} \bar{y}_{RU} + \frac{N_{NR}}{N} \bar{y}_{NRU} \right] = \textcolor{red}{\frac{N_{NR}}{N}} \textcolor{blue}{(\bar{y}_{RU} - \bar{y}_{NRU})} \end{aligned}\]
Bias in the respondent mean depends on nonresponse rate and difference between respondents and nonrespondents
We cannot ever actually know this bias – we don’t observe the nonrespondents (can’t estimate \(\bar{y}_{NRU}\))

But this expression tells us that nonresponse bias is small if:
1. Nonresponse rate is small (\(\frac{N_{NR}}{N}\) close to 0)
  or
2. Mean for nonrespondents is close to the mean for the respondents \((\bar{y}_{NRU} \approx \bar{y}_{RU})\)
We can’t ever know if (2) is true, so our best bet is to try to achieve (1)
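
To see how the two factors in the bias expression trade off, here is a minimal Python sketch. The function name `nonresponse_bias` and all population sizes and means are hypothetical, chosen only for illustration; in a real survey the nonrespondent mean is never observed, so this calculation is only possible for made-up numbers.

```python
def nonresponse_bias(n_resp, n_nonresp, mean_resp, mean_nonresp):
    """Bias of the respondent mean: (N_NR / N) * (ybar_RU - ybar_NRU)."""
    n_total = n_resp + n_nonresp
    return (n_nonresp / n_total) * (mean_resp - mean_nonresp)


# Hypothetical populations: (N_R, N_NR, ybar_RU, ybar_NRU)
scenarios = {
    "low nonresponse, big gap": (9500, 500, 0.30, 0.50),
    "high nonresponse, big gap": (5000, 5000, 0.30, 0.50),
    "high nonresponse, no gap": (5000, 5000, 0.30, 0.30),
}

for label, args in scenarios.items():
    print(f"{label}: bias = {nonresponse_bias(*args):+.3f}")
```

With the same gap between groups, the bias grows with the nonresponse rate, and it vanishes when the two group means coincide regardless of the rate.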

How Nonresponse Causes Bias (cont’d)

Some additional comments:
- Increasing sample size does not help – doesn’t guarantee \(\bar{y}_{NRU} \approx \bar{y}_{RU}\)
  - Might just get “more of the same”
- Bias can (and will) differ for different survey items (i.e., for different \(y\))
Best strategy is to try to avoid nonresponse altogether!

Key idea: If an outcome, \(y\), is “related” to nonresponse – i.e., it differs between respondents and nonrespondents – then an estimate based on respondents only will be biased (without additional adjustment)

Essentially, if there is nonresponse, the best-case scenario is that the respondents are a random sample of the full sample.
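
The point that a larger sample does not remove nonresponse bias can be checked with a small simulation. This is a sketch under stated assumptions: the population below is entirely hypothetical (6,000 units that always respond with mean 0.30, 4,000 units that never respond with mean 0.50), and the seed, sample sizes, and number of replicates are arbitrary.

```python
import random

random.seed(7225)  # arbitrary seed so the run is reproducible

# Hypothetical population of (responds?, y) pairs
population = (
    [(True, 1)] * 1800 + [(True, 0)] * 4200      # respondents: mean 0.30
    + [(False, 1)] * 2000 + [(False, 0)] * 2000  # nonrespondents: mean 0.50
)
true_mean = sum(y for _, y in population) / len(population)


def respondent_mean(n):
    """Draw an SRS of size n and average y over the units that respond."""
    sample = random.sample(population, n)
    y_resp = [y for responds, y in sample if responds]
    return sum(y_resp) / len(y_resp)


def average_estimate(n, reps=200):
    """Average the respondent mean over repeated samples of size n."""
    return sum(respondent_mean(n) for _ in range(reps)) / reps


results = {n: average_estimate(n) for n in (50, 500, 5000)}
for n, est in results.items():
    print(f"n = {n:>4}: average respondent mean = {est:.3f} "
          f"(true mean = {true_mean:.2f})")
```

At every sample size the average respondent mean stays near the respondent-group mean rather than the overall mean; a bigger sample only makes the estimate of the wrong quantity more precise.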

Activity 17.1 (Part 1)

The Impact of Missing Data (Part 1)

Nonresponse a Growing Problem

Response rates steadily declining for large Federally-sponsored surveys¹
For seven major surveys, 1995 to 2015 (Y-axis is 65% to 95%):

A line graph depicting the response rates for seven major surveys from 1995 to 2015. The y-axis shows response rates from 65% to 95%. While the CPS survey started with the highest response rate around 94% and remained high, all surveys show a general downward trend over time.

Nonresponse a Growing Problem (cont’d)

Recent response rates (2015-2025) for Bureau of Labor Statistics household surveys (chart source)

A line graph showing the monthly household survey response rates from April 2015 to April 2025. The CPS consistently has the highest response rates, generally staying above 70%, except for a sharp decline in 2020. The other surveys, including ATUS, CE-Diary, CE-Interview, CPI-Housing, and TPOPS, show similar trends with lower overall rates, all experiencing a sharp decline in 2020 and a gradual downward trend over the decade.

Note the COVID dip!

Nonresponse a Growing Problem (cont’d)

ACS (American Community Survey) = primary source of information about U.S. demographics and housing
Recent response rates for ACS (2000-2023) (chart data source)

A line graph depicting the American Community Survey Housing Unit response rates from 2000 to 2023. The y-axis ranges from 70% to 100%. The graph shows a general decline after around 2014, with a small dip in 2004, a larger dip in 2013, and a very large dip in 2020 to a low of just over 70%. The response rate recovers to around 85% in 2021 and stays there through 2023.

RR declined 12.8 percentage points between 2010 and 2023
Again, note the COVID dip!
Note: ACS estimates are often used as the “truth” for post-survey adjustments (e.g., post-stratification) for other surveys

What Are Some Factors That Could Be Causing This?

And What Can Be Done To Mitigate/Prevent It?

A Particular Growing Problem

Looking at the ACS reasons for nonresponse shows that refusal is a growing problem

A line graph depicting the American Community Survey Housing Unit percentage of refusals and percentage of no one at home from 2000 to 2023. The y-axis ranges from 0% to 10%. The graph shows a striking increase in percent of refusals around 2017, jumping up from around 2.5% to around 9% in 2023. The percentage of no one home is steady around 1% until around 2020 when there is a modest increase to around 2% in 2023.

Designing Surveys to Minimize Nonresponse

Many “design” considerations to maximize response rates:

Survey mode selection – some modes generally provide higher response rates; sensitive topics may provide better response with self-administered modes
Time survey is administered – some calling periods or seasons of the year may yield higher response rates (e.g., avoid holidays)
Good interviewer training – persistence, gentle conversion of reluctant persons to respondents, etc.
Careful “physical” design of mail and web surveys – “official” logos, ensure mail doesn’t look like junk mail/SPAM
Carefully designed questionnaire – poorly worded questions can lead to skipped questions; a mailed paper survey should be easy to navigate
Low respondent burden – try to make the survey short and “easy” to take, possibly using split-questionnaire design (not all sampled units get all questions)
Use of incentives – given either before completion (“pre-incentive”) or upon completion (“post-incentive” or “reward”)
Extensive follow-up – Might not get response on the first try, might consider alternate modes for follow-up (e.g., telephone follow-up to mailed survey)
Responsive Design – Adjust data collection process while it is ongoing, e.g., by reallocating effort to subgroups with higher noncontact or nonresponse

Two Types of Nonresponse in Surveys

Unit nonresponse = a sampled unit completely fails to respond (to entire survey)

Item nonresponse = a sampled unit responds but fails to answer a specific item

In general, we:
- Deal with unit nonresponse by making adjustments to the base survey weights
- Deal with item nonresponse with imputation (or, by ignoring it…)
For large-scale surveys, unit nonresponse is arguably the more severe problem
(declining response rates…)
Before we talk about how to handle the missing data, we need to talk about different types of missing data

Taxonomy of Missing Data

We classify missing data based on the underlying reason the data are missing
In general, not testable – just an assumption
Three types of missing data:
- MCAR: Missing Completely At Random 😀
- MAR: Missing At Random 😐
- MNAR: Missing Not At Random 😱
Defined by what factors are associated with the probability of missingness (or conversely, with the probability of response)

Notation

\(y_i\) = survey outcome of interest for unit \(i\), only observed for respondents
\(\mathbf{x}_i\) = vector of information known about unit \(i\), observed for ALL sampled units
- E.g., design information (stratum, cluster membership), or data available on the sampling frame
\(R_i\) = response indicator for unit \(i\) (=1 for respondents, =0 for nonrespondents)
\(\phi_i= P(R_i=1)\) = probability unit \(i\) responds = response propensity (unknown)

And we also have:

\(Z_i\) = selection indicator (=1 for sampled units, =0 for non-sampled units)
\(\pi_i = P(Z_i=1)\) = selection probability (known!)

MCAR: Missing Completely At Random

MCAR = Best-case scenario: \(P(R_i=1 | y_i, \mathbf{x}_i) = P(R_i=1)\)
Probability a variable has missing values does not depend on any factors/variables (including itself)

Examples:
- Coin flip to determine if respond
- Computer glitch that wipes out 10% of records
No systematic difference between respondents and nonrespondents
- Respondents are indistinguishable from nonrespondents
- \(\bar{y}_{RU} = \bar{y}_{NRU}\) → \(\text{Bias}(\bar{y}_R) = \frac{N_{NR}}{N} (\bar{y}_{RU} - \bar{y}_{NRU}) = \frac{N_{NR}}{N} \times 0 = 0\)
A complete case analysis implicitly makes this assumption!

Estimates of survey totals will be biased (literally not counting up enough people!)
Estimates of quantities like means or proportions will be unbiased
Standard errors will be larger than if no nonresponse
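
A quick simulation illustrates the statements about totals and means under MCAR. This is a sketch only: the population size, sample size, 80% response rate, and 30% trait prevalence are all hypothetical, and the total is computed with unadjusted base weights \(N/n\).

```python
import random

random.seed(17)  # arbitrary seed so the run is reproducible

N = 10_000                  # hypothetical population size
n = 1_000                   # hypothetical SRS size
base_weight = N / n         # each sampled unit represents N/n population units
y = [1] * 300 + [0] * 700   # hypothetical sample: 30% have the trait


def one_replicate(response_rate=0.8):
    """Apply MCAR nonresponse once and return (mean, total) from respondents."""
    # MCAR: every unit responds with the same probability, regardless of y
    resp = [yi for yi in y if random.random() < response_rate]
    mean_est = sum(resp) / len(resp)
    total_est = base_weight * sum(resp)  # base weights, no nonresponse adjustment
    return mean_est, total_est


reps = [one_replicate() for _ in range(500)]
avg_mean = sum(m for m, _ in reps) / len(reps)
avg_total = sum(t for _, t in reps) / len(reps)

print(f"full-sample mean  0.300 | average complete-case mean  {avg_mean:.3f}")
print(f"full-sample total 3000  | average complete-case total {avg_total:.0f}")
```

The complete-case mean stays close to the full-sample mean, while the unadjusted total falls short by roughly the nonresponse rate, because the respondents' base weights no longer add up to \(N\).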

MAR: Missing At Random

MAR = Next-best-case scenario: \(P(R_i=1 | y_i, \mathbf{x}_i) = P(R_i=1|\mathbf{x}_i)\)
Probability a variable has missing values depends only on observed data

Examples:
- Older adults more likely to respond than younger adults – and we observe all sampled units’ ages (i.e., it’s on the sampling frame)
- Unmarried moms less likely to respond than married moms – and we have marital status on the sampling frame
- People in one stratum are less likely to respond than in another stratum
Nonrespondents are different from respondents, but we can see how they are different
- Since we observe how the respondents and nonrespondents differ, we can adjust for it

Estimates for a \(y\) that is related to the variable(s) that differ between respondents and nonrespondents will be biased
But, we can adjust survey weights to “make up” the difference and remove bias (coming next lecture!)

Example: MAR

SRS of women who recently gave birth in Ohio
- Outcome of interest: Postpartum feelings of social isolation
- Nonresponse Problem: All married moms respond, but only 50% of unmarried moms respond
If married women are less likely to experience social isolation than unmarried women, our estimate of the prevalence of social isolation will be __________ (too low? too high?)
Suppose you sampled 100 married moms and 100 non-married moms.

In truth, 25% of the married moms feel isolated, and 50% of the non-married moms feel isolated:

TRUTH	Isolated	Not Isolated	Total
Married	25	75	100
Not Married	50	50	100

But only half the non-married moms respond (assume nonresponse unrelated to isolation status):

WITH NONRESPONSE	Isolated	Not Isolated	Total
Married
Not Married

What is the prevalence of isolation if the full sample had responded?
With nonresponse?

Example: MAR (cont’d)

WITH NONRESPONSE	Isolated	Not Isolated	Total
Married	25	75	100
Not Married	25	25	50

Observed prevalence of isolation = 0.333 = 33.3%
True prevalence = 0.375 = 37.5%

To “adjust” for this MAR mechanism, we can upweight the non-married moms
Essentially, we want to make the responding sample look like the full sample with respect to marital status
- Only half the unmarried moms responded (50 out of 100)
- Make each responding unmarried mom “represent” 2 unmarried moms
- Thus we double the weights of the unmarried moms to “adjust” for the nonresponse
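- Estimated prevalence is: \(\frac{25 + 2 \times 25}{100 + 2 \times 50} = \frac{75}{200} = 0.375\), which matches the true prevalence of 37.5%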

Note: this assumes that marital status is on the sampling frame – so that we know it for all sampled women – so we know whose weights to double!
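
The weighting adjustment in this example can be reproduced in a few lines of Python. This is a minimal sketch using the counts from the tables in the example; the dictionary layout and the function name `prevalence` are illustrative choices, not part of any survey package.

```python
# Counts from the example: respondents by marital status, plus the number sampled
groups = {
    "married":     {"isolated": 25, "responded": 100, "sampled": 100},
    "not_married": {"isolated": 25, "responded": 50,  "sampled": 100},
}


def prevalence(groups, adjust):
    """Prevalence of isolation among respondents, optionally weight-adjusted."""
    num = den = 0.0
    for g in groups.values():
        # Adjustment factor = sampled / responded within the marital-status group
        w = g["sampled"] / g["responded"] if adjust else 1.0
        num += w * g["isolated"]
        den += w * g["responded"]
    return num / den


unadjusted = prevalence(groups, adjust=False)
adjusted = prevalence(groups, adjust=True)
print(f"unadjusted: {unadjusted:.3f}")  # 50 / 150
print(f"adjusted:   {adjusted:.3f}")    # 75 / 200
```

The adjustment factor is 1 for married moms and 2 for unmarried moms, so the weighted respondents have the same marital-status split as the full sample.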

MNAR: Missing Not At Random (sometimes NMAR)

MNAR = Worst-case scenario: \(P(R_i=1 | y_i, \mathbf{x}_i) \ne P(R_i=1|\mathbf{x}_i)\), i.e., response depends on \(y_i\) even after conditioning on \(\mathbf{x}_i\)
Probability a variable has missing values depends on values of unobserved data

Examples:
- Depressed moms less likely to respond than non-depressed moms, but we don’t know depression status unless they respond
- Higher income earners less likely to report their income than low wage earners, but this information isn’t on the sampling frame
Nonrespondents are different from respondents, and we do not observe the characteristic(s) that make them different

Can never know if we have MNAR (not possible to test for it)
Inference under MNAR generally requires us to posit a model for the response mechanism where response depends on unobserved variables (like a survey outcome, \(y\)), which we can never confirm is correct
Usually we assume MAR, and (sometimes) assess how wrong our conclusions would be if the truth were MNAR – called a sensitivity analysis
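
One very simple way to picture a sensitivity analysis is to posit how far the nonrespondent mean might sit from the respondent mean and see how the overall estimate moves. The sketch below is illustrative only and is not a specific method from the lecture or reading: the shift parameter `delta`, the function name, the respondent prevalence, and the response rate are all hypothetical, and the sample response rate is used as a stand-in for \(N_R/N\).

```python
def sensitivity_estimate(resp_mean, response_rate, delta):
    """Overall mean implied if the nonrespondent mean were resp_mean + delta."""
    nonresp_mean = resp_mean + delta  # assumed, never observed
    return response_rate * resp_mean + (1 - response_rate) * nonresp_mean


resp_mean = 0.20      # hypothetical observed prevalence among respondents
response_rate = 0.60  # hypothetical response rate

for delta in (0.0, 0.05, 0.10, 0.20):
    est = sensitivity_estimate(resp_mean, response_rate, delta)
    print(f"delta = {delta:.2f}: implied overall prevalence = {est:.3f}")
```

Setting `delta = 0` reproduces the respondent-only estimate; scanning a plausible range of `delta` shows how strongly the conclusion depends on an assumption that cannot be checked from the data.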

Activity 17.1 (Part 2)

The Impact of Missing Data (Part 2)