Course Intro: Vocabulary of Sampling
PUBHBIO 7225 Lecture 1
Generative AI acknowledgment: MS Copilot was used to generate alt text for images
Outline
Topics
- Overview of the class (syllabus)
- Why sample?
- Vocabulary of sampling
- Probability vs Non-probability samples
- Types of probability samples
Activities
- 1.1 Vocabulary of Surveys
Assignments
- “About Me” questionnaire due Thursday 8/28/2025 11:59pm via Carmen
- Problem Set 1 due Thursday 9/4/2025 11:59pm via Carmen
Why do we sample?
Census – measurement taken on all units in the population
Sample – measurements taken on a subset of units in the population
Reasons we might want to sample:
Vocabulary of Survey Sampling
Observation Unit (or Element) - One (non-overlapping) unit on which a measurement is taken
Target Population - Complete collection of observation units that we want to study
Sample - A subset of a population (that will be/has been measured)
Sampled Population - All observation units that could possibly be in the sample (the population from which the sample was actually taken)
Sampling Unit - A unit that can be selected for a sample (Might be an observation unit, or might be a collection of observation units, e.g., a household)
Sampling Frame - A list of all sampling units (for some sampling designs this might not actually be available)
Example
Imagine a telephone survey of likely voters
- Observation unit = one person
- Target population = all people eligible to vote who are likely to vote
- Sampling unit = one household/one phone number (could be one person if cell phone frame)
- Sampling frame = list of residential phone numbers (or, list of cell phone numbers)
Mismatch between target population and sampled population certainly possible
![Venn diagram showing the overlap between the target population, sampling frame, and sampled population in a survey, highlighting exclusions due to ineligibility or nonresponse.]()
Figure 1.1 from Lohr (2010), Sampling Design and Analysis, 2nd edition
Activity 1.1 (Part 1)
Vocabulary of Survey Sampling (Part 1)
Probability vs Non-Probability Samples
Probability Sample – each unit in the population has a known, positive probability of selection, and randomness is involved in the selection of which units are actually included in the sample
Non-Probability Sample – the probability a unit is included in the sample cannot be calculated
Types of Non-Probability Samples
- Convenience sampling – units are selected into the sample simply because they are easy to select
- Voluntary/Self-selection sampling – units self-select into the sample
- Quota sampling – pre-specify the number of units you want to end up with in each subgroup, and select them via convenience or other non-probability method
- Most famous example: pre-election polls for 1948 election (Dewey vs Truman)
- Purposive/Judgement sampling – researcher selects units into the sample based on their own judgement
- Snowball/Respondent-Driven sampling – start by including one or more units in the (usually “hard-to-reach”) population, then use these units to identify further units
Example
Sampling OSU students with various non-probability schemes
- Convenience sampling – Recruit OSU students to take your survey by standing outside the library and asking students if they will participate
- Voluntary/Self-selection sampling – Put up a flyer with a URL for your survey, and interested students can go to the website and take the survey
- Quota sampling – Pre-specify that you’ll get 100 students from Ohio and 100 students from outside Ohio to take your survey, and stand outside the library to find these students
- Purposive/Judgement sampling – You pick students with specific characteristics that you are most interested in having in your sample
- Snowball/Respondent-Driven sampling – You identify a few students who are in your target population (e.g., use illicit drugs), and then have them pass along your survey information to other students they know who are in the population (e.g., other drug users)
(Big) Problem: Selection Bias
Selection bias = bias that arises when part of the target population is not in the sampled population, or, more generally, when some population units are sampled at a different rate than intended by the investigator
Predominant concern with non-probability samples
- Units (people) in the sample might be really different from non-sampled units – but we don’t know how
- Even if, say, the distribution of state of residence (Ohio, Not Ohio) is the same in your non-probability sample as in the target population, there is no guarantee that the students are “representative” of the target population with respect to survey outcomes/measures
- e.g., Ohioans who self-select into your sample might be different than Ohioans who do not
- Selection bias can also be a problem for probability samples when there is nonresponse
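A minimal simulation sketch of the self-selection problem described above. Everything here is hypothetical: the population of 10,000 students, the "weekly study hours" outcome, and the assumption that the chance of volunteering rises with the outcome are all made up for illustration.

```python
import random

rng = random.Random(7225)  # fixed seed so the run is reproducible

# Hypothetical population: 10,000 students, weekly study hours between 0 and 40.
population = [rng.uniform(0, 40) for _ in range(10_000)]
pop_mean = sum(population) / len(population)

# Assumed self-selection mechanism: the chance of volunteering rises with hours.
volunteers = [y for y in population if rng.random() < y / 40]
vol_mean = sum(volunteers) / len(volunteers)

# Probability sample of the same size for comparison (SRS without replacement).
srs = rng.sample(population, len(volunteers))
srs_mean = sum(srs) / len(srs)

print(f"population mean: {pop_mean:.1f}")
print(f"volunteer mean:  {vol_mean:.1f}")
print(f"SRS mean:        {srs_mean:.1f}")
```

Under this assumed mechanism the volunteer mean overstates the population mean, while the SRS mean lands close to it. In a real non-probability sample the selection mechanism is unknown, which is why the bias cannot simply be corrected.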
This course will focus exclusively on probability samples.
If you are interested in learning more about how non-probability samples can be combined with probability samples to improve inference, let me know.
Simple Random Sample (SRS)
SRS – every possible subset of \(n\) units in the population has the same chance of being in the sample
- Note: a necessary but not sufficient condition for an SRS is that every unit has equal probability of selection
Subtypes:
Simple random sample with replacement (SRSWR) – a unit could be selected into the sample more than once
- Pick a unit (each unit having equal probability), put it back, pick another unit
- Sometimes called unrestricted random sampling (URS)
Simple random sample without replacement (SRSWOR or just SRS) – a unit can only be selected into the sample once
- In general when survey samplers say “SRS” we mean SRSWOR
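The two subtypes can be sketched with Python's standard `random` module. The frame of unit IDs 1 to 100, the sample size of 20, and the helper function names are assumptions chosen for illustration.

```python
import random

def srs_without_replacement(frame, n, rng):
    """SRSWOR: every subset of n distinct units is equally likely."""
    return rng.sample(frame, n)

def srs_with_replacement(frame, n, rng):
    """SRSWR: n independent equal-probability draws; repeats are possible."""
    return rng.choices(frame, k=n)

rng = random.Random(7225)      # fixed seed so the draws are reproducible
frame = list(range(1, 101))    # hypothetical sampling frame: unit IDs 1..100

wor = srs_without_replacement(frame, 20, rng)
wr = srs_with_replacement(frame, 20, rng)

print(sorted(wor))
print(sorted(wr))
print("distinct units in the SRSWR draw:", len(set(wr)))
```

The SRSWOR draw always contains 20 distinct units, while the SRSWR draw may contain fewer distinct units because a unit can be picked more than once.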
Stratified Random Sample
Stratified Random Sample – divide population into \(H\) distinct subgroups called strata, and take SRS from each stratum, with each SRS taken independently
- Strata must be mutually exclusive (each unit only belongs to 1 stratum)
- Strata must be exhaustive (all units belong to a stratum)
- Strata are often subgroups of interest
Pros/Cons:
- Ensures (at least some) elements are selected from each stratum (subgroup)
- Can reduce variance in overall estimates (increase precision) relative to an SRS
- If differences among strata explain a significant proportion of the variability in the attribute of interest, gain in precision can be large
Allocation Method - how much of the sample comes from each stratum (we will talk about this in depth later in the course)
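A minimal sketch of a stratified random sample, assuming a hypothetical population of 100 unit IDs split into 4 strata of 25 units and an allocation of 5 units per stratum. The allocation is an assumption for illustration only, not a recommended method.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Take an independent SRS (without replacement) inside each stratum."""
    return {
        name: rng.sample(units, n_per_stratum[name])
        for name, units in strata.items()
    }

rng = random.Random(7225)

# Hypothetical population: unit IDs 1..100 split into 4 strata of 25 units each.
strata = {h: list(range(25 * (h - 1) + 1, 25 * h + 1)) for h in range(1, 5)}

# Assumed allocation: 5 units from every stratum (total n = 20).
allocation = {h: 5 for h in strata}

sample = stratified_sample(strata, allocation, rng)
for h, units in sample.items():
    print(h, sorted(units))
```

Because each stratum gets its own draw, every stratum is guaranteed to appear in the sample, which an SRS of 20 from all 100 units would not guarantee.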
Cluster Sample
Cluster Sample – individual units in the population are aggregated into larger sampling units called clusters, and a sample of clusters is selected
- Often used when you do not have a full list of individual units in the population, but you do have a list of all clusters in the population (e.g., list of all residences but not names of all residents)
Subtypes:
One-stage cluster sample – select all units from the selected clusters into the sample
Two-stage cluster sample – only select some units from the selected clusters into the sample
Pros/Cons:
- Can be less expensive to field compared to other designs (e.g., SRS)
- Can cause a decrease in precision relative to SRS, because units within a cluster tend to be similar
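The one-stage and two-stage subtypes can be sketched as follows. The population of 20 clusters with 5 units each, the numbers of clusters and units drawn, and the function names are assumptions for illustration; clusters are selected by SRS here, which is only one of several possible choices.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, rng):
    """Select clusters by SRS, then keep every unit in each selected cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: list(clusters[c]) for c in chosen}

def two_stage_cluster_sample(clusters, n_clusters, m_per_cluster, rng):
    """Select clusters by SRS, then take an SRS of units within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: rng.sample(clusters[c], m_per_cluster) for c in chosen}

rng = random.Random(7225)

# Hypothetical population: 20 clusters (e.g., households) of 5 units each.
clusters = {c: list(range(5 * (c - 1) + 1, 5 * c + 1)) for c in range(1, 21)}

one_stage = one_stage_cluster_sample(clusters, 4, rng)       # 4 clusters x 5 units
two_stage = two_stage_cluster_sample(clusters, 10, 2, rng)   # 10 clusters x 2 units

print(one_stage)
print(two_stage)
```

Both designs yield 20 units here, but the one-stage sample concentrates them in 4 clusters while the two-stage sample spreads them over 10.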
Systematic Sample
Systematic Sample – a random starting point is chosen from a list of all population units, and that unit and every \(k\)th unit after it on the list are selected.
Pros/Cons:
Easy to implement
If the list is randomly ordered, systematic sample will behave like an SRS and you can use analysis methods appropriate for an SRS
- A systematic sample is not an SRS, because all possible subsets of \(n\) units do not have the same probability of being sampled
- For example, if the 100th unit is chosen, the 101st unit cannot be chosen – so the probability of a sample that contains both of these units is 0 (but the probability of other samples is >0)
Systematic sampling is technically a form of cluster sampling
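A minimal sketch of systematic selection, assuming a hypothetical ordered list of 100 units and an interval of \(k = 5\) (so the list length is an exact multiple of \(k\)). It also enumerates every sample the design can produce, which shows both why this is not an SRS and why it can be viewed as selecting one cluster.

```python
import random

def systematic_sample(frame, k, rng):
    """Pick a random start among the first k units, then take every k-th unit."""
    start = rng.randrange(k)   # random start index 0..k-1
    return frame[start::k]

frame = list(range(1, 101))    # hypothetical ordered list of 100 units
k = 5                          # sampling interval, giving n = 100 / 5 = 20

sample = systematic_sample(frame, k, random.Random(7225))
print(sample)

# Only k distinct samples are possible, one per starting point. Each one acts
# like a "cluster" of units, and the design selects exactly one of them.
possible_samples = [frame[start::k] for start in range(k)]
print(len(possible_samples))   # 5

# Adjacent units never appear together, so that subset has probability 0.
print(any(99 in s and 100 in s for s in possible_samples))   # False
```

Every unit still has the same chance of selection (1 in 5), yet only 5 of the many possible subsets of 20 units can ever be drawn.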
Example paper: Pinterest Homemade Sunscreens: A Recipe for Sunburn (Merten et al., 2020)
Probability Proportional to Size (PPS)
Probability Proportional to Size - the probability a unit is selected into the sample is directly proportional to a size measure that is known for all units before sampling
Pros/Cons:
- Can reduce variance when units vary substantially in size
- Must know the “size” of all units before sampling
- E.g., sampling counties in the U.S. using number of people who live there as the “size”
- Often used as part of a complex design, not as the only design feature
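A minimal sketch of PPS selection. The county names and population counts are made up, and drawing with replacement is an assumption chosen to keep the example short; without-replacement PPS schemes are more involved.

```python
import random

def pps_selection_probabilities(sizes):
    """Per-draw selection probability for each unit: size / total size."""
    total = sum(sizes.values())
    return {unit: size / total for unit, size in sizes.items()}

def pps_with_replacement(sizes, n, rng):
    """Draw n units with replacement, each draw proportional to size."""
    units = list(sizes)
    weights = [sizes[u] for u in units]
    return rng.choices(units, weights=weights, k=n)

# Hypothetical counties with made-up population counts as the size measure.
sizes = {
    "County A": 500_000,
    "County B": 250_000,
    "County C": 200_000,
    "County D": 50_000,
}

probs = pps_selection_probabilities(sizes)
draws = pps_with_replacement(sizes, 2, random.Random(7225))

print(probs)
print(draws)
```

In this made-up example County A is ten times as likely to be picked on any draw as County D, because its size measure is ten times larger.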
Visualization
- Population of \(N = 100\) individual units, want sample of \(n = 20\)
- Strata: 4, each with 25 units
- Clusters: 20, each with 5 units
Combining Design Features
Complex Samples - surveys with more than one of these elements, e.g., stratified cluster sample, two-stage cluster sample
One specific feature you may encounter:
EPSEM sampling - Equal Probability of Selection Method
Not a specific sampling method; instead, this refers to using any sampling technique that results in each population unit having the same probability of selection
Also called a self-weighting sample
By definition SRS and Systematic sampling are EPSEM; other designs can be as well (stratified, cluster, multistage)
Historically EPSEM samples were preferred due to computing limitations; now not so important (and often we want to oversample certain subgroups)
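A small arithmetic sketch of what EPSEM means, using the population from the Visualization slide (\(N = 100\), \(n = 20\), 4 strata of 25 units, 20 clusters of 5 units). The specific allocations (5 units per stratum, 4 clusters drawn, interval of 5) are assumptions picked so that each design reaches \(n = 20\).

```python
from fractions import Fraction

N, n = 100, 20

# SRS without replacement: each unit is included with probability n / N.
p_srs = Fraction(n, N)

# Stratified: 4 strata of 25 units; assume 5 units are drawn per stratum.
p_stratified = Fraction(5, 25)

# One-stage cluster: 20 clusters of 5 units; assume 4 clusters drawn by SRS.
p_cluster = Fraction(4, 20)

# Systematic: interval k = N / n = 5 with a random start among the first 5 units.
p_systematic = Fraction(1, N // n)

for design, p in [("SRS", p_srs), ("stratified", p_stratified),
                  ("cluster", p_cluster), ("systematic", p_systematic)]:
    print(f"{design:>10}: {p}")

# Non-EPSEM contrast: oversample stratum 1 (8 units) and take 4 from each of
# the other three strata. The total is still 20, but probabilities differ.
p_oversampled = Fraction(8, 25)
p_other_strata = Fraction(4, 25)
print(p_oversampled, p_other_strata)
```

All four designs give every unit a selection probability of 1/5 under these assumptions, so each is EPSEM. The oversampling allocation is not, since units in the oversampled stratum are twice as likely to be selected as units elsewhere.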
Activity 1.1 (Part 2)
Vocabulary of Survey Sampling (Part 2)