Probability Proportional
to Size Sampling

PUBHBIO 7225 Lecture 14

Outline

Topics

Sampling with unequal probabilities:
Probability Proportional to Size (PPS)
- PPS with replacement
- PPS without replacement

Activities

14.1 PPS Sampling

Readings

Probability Proportional to Size (PPS) Sampling, Encyclopedia of Survey Research Methods, 2008 (editor: P Lavrakas) (PDF on Carmen)

Assignments

Problem Set 4 due Thursday 10/23/2025 11:59pm via Carmen
(this is after fall break)

So Far: SRS at Each Stage

One-stage designs
- Sample \(n\) of \(N\) PSUs (sample = \(\mathcal{S}\))
- \(\pi_i = \frac{n}{N}\) for PSU \(i\) → \(w_i = \frac{N}{n}\)
- Unbiased estimate of total \(t\): \(\hat{t} = \sum_{i \in \mathcal{S}} w_i t_i\)
Two-stage designs
- Stage 1: Sample \(n\) of \(N\) PSUs (sample = \(\mathcal{S}\))
- Stage 2: Sample \(m_i\) of \(M_i\) SSUs for sampled PSU \(i\) (sample = \(\mathcal{S}_i\))
- \(\pi_{ij} = \frac{n}{N}\frac{m_i}{M_i}\) for SSU \(j\) in PSU \(i\) → \(w_{ij} = \frac{N}{n}\frac{M_i}{m_i}\)
- Unbiased estimate of total \(t\): \(\hat{t}= \sum_{i \in \mathcal{S}} \sum_{j \in \mathcal{S}_i} w_{ij} y_{ij}\)

These designs do not take into account that totals \((t_i)\) are often correlated with the size of PSUs \((M_i)\)

We saw that ratio estimation can improve estimation of \(\bar{y}_U\) – but can we improve on estimation of \(t\)?

Time Machine: Back to Lecture 3 (Sampling Clinics)

There are \(N=5\) clinics in a town and you want to estimate the average number of patients seen in a day (in the town)
The population of clinics is:

Clinic ID	Number of Doctors	Number of Patients Seen in a Day (\(y\))
1	2	10
2	3	20
3	4	50
4	5	60
5	10	150

True mean number of patients: \(\bar{y}_U = 58\)
The number of patients seen is correlated with the number of doctors (makes sense!).
We used this auxiliary data to create a more efficient sampling scheme than an SRS
That is the idea behind PPS sampling!

Sampling Clinics (con’t)

Design 1: SRS of \(n=2\) clinics (for each sampled clinic: \(\pi_i = n/N = 2/5\))

Clinics Sampled	Selection Probability	Weighted Mean
1,2	1/10	15
1,3	1/10	30
1,4	1/10	35
1,5	1/10	80
2,3	1/10	35
2,4	1/10	40
2,5	1/10	85
3,4	1/10	55
3,5	1/10	100
4,5	1/10	105

\(E(\bar{y})=58\) (unbiased)
\(V(\bar{y})=921\)

Design 2: Select largest clinic with certainty \((\pi_i = 1)\), one of the others with equal probability \((\pi_i = 1/4)\)

Clinics Sampled	Selection Probability	Weighted Mean
1,5	1/4	38
2,5	1/4	46
3,4	1/4	70
4,5	1/4	78

\(E(\hat{\theta})=58\) (unbiased)
\(V(\hat{\theta})=272\) \(\rightarrow\) a whole lot smaller than SRS!

Example: Sampling Schools

Consider Activity 13.1, where you sampled students within schools using a 2-stage cluster sampling scheme
Population: \(N=30\) schools with a total of \(M_0 = \sum_{i = 1}^N M_i = 14{,}989\) students
Assume it is inexpensive to measure all students in a school, so use 1-stage cluster sampling

\(N=30\) schools
\(M_i\) = number of students per school = size variable
\(y_{ij}=\) indicator for if \(j\)th student in \(i\)th school is female
\(t_i= \sum_{j=1}^{M_i} y_{ij}\) = total number of females in school \(i\)

School	# Students \((M_i)\)	# Females \((t_i)\)
1	643	324
2	492	249
3	574	284
4	324	163
⋮	⋮	⋮

Let’s consider different ways we can sample \(n=2\) schools from this population to estimate \(t\), the total number of females in the population

Tangent: “Pseudo-PSUs”

Many large-scale surveys have complex designs, with stratification and multi-stage clustering
- Use lots of strata to increase precision
- Often, so highly stratified that each stratum contains a small number of PSUs

But, we need to have at least 2 PSUs per stratum to estimate variances

Note that despite this, sometimes only 1 PSU is sampled in highly stratified designs
- In these cases, the sampled PSUs are divided up into pseudo-PSUs for the purposes of variance estimation
- You use these pseudo-PSUs just like “real” PSUs
- You often see this when using publicly available survey data
  - “masked variance PSUs”
  - “masked variance units”

Note: Sampling only 2 PSUs is actually pretty common!

Sampling Design A: SRS

Design A: SRS (without replacement) of \(n=2\) from the \(N=30\) schools

I took an SRS and obtained this sample:

School	Number of Students \((M_i)\)	Number of Females \((t_i)\)
9	473	237
22	415	204

To estimate the total number of females:
- \(\pi_i = n/N = 2/30 = 1/15\)
- \(w_i = 1/\pi_i = N/n = 15\)
- \(\hat{t}= \sum_{i \in \mathcal{S}} w_i t_i = (15 \times 237) + (15 \times 204) = 6615\)

Note: Could also calculate as: \(\hat{t}= N \bar{t}= 30 \times (237 + 204)/2 = 30 \times 220.5 = 6615\)

SRS (con’t)

By properties of SRS estimators (think way back!):
- \(\displaystyle \widehat{V}(\hat{t})=N^2 \left(1-\frac{n}{N}\right)\frac{s^2}{n}\)
In my sample:
- \(s^2 = \frac{1}{n-1} \sum_{i \in \mathcal{S}} (t_i - \bar{t})^2 = \frac{1}{2-1} \left[(237 - 220.5)^2 + (204 - 220.5)^2 \right] = 544.5\)
- \(\displaystyle \widehat{V}(\hat{t})=N^2 \left(1-\frac{n}{N}\right)\frac{s^2}{n} = 30^2 \left(1 - \frac{2}{30} \right) \frac{544.5}{22} = 228690\)
- \(\displaystyle \widehat{SE}(\hat{t}) = \sqrt{228690} = 478.2\)

des.srs <- svydesign(id = ~1, weight = ~wt, fpc = ~N, data=samp_srs)
svytotal(~totfemale, design = des.srs)

          total     SE
totfemale  6615 478.22

Probability Proportional to Size (PPS) Sampling

Think about the efficiency gain in the clinics example – always sampling the much larger clinic resulted in a much smaller variance – generalizing this idea produces PPS sampling

Probability Proportional to Size (PPS) sampling = sample selection methods in which the probability of selection for a sampling unit is directly proportional to a size measure, \(X_i\), which is known for all units before sampling (with “larger” units more likely to be selected)

We will use throughout this lecture \(X_i = M_i =\) number of SSUs in the PSU
- I.e., \(X_i\) = the “size” of the PSU = \(M_i\)
But, other \(X_i\) are possible
- Example: Sampling businesses using \(X_i\) = last year’s revenue as reported to IRS
Note that we want to choose an \(X_i\) that is:
- Proportional to the \(y_i\) that we are trying to estimate
- Strictly positive \((>0)\)

Sampling Design B: PPS With Replacement

PPSWR = Simplest PPS scheme, where inclusion probabilities are proportional to \(X_i\) and a PSU can be sampled more than once (with replacement)

Conduct \(n\) independent draws of a single PSU from the population
The probability of selecting PSU \(i\) on each draw is proportional to \(X_i\)
A PSU can be sampled more than once (with replacement)
Probability PSU \(i\) selected on any one draw = (“psi”) = \(\displaystyle \psi_i = \frac{X_i}{\sum_{i=1}^N X_i}\)

Note the difference:

\(\psi_i\) = probability PSU \(i\) is selected on a single draw
\(\pi_i\) = probability PSU \(i\) is in the sample (inclusion probability) = \(2 \psi_i - \psi_i^2\) (via addition rule for probabilities)
\(\psi_i \ne \pi_i\)

PPSWR (con’t)

For the population of schools:

School	\(M_i\)	\(\psi_i = M_i/(\sum_{i=1}^N M_i)\)
1	643	643/14989 = 0.042898125
2	492	492/14989 = 0.032824071
3	574	574/14989 = 0.038294749
4	324	324/14989 = 0.021615852
5	493	493/14989 = 0.032890787
⋮	⋮	⋮
Sum:	14989	1

The probability of selection on a single draw, \(\psi_i\), is larger for the larger schools

PPSWR Algorithms

To obtain a PPS With Replacement, many possible sampling algorithms:
- Cumulative Size Method, Cumulative \(\psi_i\) Method, Lahiri’s Rejective Method, others…
Easiest, conceptually, is the Cumulative Size Method:
1. Create a list of the PSUs with their sizes, \(M_i\)
2. Calculate the cumulative sums of \(M_i\) for each PSU
  - PSU 1: cumulative sum = \(M_1\)
  - PSU 2: cumulative sum = \(M_1+M_2\)
  - PSU 3: cumulative sum = \(M_1+M_2+M_3\)
  - etc.
3. Randomly select \(n\) integers between 1 and \(M_0 = \sum_{i=1}^N M_i\)
4. Sample the PSUs whose cumulative sum contains these \(n\) integers
  - Select PSU 1 if integer between \(1\) and \(M_1\)
  - Select PSU 2 if integer between \(M_1+1\) and \(M_1+M_2\)
  - Select PSU 3 if integer between \(M_1+M_2+1\) and \(M_1+M_2+M_3\)
  - etc.

Example of Cumulative Size Method

For the population of schools:

School	\(M_i\)	Cumulative Size	Cumulative Size (Values)	Range
1	643	\(M_1\)	643	1 to 643
2	492	\(M_1+M_2\)	643 + 492 = 1153	644 to 1153
3	574	\(M_1+M_2+M_3\)	643 + 492 + 574 = 1709	1154 to 1709
4	324	\(M_1+M_2+M_3+M_4\)	643 + 492 + 574 + 324 = 2033	1710 to 2033
⋮	⋮	⋮	⋮	⋮
Sum:	14989

Randomly select \(n=2\) integers from 1 to 14989
- My integers: 3572, 11002
- These correspond to:
  - PSU 8: range 3376 to 3875
  - PSU 23: range 10935 to 11624
- So my sample is PSUs 8 & 23
Note: a PSU could be selected twice!

PPSWR: Estimating the Population Total

Steps

Estimate the total based on each draw – i.e., from each sample of size 1
- Estimated total based on one PSU = \(\frac{\text{total for that PSU}}{\text{probability of selection on 1 draw}} = \frac{t_i}{\psi_i}\)
Average these \(n\) estimated totals to produce unbiased estimator of the total = \(\hat{t}_{wr} = \frac{1}{n}\sum_{i \in \mathcal{S}} \frac{t_i}{\psi_i}\)

Example

My PPSWR sample of \(n=2\) schools:

School	\(M_i\)	\(\psi_i = M_i/(\sum_{i=1}^N M_i)\)	\(t_i\)	\(\frac{t_i}{\psi_i}\)
8	500	\(\frac{500}{14989} = 0.0333578\)	251	\(\frac{251}{0.0333578} = 7524.477\)
23	690	\(\frac{690}{14989} = 0.04603376\)	343	\(\frac{343}{0.04603376} = 7451.0533\)

Estimated total = \(\hat{t}_{wr} = \frac{1}{n}\sum_{i \in \mathcal{S}} \frac{t_i}{\psi_i} = (7524.5 + 7451.1)/2 = 7487.8\)

Note: If a PSU is selected into the sample more than once, you’ll include its estimate multiple times

PPSWR: Estimating the Variance of the Population Total

Estimated variance of \(\hat{t}_{wr}\): \[\widehat{V}(\hat{t}_{wr}) = \frac{1}{n} \frac{1}{n-1} \sum_{i \in \mathcal{S}} \left( \frac{t_i}{\psi_i} - \hat{t}_{wr} \right)^2 = \frac{s^2}{n}\] where \(s^2\) = variance of the \(n\) individual estimates of the total (the \(\frac{t_i}{\psi_i}\))

My PPSWR sample of \(n=2\) schools:

School	\(\psi_i\)	\(t_i\)	\(\frac{t_i}{\psi_i}\)
8	0.0333578	251	7524.477
23	0.04603376	343	7451.0533
average:			7487.8

\(\widehat{V}(\hat{t}_{wr}) = \frac{1}{n} \frac{1}{n-1} \sum_{i \in \mathcal{S}} \left( \frac{t_i}{\psi_i} - \hat{t}_{wr} \right)^2 = \frac{1}{2} \frac{1}{(2-1)} \left[(7524.477-7487.8)^2 + (7451.0533-7487.8)^2 \right] = \frac{2695.5}{2} = 1347.6\)

\(\widehat{SE}(\hat{t}_{wr}) = \sqrt{1347.6} = 36.7\)

Sampling Design C: PPS Without Replacement

We know that for SRS, variance is reduced when sampling without replacement (compared to with replacement) – same holds for PPS!
But, now it is trickier to take a sample because the probability of selecting PSU \(i\) on the second draw is not the same as the probability of selecting PSU \(i\) on the first draw

Simple example illustrates this:
- \(P(\)select PSU 1 on 1st draw\() =\psi_1\), but:
  - \(P(\)select PSU 1 on 2nd draw \(|\) selected PSU 1 on 1st draw\()=0\)
  - \(P(\)select PSU 1 on 2nd draw \(|\) didn’t select PSU 1 on 1st draw\()>0\)

PPSWOR: Algorithms

Many (complex!) algorithms for selecting PPS samples, all require that:

You specify the inclusion probabilities, \(\pi_i = n \psi_i\)
Which imply the weights: \(w_i = \frac{1}{\pi_i} = \frac{1}{n \psi_i}\)

Note the implied restriction:
- Inclusion probabilities \((\pi_i)\) cannot be \(>\) 1
\[ \begin{split} \pi_i = n \psi_i &< 1 \\ \psi_i &< 1/n \\ \frac{X_i}{\sum_{i=1}^N X_i} &< 1/n \end{split} \]
No one PSU can be so big its \(X_i\) takes up more than \(1/n\) of the total of the \(X_i\)
If such a PSU exists, it should be selected with certainty (certainty unit)

PPSWOR: Estimating the Population Total

Horvitz-Thompson Estimator for the total under PPSWOR: \[\hat{t}_{HT} = \sum_{i \in \mathcal{S}} \frac{t_i}{\pi_i} = \sum_{i \in \mathcal{S}} \frac{t_i}{n \psi_i} = \sum_{i \in \mathcal{S}} w_i t_i\]

For the population of schools, I used Stata to take a PPSWOR sample of \(n=2\):

School	\(M_i\)	\(t_i\)	\(\psi_i\)	\(\pi_i = n \psi_i\)	\(w_i = \frac{1}{\pi_i}\)
14	531	267	\(\frac{531}{14989} = 0.035426\)	\(2*0.035426 = 0.070852\)	\(\frac{1}{0.070852} = 14.11\)
24	335	166	\(\frac{335}{14989} = 0.022350\)	\(2*0.022350 = 0.044699\)	\(\frac{1}{0.044699} = 22.37\)

My estimate of the total number of females: \[\hat{t}_{HT} = \sum_{i \in \mathcal{S}} w_i t_i = 14.11 \times 267 + 22.37 \times 166 = 7482\]

PPSWOR: Estimating the Variance of the Population Total

The variance of the PPSWOR estimator depends on two sets of probabilities:
1. Selection (Inclusion) probabilities: \(\pi_i = P(\)PSU \(i\) in sample\()\)
2. Pairwise Selection (Inclusion) probabilities: \(\pi_{ij}=P(\)PSU \(i\) and PSU \(j\) in sample\()\)
The \(\pi_{ij}\) are not just products of the \(\pi_i\)!

For \(n=2\):
- \(P(i\) first, \(j\) second\() = P(i\) first\()\times P(j\) second \(| i\) first\() = \psi_i \times \frac{\psi_j}{1-\psi_i}\)
- \(P(j\) first, \(i\) second\() = P(j\) first\()\times P(i\) second \(| j\) first\() = \psi_j \times \frac{\psi_i}{1-\psi_j}\)
- \(\pi_{ij} = P(i\) and \(j\) in sample\() = P(i\) first, \(j\) second\() + P(j\) first, \(i\) second\()\)
  \(\pi_{ij} = \psi_i \times \frac{\psi_j}{1-\psi_i}+ \psi_j \times \frac{\psi_i}{1-\psi_j}\)
Must have the \(\pi_{ij}\) to get an unbiased estimate of \(V(\hat{t}_{HT})\)

Complex!

PPSWOR: Estimating the Variance (con’t)

Theoretical variance of \(\hat{t}_{HT}\) is:

\[V(\hat{t}_{HT}) = \sum_{i=1}^N \frac{1- \pi_i}{\pi_i} t_i^2 + \sum_{i=1}^N \sum_{k=1, k\ne i}^N \frac{\pi_{ik}-\pi_i \pi_k}{\pi_i \pi_k}t_i t_k = \frac{1}{2} \sum_{i=1}^N \sum_{k=1, k\ne i}^N (\pi_i \pi_k - \pi_{ik}) \left(\frac{t_i}{\pi_i}-\frac{t_k}{\pi_k}\right)^2 \]

Horvitz-Thompson variance estimator (based on 1st expression) \[\widehat{V}_{HT}(\hat{t}_{HT}) = \sum_{i \in \mathcal{S}} (1-\pi_i)\frac{t_i^2}{\pi_i^2} + \sum_{i \in \mathcal{S}} \sum_{k \in \mathcal{S}, k \ne i} \frac{\pi_{ik}-\pi_i \pi_k}{\pi_{ik}} \frac{t_i}{\pi_i} \frac{t_k}{\pi_k}\]

Sen-Yates-Grundy variance estimator (based on 2nd expression) \[\widehat{V}_{SYG}(\hat{t}_{HT}) = \frac{1}{2}\sum_{i \in \mathcal{S}} \sum_{k \in \mathcal{S}, k \ne i} \frac{\pi_{ik}-\pi_i \pi_k}{\pi_{ik}} \left( \frac{t_i}{\pi_i} - \frac{t_k}{\pi_k} \right)^2\]

Though it’s the same theoretical variance, \(\widehat{V}_{HT}\) and \(\widehat{V}_{SYG}\) will not be the same

Problems With These Variance Estimators

Several problems with these variance estimators, \(\widehat{V}_{HT}(\hat{t}_{HT})\) and \(\widehat{V}_{SYG}(\hat{t}_{HT})\)
1. Publicly available datasets usually don’t have the joint inclusion probabilities included
  - Not practical! For a sample of size \(n\) there are \(n(n-1)\) pairwise inclusion probabilities!
2. For some designs the \(\pi_{ij}\) may be hard to calculate
3. \(\widehat{V}_{HT}(\hat{t}_{HT})\) and \(\widehat{V}_{SYG}(\hat{t}_{HT})\) can actually produce negative variance estimates for some designs!

For these reasons, we often use the with replacement variance estimate (the one on this slide) \[\widehat{V}_{WR}(\hat{t}_{HT}) = \frac{1}{n} \frac{1}{n-1} \sum_{i \in \mathcal{S}} \left( \frac{t_i}{\psi_i} - \hat{t}_{HT} \right)^2 = \frac{n}{n-1}\sum_{i \in \mathcal{S}} \left( \frac{t_i}{\pi_i} - \frac{\hat{t}_{HT}}{n} \right)^2\]
- Consequence: result is (usually slightly) conservative – i.e., variance too big
- Benefit: don’t need pairwise inclusion probabilities, and never get a negative variance!
- Convenience: \(\widehat{V}_{WR}(\hat{t}_{HT})\) is what is calculated in most software (including Stata and R)

Problems With These Variance Estimators (con’t)

For my PPSWOR sample that contains schools 14 and 24:
- Using with-replacement variance estimation
- Notice there is no FPC for PPS sampling

des.ppswor <- svydesign(id = ~1, weight = ~wt, data=samp_pps)
svytotal(~totfemale, des.ppswor)

           total     SE
totfemale 7482.1 54.728

\(\hat{t}_{HT} = 7482\) (same as the hand-calculation)
\(\widehat{SE}_{WR}(\hat{t}_{HT}) = 54.7\)
Compare to without-replacement variance estimation:
- \(\widehat{SE}_{HT}(\hat{t}_{HT}) = \text{N/A}\) (variance estimate is negative!)
- \(\widehat{SE}_{SYG}(\hat{t}_{HT}) = 53.1\) (slightly smaller than with-replacement estimate)

Activity 14.1 (Parts 1 & 2)

PPS Sampling (Parts 1 & 2)

Comparing the Designs

Two designs compared for sampling schools: (A) SRS, (B) PPSWOR
Both provide unbiased estimates of the true total
In terms of theoretical variances, if size variable is related to \(t\), \(V_{SRS} > V_{PPSWR} > V_{PPSWOR}\)
- In a single set of samples you might not see the gain (remember: variance is over repeated samples)
For the schools data, relationship between size and totals:

Scatterplot with number of students on x-axis and number of females on y-axis, showing a very linear relationship along an approximately 45 degree line.

Scatterplot with number of students on x-axis and number of smokers on y-axis, showing no clear pattern, a random scattering of points.

Let’s look at our estimates for total number of females and total number of smokers. What do we see?

Probability Proportional to Size Sampling