Week 11

Sampling Distributions

Readings

For an (optional) deeper dive into the mathematics of probability theory:

Overview

Okay. So we have an estimate we computed from our data. Maybe we’ve even done enough theoretical legwork to believe that our estimate reflects a causal relationship. We’re feeling pretty good about ourselves.

But we’re not done yet, because that estimate we calculated is based on limited information about the world. Maybe our data only represents a sample from a much larger population. There’s no guarantee that the estimate we get from that sample is that same one we would get if we could observe everyone in the population. And even if we’ve collected data from the entire population, don’t forget about the Fundamental Problem of Causal Inference – we can never know with certainty what would have happened under some counterfactual treatment regime.1 So we need a way to express our uncertainty about the estimates we produced. How likely is it that our result was just caused by a weird sample?

This week, we discuss the fundamental concept underlying all of statistical inference2 – the sampling distribution.

Class Notes

# load the CES dataset
load('data/ces-2020/cleaned-CES.RData')

# recode TV news viewership as a 0-1 variable
ces <- mutate(ces, tv_news_24h = as.numeric(tv_news_24h == 'Yes'))

What percent of this population (CES respondents) watched TV news in the past 24 hours?

mean(ces$tv_news_24h)
[1] 0.6058689

Drawing a Sample

But suppose we just had a sample from this population, and we were trying to estimate that population mean? How weird can that sample get? And how far can a sample statistic get from the true value? To see, let’s play around with drawing samples of CES respondents.

# pull a sample
sample_data <- slice_sample(ces, n = 100)

# what percent of the sample watched TV news recently?
mean(sample_data$tv_news_24h)
[1] 0.54

Your result may differ! (It’s a random sample, so there’s some…randomness involved.) But the estimate I got was 54%. Not terrible, but not terribly accurate either.

Here’s a custom function that draws a sample and returns the percent of respondents in that sample that watch TV news. You should run that function a few times to see how much variability we get from sample to sample.

get_sample_mean <- function(sample_size = 100){
  ces %>% 
    # draw a sample
    slice_sample(n = sample_size) %>%
    # pull the TV news variable
    pull(tv_news_24h) %>% 
    # return the mean
    mean
}

get_sample_mean()
[1] 0.58
get_sample_mean()
[1] 0.61
get_sample_mean(sample_size = 200)
[1] 0.615
get_sample_mean(sample_size = 10)
[1] 0.6

The Sampling Distribution

Now here’s the question that motivates basically all of statistics. How weird can our sample estimates get? On average, do they tend to be close to the truth, or can they be pretty far off the mark? To address this question, let’s introduce the sampling distribution – the distribution of outcomes we would expect to observe if we repeatedly drew samples and computed the estimate.3

# we'll use the replicate() function to repeat that sampling process a large number of times
sampling_distribution <- replicate( 25000, get_sample_mean() )

The object sampling_distribution is a vector with 25,000 sample means. Each one was computed from an independent random sample from the CES dataset. The first sample estimate I got was 54%, but what does this set of 25,000 estimates look like?

head(sampling_distribution)
[1] 0.62 0.60 0.66 0.57 0.54 0.53
ggplot(mapping = aes(x = sampling_distribution)) +
  geom_histogram(color = 'black', binwidth = 0.01) +
  theme_minimal() +
  labs(x = 'Estimate', y = 'Number of Samples',
       title = 'The Sampling Distribution of the Mean')

If I draw a sample of 100 people from a population where 60% have some characteristic, then this is the range of outcomes I should expect to observe. Most of the time, we’ll get a sample estimate between 55% and 65%. An estimate smaller than 50% or larger than 70% is unlikely, but possible.4

Expected Value

What is the mean value of the sampling distribution?

mean(sampling_distribution)
[1] 0.6060284

This is the expected value of our sample statistic – the average value that you would expect to get if you repeatedly sampled from the population. In this case, notice how close the expected value is to the truth! We say that an estimator is unbiased if its expected value equals the true population parameter.

So we just demonstrated that the mean of a random sample is an unbiased estimator of the population mean. No big deal. Just the foundation of all statistics and polling. That if I draw 100 people at random and ask them a question, their responses will, on average, give us a good approximation of what all 300 million Americans would say.

But averages can be deceiving! We don’t just want to know whether we’ll be right on average. We want to know how far from the truth we might end up with our particular sample.

Standard Error

What is the standard deviation of our sampling distribution?5

sd(sampling_distribution)
[1] 0.04884307

The standard deviation of a sampling distribution is is called the standard error, and it’s a very important for understanding hypothesis tests. The larger the standard error, the wider the range of sample statistics that could have been computed from our population, and the less certain we should be about our particular value.

The Central Limit Theorem

Next, notice the shape of the sampling distribution in the figure above. It’s this nice bell curve you may have seen before, called a “normal” distribution (aka a Gaussian distribution).6 Why is the sampling distribution normally shaped? Well that’s one of the most fascinating and magical theorems in all of statistics: the Central Limit Theorem. The sampling distribution of the mean is approximately normally distributed – as long as you have a sufficiently large7 sample size.

Intuition: some samples will, by chance, contain an unrepresentatively large number of TV watchers. Some will contain an unrepresentatively small number of TV watchers. But most of the time, the TV watchers and the non-TV watchers will cancel each other out, such that the mass of the distribution is centered around the truth.

There are a bunch of nice simulations that showing the Central Limit Theorem in action. In class, you’ll get to play around with one.

Normal distributions are nice. They allow us to say precisely how far a sample estimate is likely to deviate from the truth. For example, we know that about 68% of a normal distribution falls within 1 standard deviation of the mean.

expected_value <- mean(sampling_distribution)
standard_error <- sd(sampling_distribution) 
num_draws <- 25000

within_1sd <- sum(sampling_distribution > expected_value - standard_error &
      sampling_distribution < expected_value + standard_error)

within_1sd / num_draws
[1] 0.69112

And about 95% of observations fall within two standard deviations.

within_2sd <- sum(sampling_distribution > expected_value - 2*standard_error &
                        sampling_distribution < expected_value + 2*standard_error)

within_2sd / num_draws
[1] 0.95992

And about 99.7% fall within 3 standard deviations. It’s super rare to get an observation that far away.

within_3sd <- sum(sampling_distribution > expected_value - 3*standard_error &
                        sampling_distribution < expected_value + 3*standard_error)

within_3sd / num_draws
[1] 0.9984

Margin of Error and Confidence Intervals

So if I draw a sample of 100 individuals, then 95% of the time, my sample proportion will be within…

2*sd(sampling_distribution)
[1] 0.09768615

…of the truth. This is what pollsters call the margin of error.

Another way of describing this sampling variability is with a confidence interval. You construct a confidence interval by taking your sample estimate plus or minus 2 standard errors. Because the sample mean is within 2 standard deviations of the truth 95% of the time, the 95% confidence interval will contain the true value approximately 95% of the time.

The Law of Large Numbers

What happens to the standard error when you increase your sample size? Intuitively, you’re getting more data, so you should be more and more confident in your estimate.

Let’s draw a sampling distribution with 400 instead of 100 people.

sampling_distribution <- replicate(25000,
                                   get_sample_mean(sample_size = 400))
ggplot(mapping = aes(x = sampling_distribution)) +
  geom_histogram(color = 'black', binwidth = 0.01) +
  theme_minimal() +
  labs(x = 'Estimate', y = 'Number of Samples',
       title = 'The Sampling Distribution of the Mean')

Check out that sampling variability! Previously, we got values as high as 0.8 and as low as 0.4. Now the highest and lowest values are more like 0.7 and 0.5.

Note that our estimate is still unbiased (centered around the truth):

mean(sampling_distribution)
[1] 0.6059747

But the standard error is half as large.

sd(sampling_distribution)
[1] 0.02415349

This is a general fact about the standard error of the mean. If the population standard deviation equals \(\sigma\), then

\[ \text{Standard Error of the Mean} = \frac{\sigma}{\sqrt{n}} \]

For those interested in the derivation, it’s in the footnotes.8

If the population distribution has higher variance, so too will your sampling distribution, because you can draw a wider range of samples. And if you draw a larger sample, the variance of the sampling distribution decreases. More precisely, the standard error decreases with the square root of \(n\). If you quadruple your sample size, you cut standard error in half.

In the limit, as your sample size gets bigger and bigger, the standard error approaches (but never quite reaches) zero. This is the Law of Large Numbers. Bigger datasets mean more confidence in your estimate. Formally, we say that the sample mean converges in probability to its expected value.

Problem Set

Imagine we were interested in studying whether American men were more or less likely than American women to support import tariffs on Chinese goods.

  1. Load data/ces-2020/cleaned-CES.RData.
  2. What percent of CES respondents support tariffs on China?
  3. What percent of women support tariffs on China? What percent of men support tariffs on China? What is the difference between these two values (aka the difference in means)?
  4. Write a function that draws a sample of 100 respondents and returns a sample estimate of the difference between men and women’s support for tariffs.
  5. Using that function, generate a sampling distribution for this difference in means.
  6. Plot your sampling distribution. Confirm that it is centered around the truth.
  7. How far away from the truth can these sample estimates get? What is the maximum value? What is the minimum value? What is the standard deviation of the sampling distribution?
Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey M. Wooldridge. 2020. “Sampling-Based Versus Design-Based Uncertainty in Regression Analysis.” Econometrica 88 (1): 265–96. https://doi.org/10.3982/ECTA12675.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in r and Stan. 2nd ed. CRC Texts in Statistical Science. Boca Raton: Taylor; Francis, CRC Press.

  1. For a thorough review of the distinction between sampling-based and design-based uncertainty, see Abadie et al. (2020).↩︎

  2. Well, frequentist statistical inference. Bayesians think about uncertainty in a different way (McElreath 2020).↩︎

  3. The sampling distribution is a pure thought experiment, because in real life we can’t actually do this (unless we had a lot of money to repeat the same survey thousands of times).↩︎

  4. And the chance of drawing a sample with 0% is about 1 in ten quintillion-trillion-billion. It will approximately never happen.↩︎

  5. Recall that the variance of a distribution is the average squared distance from the mean, and the standard deviation is the square root of variance.↩︎

  6. Technically it’s a binomial distribution, but we can reserve that distinction for a footnote because binomial distributions are approximately normal when the sample size is large enough.↩︎

  7. How large is “sufficiently large”? I don’t want to give a definitive number, because people will yell at me on Twitter. But, as you see in the figure above, 100 is clearly big enough. 50 might even be okay. Less than 25 is pushing your luck.↩︎

  8. The sample mean is the sum of your all your sample values \((X)\) divided by the sample size \((n)\). \(\bar{X} = \frac{\sum{X_i}}{n}\). The variance of that statistic is \(\text{Var}(\bar{X}) = \text{Var}\left(\frac{\sum{X_i}}{n}\right)\). One useful property of variance is that \(\text{Var}(ax) = a^2\text{Var}(X)\) for any constant \(a\). If \(\text{Var}(X_i) = \sigma^2\), then \(\text{Var}\left(\frac{\sum{X_i}}{n}\right) = \frac{1}{n^2} (n\sigma) = \frac{\sigma^2}{n}\). The square root of that variance, \(\frac{\sigma}{\sqrt{n}}\), is the standard error of the mean.↩︎

References