Bell Curves

POLS 3220: How to Predict the Future

Today’s Agenda

  • Introduce the bell curve probability distribution
    • AKA the normal distribution or the Gaussian distribution
  • Understand the conditions that create bell curves (the Central Limit Theorem)
  • Explore some useful features of bell curves for making predictions

Warmup

  • In an upcoming Congressional election:
    • 60% of voters plan to vote for the Republican candidate.
    • 40% of voters plan to vote for the Democratic candidate.
  • A polling firm randomly calls voters in the Congressional district and asks who they plan to vote for.

Draw a probability tree representing the first two voters contacted by the polling firm. What is the probability of every possible poll result?

[Probability tree: Start Poll branches to R (60%) or D (40%) for the first voter; each branch then splits again into R (60%) or D (40%) for the second voter.]

  • RR (36%)
  • RD (24%)
  • DR (24%)
  • DD (16%)
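The tree can also be enumerated in code. This is a minimal sketch that assumes each call is independent and every voter has the same 60/40 chance of answering R or D; the names `support` and `ordered_outcomes` are made up for this illustration.

```python
from itertools import product

# Share of voters backing each party, taken from the warmup.
support = {"R": 0.60, "D": 0.40}

def ordered_outcomes(n_voters):
    """Return {sequence: probability} for every ordered poll result."""
    results = {}
    for sequence in product(support, repeat=n_voters):
        probability = 1.0
        for answer in sequence:
            probability *= support[answer]  # multiply along the branch
        results["".join(sequence)] = probability
    return results

two_voters = ordered_outcomes(2)
for sequence, probability in two_voters.items():
    print(f"{sequence}: {probability:.0%}")
```

Each printed line corresponds to one path through the tree, and the four probabilities add up to 100%.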

Simplifying Assumption

Because we don’t care about the order of responses (just the counts), we can combine some outcomes:

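[Probability tree: Start Poll branches to 1R,0D (60%) or 0R,1D (40%); each node then branches again into R (60%) or D (40%), and paths with the same counts are merged.]

  • 2R,0D (36%)
  • 1R,1D (48%)
  • 0R,2D (16%)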
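Merging paths with the same counts amounts to multiplying the probability of one path by the number of paths that produce those counts. A rough sketch, assuming independent responses; `poll_distribution` is a hypothetical helper name, not something from the course materials.

```python
from math import comb

def poll_distribution(n, p_r=0.60):
    """Return {number of R responses: probability} for a poll of n voters."""
    return {
        k: comb(n, k) * p_r**k * (1 - p_r) ** (n - k)  # paths x probability per path
        for k in range(n + 1)
    }

for r_count, probability in poll_distribution(2).items():
    print(f"{r_count}R,{2 - r_count}D: {probability:.0%}")
```

The same function works for any poll size, which avoids drawing the tree by hand as `n` grows.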

Things To Notice

  • A poll with two responses is pretty worthless.

    • 48% of the time you get an equal number of R’s and D’s

    • 16% of the time, you get all D’s

    • 36% of the time, you get all R’s

  • Two big problems:

    • The poll is biased. Result is more likely to be wrong in one direction than the other.

    • It also has high variance. Result is, on average, very far from the truth.
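Both problems can be put into numbers for the two-voter poll. The sketch below uses the slide's informal definitions (how often the poll misses in each direction, and how far off it is on average) and assumes independent responses; the variable names are illustrative.

```python
from math import comb

TRUE_R_SHARE = 0.60  # true Republican support from the warmup
n = 2                # poll size

too_low = too_high = mean_abs_error = 0.0
for k in range(n + 1):
    probability = comb(n, k) * TRUE_R_SHARE**k * (1 - TRUE_R_SHARE) ** (n - k)
    error = k / n - TRUE_R_SHARE  # polled R share minus the truth
    if error < 0:
        too_low += probability
    elif error > 0:
        too_high += probability
    mean_abs_error += probability * abs(error)

print(f"Poll understates R support: {too_low:.0%}")
print(f"Poll overstates R support:  {too_high:.0%}")
print(f"Average miss: {mean_abs_error * 100:.1f} percentage points")
```

With only two respondents the poll can never report 60%, so every result is a miss in one direction or the other.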

Increasing Poll Size

Over the next few slides, we’ll show that increasing the size of the poll \((n)\) does three things:

  1. Eliminates bias
  2. Reduces variance
  3. Gives the polling errors a particular (bell curve) shape.

n=3

[Probability tree: three levels, each branching R (60%) or D (40%); intermediate nodes 1R,0D and 0R,1D, then 2R,0D, 1R,1D, and 0R,2D.]

  • 3R,0D (21.6%)
  • 2R,1D (43.2%)
  • 1R,2D (28.8%)
  • 0R,3D (6.4%)

We can plot these compound probabilities on a bar chart.
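As a stand-in for the chart, here is a small text-only bar chart of the n=3 probabilities. It is only a sketch: `text_bar_chart` and its `width` parameter are invented for this example, and responses are assumed independent.

```python
from math import comb

def text_bar_chart(n, p_r=0.60, width=50):
    """Return one text bar per outcome, from all-R down to all-D."""
    lines = []
    for k in range(n, -1, -1):
        probability = comb(n, k) * p_r**k * (1 - p_r) ** (n - k)
        bar = "#" * round(probability * width)  # bar length proportional to probability
        lines.append(f"{k}R,{n - k}D {bar} {probability:.1%}")
    return lines

print("\n".join(text_bar_chart(3)))
```

Calling `text_bar_chart` with larger values of `n` shows the bars gradually filling in a bell-like outline.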

n=4

[Probability tree: four levels, each branching R (60%) or D (40%); intermediate nodes run from 1R,0D and 0R,1D through 3R,0D, 2R,1D, 1R,2D, and 0R,3D.]

  • 4R,0D (13%)
  • 3R,1D (34.6%)
  • 2R,2D (34.6%)
  • 1R,3D (15.4%)
  • 0R,4D (2.6%)

I would never make you do this by hand.

n=5

[Probability tree: five levels, each branching R (60%) or D (40%); intermediate nodes run from 1R,0D and 0R,1D through 4R,0D, 3R,1D, 2R,2D, 1R,3D, and 0R,4D.]

  • 5R,0D (7.8%)
  • 4R,1D (25.9%)
  • 3R,2D (34.6%)
  • 2R,3D (23%)
  • 1R,4D (7.7%)
  • 0R,5D (1%)

Intuition Check: Why is the probability of 3R,2D so big and the probability of 0R,5D so small?
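One way to check the intuition is to split each outcome into the number of paths that lead to it and the probability of any single path. A minimal sketch under the same independence assumption; `paths_and_probability` is a hypothetical helper.

```python
from math import comb

p_r, p_d = 0.60, 0.40

def paths_and_probability(r, d):
    """Return (number of paths, probability of one path, total probability)."""
    n_paths = comb(r + d, r)        # orderings of r R's and d D's
    per_path = p_r**r * p_d**d      # product of the branch probabilities
    return n_paths, per_path, n_paths * per_path

for r, d in [(3, 2), (0, 5)]:
    n_paths, per_path, total = paths_and_probability(r, d)
    print(f"{r}R,{d}D: {n_paths} paths x {per_path:.5f} each = {total:.1%}")
```

There are ten different orderings that give 3R,2D but only one that gives 0R,5D, and that single path is also less likely because it uses the 40% branch five times.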

n=6

[Probability tree: six levels, each branching R (60%) or D (40%); intermediate nodes run from 1R,0D and 0R,1D through 5R,0D, 4R,1D, 3R,2D, 2R,3D, 1R,4D, and 0R,5D.]

  • 6R,0D (4.7%)
  • 5R,1D (18.7%)
  • 4R,2D (31.1%)
  • 3R,3D (27.6%)
  • 2R,4D (13.8%)
  • 1R,5D (3.7%)
  • 0R,6D (0.4%)

Things To Notice

  • With a large enough sample, the poll results are unbiased: centered on the truth, and roughly equally likely to be too high or too low.

  • Variance shrinks with poll size.

    • In a poll of 50 voters, there’s a real chance the result is off by 10 percentage points or more, which is enough to call the election incorrectly.
    • In a poll with 500 voters, it’s practically impossible that the result will be off by more than 10 percentage points.
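The comparison between the two poll sizes can be checked by adding up exact probabilities. This sketch assumes independent responses and a true Republican share of 60%; `prob_miss` and its parameters are illustrative names.

```python
from math import comb

def prob_miss(n, miss_points=10, true_pct=60):
    """Probability that the polled R share misses the truth by at least miss_points."""
    p = true_pct / 100
    total = 0.0
    for k in range(n + 1):
        # Integer form of |100*k/n - true_pct| >= miss_points, which avoids float rounding.
        if abs(100 * k - true_pct * n) >= miss_points * n:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

for n in (50, 500):
    print(f"n={n}: P(off by 10+ points) = {prob_miss(n):.6f}")
```

The probability for 500 respondents comes out several orders of magnitude smaller than the one for 50.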

Central Limit Theorem

  • As poll size gets larger, the shape of the errors takes on that gorgeous bell curve shape.

  • This is one of the most foundational ideas in all of statistics.

Central Limit Theorem

If an outcome is the sum of a large number of independent random events, then it will (approximately) fall on a bell curve.
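A quick simulation can illustrate the idea. The sketch below adds up 200 independent yes/no events many times and checks how many of the sums land within two standard deviations of the expected value, which on a bell curve is roughly 95%. The event count, trial count, and seed are arbitrary choices for this example.

```python
import random
from math import sqrt

N_EVENTS, P = 200, 0.60

def simulate_sums(n_events=N_EVENTS, p=P, n_trials=5000, seed=3220):
    """Simulate n_trials sums of n_events independent yes/no events."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(n_events)) for _ in range(n_trials)]

sums = simulate_sums()
mean = N_EVENTS * P                   # expected value of the sum
sd = sqrt(N_EVENTS * P * (1 - P))     # standard deviation of the sum
share_within_2sd = sum(abs(s - mean) <= 2 * sd for s in sums) / len(sums)
print(f"Share of sums within 2 SDs of the mean: {share_within_2sd:.1%}")
```

A histogram of `sums` would show the familiar bell shape centered near 120.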

Bell Curves In The Wild

Human height is the sum of a large number of independent genetic and environmental factors, so…

Bell Curves In The Wild

Standardized test scores are the sum of a large number of independent question scores, so…

Bell Curves In The Wild

College football scores are the sum of a large number of independent successes / failures to get the ball to the other end of the field, so…

College football scores (relative to the Vegas “spread”)

Bell Curves Are Nice

  • When outcomes fall on a bell curve, it makes prediction a lot easier.

  • That’s because outcomes are very unlikely to stray far from their expected values.

Bell Curves Are Nice

95% of poll results will be one of the red bars.

Margin of Error

  • Define the margin of error as the range within which you’re 95% sure your polling error will fall.

  • The back-of-the-envelope approximation of a poll’s margin of error is \(\frac{100\%}{\sqrt{n}}\).

    • So, for a poll with 100 respondents, margin of error is roughly \(\frac{100\%}{\sqrt{100}} = 10\%\), i.e. plus or minus 10 percentage points.

    • Practice: what’s the margin of error for a poll with 400 respondents?
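The rule of thumb can be compared with the usual normal-approximation formula, \(1.96\sqrt{p(1-p)/n}\). That formula is not from these slides; it is brought in here only as a reference point, and both function names are hypothetical.

```python
from math import sqrt

def rule_of_thumb_moe(n):
    """Back-of-the-envelope margin of error, in percentage points."""
    return 100 / sqrt(n)

def normal_approx_moe(n, p=0.60):
    """95% margin of error from the normal approximation, in percentage points."""
    return 100 * 1.96 * sqrt(p * (1 - p) / n)

for n in (100, 1000):
    print(f"n={n}: rule of thumb {rule_of_thumb_moe(n):.1f}, "
          f"normal approximation {normal_approx_moe(n):.1f}")
```

The rule of thumb is slightly conservative: it matches the normal approximation most closely when support is near 50% and overstates the margin a little otherwise.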

Wrap Up

  • Your outcome will (approximately) fall on a bell curve if it is the sum of a large number of independent random events (Central Limit Theorem).

  • If the theorem holds, it’s great for making predictions, because bell curves are easy to work with.

    • In a few weeks, we’ll talk about the ways in which real-world polls fall short of this idealized model.
  • Next Time: What happens when that independence assumption is violated?