Tree Models

POLS 3220: How to Predict the Future

Today’s Agenda

  • Last time, we introduced linear models as a machine learning tool.

    • The key idea was to make predictions based on a linear combination of predictor variables.
  • Today, we’ll discuss an approach that combines variables in a nonlinear fashion, called classification and regression trees (CART).

    • It looks exactly like the probability trees we worked with in the first half of the semester.

    • Except now we let the computer build them for us!

Motivating Problem

  • Can we predict a movie’s Rotten Tomatoes rating?

  • Predictor Variables: Content Rating, Genre, Release Date, Distributor, Runtime, Number of Audience Reviews

Outside View

If we had no other information, what would we predict based solely on historical base rates?

Rating Total %
3,259 18.4
6,844 38.7
7,565 42.8

We can do better!

Classification Trees

  • A classification tree works by recursively partitioning the dataset (i.e. repeatedly splitting the data into smaller subsets).

[Tree diagram: All Movies are split first by distributor (Criterion/Searchlight/A24 vs. other distributor), then by audience review count (>= 10,000 vs. < 10,000), then by content rating (PG vs. other rating; PG-13 or R vs. other rating).]
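
To make "partitioning" concrete, here is a minimal Python sketch (the movie records and field names are placeholders, not the course dataset) of a single split like the first one in the diagram above:

```python
# A minimal sketch of one partition step, using made-up movie records.
movies = [
    {"title": "Movie A", "distributor": "A24", "audience_count": 15_000},
    {"title": "Movie B", "distributor": "Universal", "audience_count": 80_000},
    {"title": "Movie C", "distributor": "Searchlight", "audience_count": 2_000},
]

def partition(data, condition):
    """Split the data into the rows that satisfy the condition and the rows that don't."""
    yes = [row for row in data if condition(row)]
    no = [row for row in data if not condition(row)]
    return yes, no

# First split in the diagram: Criterion/Searchlight/A24 vs. other distributors.
arthouse, other = partition(
    movies, lambda row: row["distributor"] in {"Criterion", "Searchlight", "A24"}
)
print(len(arthouse), len(other))  # 2 1
```

A classification tree simply repeats this step inside each subset, which is what "recursive partitioning" means.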

Classification Trees

With each partition, your prediction becomes less biased.

Criterion, Searchlight, or A24

Rating Total %
118 37.2
135 42.6
64 20.2

Other Distributors

Rating Total %
3,105 18.4
6,396 37.9
7,357 43.6

Classification Trees

With each partition, your prediction becomes less biased.

Audience Count < 10,000

Rating Total %
1,293 12.0
5,052 46.9
4,427 41.1

Audience Count >= 10,000

Rating Total %
1,811 30.8
1,193 20.3
2,870 48.9

Choosing Splits

  • What is the “best” way to partition the data?

  • Up to now, we’ve been choosing our splits based purely on vibes.

  • A more principled way is to partition the data based on how much information it reveals.

Information Theory: Review

Recall the definitions of information and entropy.

  • Information is how surprised we are when we learn an outcome, equal to \(-\log_2(p)\).

  • Entropy is “expected surprise”: how much surprise we experience on average when we learn the outcome.

  • If we could perfectly predict the Rotten Tomatoes rating of a movie, then entropy would be zero. (We would never be surprised.)

  • Therefore, our optimal strategy is to seek out information that reduces entropy as much as possible.

Information Theory: Review

Rating Total p
3,259 0.184
6,844 0.387
7,565 0.428

Information Theory: Review

Rating Total p -log(p)
3,259 0.184 2.44
6,844 0.387 1.37
7,565 0.428 1.22

\(\text{Entropy} = -\sum p \log_2(p) \approx 1.5\) bits.
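
As a quick check of the arithmetic above, here is a small Python sketch that computes each category's surprise and the overall entropy from the base rates (base-2 logarithms, so the answer is in bits):

```python
import math

# Base rates for the three rating categories from the table above.
p = [0.184, 0.387, 0.428]

# Information (surprise) for each outcome: -log2(p)
surprise = [-math.log2(prob) for prob in p]
print([round(s, 2) for s in surprise])   # [2.44, 1.37, 1.22]

# Entropy is expected surprise: -sum(p * log2(p))
entropy = -sum(prob * math.log2(prob) for prob in p)
print(round(entropy, 2))                 # 1.5 bits
```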

Information Gain

  • Each time you partition the dataset, it reduces entropy.

  • This is because you become more certain about your prediction, and are therefore less likely to be surprised!

  • The information gain from a partition equals how much it reduces entropy: the entropy before the split minus the weighted average of the entropies after the split.

Information Gain

Audience Count < 9,000

Rating Total %
1,328 12.1
5,161 47.1
4,462 40.7

Entropy: 1.41 bits

Audience Count >= 9,000

Rating Total %
1,894 31.5
1,219 20.3
2,899 48.2

Entropy: 1.5 bits
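
Here is a hedged Python sketch of that calculation using the counts in the two tables above (the parent is taken to be the two child tables pooled together): the information gain is the parent's entropy minus the weighted average of the children's entropies.

```python
import math

def entropy(counts):
    """Entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(children):
    """Parent entropy minus the weighted average of the children's entropies."""
    parent = [sum(col) for col in zip(*children)]    # pool the children back together
    n = sum(parent)
    weighted_child_entropy = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy

below_9k = [1_328, 5_161, 4_462]   # Audience Count < 9,000
above_9k = [1_894, 1_219, 2_899]   # Audience Count >= 9,000

print(round(entropy(below_9k), 2))                       # 1.41 bits
print(round(entropy(above_9k), 2))                       # 1.5 bits
print(round(information_gain([below_9k, above_9k]), 2))  # ~0.07 bits
```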

Classification Trees

  • Trying to find the best partition by hand would be tremendously tedious.

  • But computers are great at it.

  • When creating a classification tree, the computer will check thousands of possible partitions, see how much each one reduces entropy, and pick the best one.
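
In practice, you hand this search over to a library. Here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the feature matrix below is made up for illustration, not the course's Rotten Tomatoes dataset:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: a handful of invented movies with invented labels.
movies = pd.DataFrame({
    "is_arthouse_distributor": [1, 0, 1, 0, 0, 1],
    "audience_count":          [12_000, 800, 15_000, 4_000, 25_000, 300],
    "runtime_minutes":         [110, 95, 128, 102, 141, 88],
    "rating_category":         ["high", "low", "high", "medium", "medium", "low"],
})

X = movies.drop(columns="rating_category")
y = movies["rating_category"]

# criterion="entropy" tells the tree to choose splits by information gain,
# the same entropy-reduction idea described above.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict(X.head(2)))
```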

Classification Trees

How Complex Should The Tree Get?

Bias-Variance Tradeoff

  • It’s our old friend, the bias-variance tradeoff!

  • Outside View (zero partitions) is biased.

  • Inside View (lots of partitions) has higher variance, because each prediction is based on less data.

  • Sweet spot is somewhere in the middle.

Bias-Variance Tradeoff

  • In machine learning, this tradeoff is called overfitting vs. underfitting.
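
In practice (assuming scikit-learn, as in the earlier sketch), you steer this tradeoff with complexity parameters such as max_depth and min_samples_leaf: tight limits push the tree toward the outside view, loose limits push it toward the inside view.

```python
from sklearn.tree import DecisionTreeClassifier

# A shallow tree: closer to the outside view (higher bias, lower variance).
shallow_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)

# A deep tree with tiny leaves: closer to the inside view (lower bias, higher variance).
deep_tree = DecisionTreeClassifier(criterion="entropy", max_depth=None, min_samples_leaf=1)
```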

Next Time

  • Next time, we’ll talk about how to hit that “sweet spot” between overfitting and underfitting.

  • And we’ll show how you can harness the wisdom of crowds with machine learning models.

  • Spoiler alert: we’re going to take a bunch of different classification trees and ask them to “vote” on the best prediction.

  • The resulting machine learning model is called a random forest.