POLS 3220: How to Predict the Future
Previously, we introduced classification trees as a machine learning approach.
A flowchart-like prediction model.
Splits the data into smaller and smaller subgroups based on predictor variables.
The “best” tree is the one that minimizes information entropy at each split.
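For reference, the entropy of a subgroup measures how mixed its outcome categories are, where \(p_k\) is the share of observations falling in category \(k\):

\[
H = -\sum_{k} p_k \log_2 p_k
\]

A subgroup that is all one category has entropy 0; an even 50/50 mix of two categories has entropy 1 bit. The tree looks for splits whose resulting subgroups are as pure (low-entropy) as possible.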
The problem with trees is that they can get too complex, overfitting quirks of the training data.
We’ll introduce an approach that solves this overfitting problem in a clever and surprising way.
It’s called random forest (Breiman 2001).
An ensemble approach.
Creates a large number of classification trees and then asks them to “vote” on the correct prediction.
A few reasons I want to teach you this machine learning method:
When would we expect an ensemble approach—a large number of models voting on the correct prediction—to yield better predictions than a single model?
Condorcet Jury Theorem strikes again! Majority rule yields good predictions when:
Number of models is large
\(P(\text{correct}) > 0.5\)
Predictions are made independently
As always, independence is the trickiest criterion to satisfy.
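A quick simulation makes the point. This is just a sketch, with a made-up individual accuracy of 0.55, but it shows how powerful majority rule is when all three conditions hold:

```r
# Condorcet Jury Theorem in miniature: many independent, mediocre models
# voting by majority rule versus any one of them alone.
set.seed(42)

p_correct <- 0.55    # each model is only slightly better than a coin flip
n_models  <- 501     # odd, so a majority vote is never tied
n_trials  <- 10000   # simulated prediction problems

# In each trial, count how many models happen to be correct (independent draws);
# the majority verdict is right whenever more than half of them are.
n_right <- rbinom(n_trials, size = n_models, prob = p_correct)
mean(n_right > n_models / 2)   # close to 1 -- far better than any single model's 0.55
```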
If you train 500 classification trees on the same dataset, they won’t be independent at all.
In fact, they’ll be identical!
The problem is that they’re making predictions on the basis of the exact same set of information.
Random forest uses a clever trick to train a large number of independent trees.
Each classification tree is trained on a random subset of the data, and can only use a random selection of predictor variables at each split.
The result is a forest of classification trees, each of which is flawed in some way.
But as long as each individual tree is more likely than not to get the right answer, the forest as a whole will be wise.
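To make the trick concrete, here is a minimal sketch in R of how one tree in the forest might be grown, using the rpart package. The data frame `movies`, its `rating` column, and the predictor names are hypothetical placeholders, and this version draws the predictor subset once per tree as a simplification; a real random forest re-draws it at every split.

```r
library(rpart)  # fits single classification trees

# Hypothetical data frame `movies` with a factor outcome column `rating`
# and a pool of predictor columns (names are placeholders).
predictors <- c("genre", "runtime", "audience_count", "distributor")

grow_one_tree <- function(data) {
  # 1. Bagging: each tree sees a bootstrap sample of rows, drawn with replacement
  boot <- data[sample(nrow(data), replace = TRUE), ]

  # 2. Random features: this tree may only split on a random subset of predictors
  vars <- sample(predictors, size = floor(sqrt(length(predictors))))

  rpart(reformulate(vars, response = "rating"), data = boot, method = "class")
}

# The "forest" is just a long list of these deliberately different trees;
# to predict, each tree votes and the majority (or average probability) wins.
forest <- lapply(1:500, function(i) grow_one_tree(movies))
```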
As an example, I trained a random forest to predict movies’ Rotten Tomatoes rating categories. Predictor variables included: release date, genre, content rating, director, actors, distributor, runtime, and audience count.
I built 500 classification trees, each trained on a random subset of the data.
I asked each tree to predict the probability that a movie would fall into one of the three rating categories (Certified Fresh, Fresh, or Rotten).
Following best practices, I hid 20% of the data from the model as a test set, to assess how well it performs at out-of-sample prediction.
According to the Rotten Tomatoes Random Forest model, Roofman had a 69% chance of earning a Certified Fresh rating, and a 14% chance of being just regular Fresh.
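Here is a sketch of how that workflow might look in R with the randomForest package. The data frame `movies`, the `new_movie` row, and the column names are hypothetical placeholders, not the actual dataset described above.

```r
library(randomForest)

# Hypothetical data frame `movies`: one row per film, with outcome `rating` as a
# factor with levels "Certified Fresh", "Fresh", "Rotten".
set.seed(3220)

# Hold out 20% of movies as a test set for out-of-sample evaluation
test_rows <- sample(nrow(movies), size = floor(0.2 * nrow(movies)))
train <- movies[-test_rows, ]
test  <- movies[test_rows, ]

# Grow 500 trees, each on a bootstrap sample of the training data, with a random
# subset of predictors considered at every split (controlled by the mtry argument)
rf <- randomForest(rating ~ genre + runtime + audience_count + distributor,
                   data = train, ntree = 500)

# How often does the forest's majority vote match the true rating out of sample?
mean(predict(rf, newdata = test) == test$rating)

# Predicted probability of each rating category for a single new film
# (`new_movie` is a one-row data frame with the same predictor columns)
predict(rf, newdata = new_movie, type = "prob")
```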

Individual classification trees are almost certainly flawed in some way (either underfit or overfit).
But taking the average judgment of a large number of trees, trained on slightly different sets of data and predictor variables, will tend to cancel out their idiosyncratic errors.
Nearly every state-of-the-art machine learning technique today incorporates this insight.
Ensemble models almost always outperform a single complicated model.
But random forest has downsides.
Requires a lot of data.
Not very transparent.
Unlike a linear model, where it’s clear why each prediction comes out the way it does, models like random forest are too complicated to get a good sense of what’s going on inside.
I don’t really have a clear sense of why the model gave Roofman such a high probability of critical acclaim.
Every machine learning approach rests on a crucial assumption:
“The future will be similar to the past”
For a machine learning model to make good predictions, we must assume that the patterns we observe in historical data will continue to hold true going forward.
Machine learning struggles to extrapolate beyond its training data.
If you encounter a truly unprecedented situation unlike anything in the training data, be skeptical of model-based predictions.
Before class next time, follow the instructions here to download R and RStudio on your computer.
Bring it to class, and we’ll start learning some basics of working with datasets and fitting machine learning models!