Random Forests

POLS 3220: How to Predict the Future

Recap

  • Previously, we introduced classification trees as a machine learning approach.

    • A flowchart-like prediction model.

    • Split the data into smaller and smaller subgroups based on predictor variables.

    • At each split, the “best” choice is the one that most reduces information entropy (see the refresher below).

  • The problem with trees is that they can get too complex.

    • Such trees overfit to patterns in the dataset, and may do a poor job predicting out-of-sample.
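
As a quick refresher, entropy measures how mixed a subgroup is: \(H = -\sum_k p_k \log_2 p_k\), where \(p_k\) is the share of observations in the subgroup that belong to class \(k\). A subgroup containing only one class has entropy 0, a 50/50 mix of two classes has entropy 1, and splits that sort the data into purer subgroups drive entropy down.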

Today’s Agenda

  • We’ll introduce an approach that solves this overfitting problem in a clever and surprising way.

  • It’s called random forest (Breiman 2001).

    • An ensemble approach.

    • Creates a large number of classification trees and then asks them to “vote” on the correct prediction.

  • A few reasons I want to teach you this machine learning method:

    1. Nearly state-of-the-art prediction performance in machine learning competitions.
    2. Works very well “out-of-the-box”. Can implement it with just a few lines of code.
    3. If you understand how classification trees work, then it’s straightforward to grasp what random forest is doing.

Ensemble Models

  • When would we expect an ensemble approach—a large number of models voting on the correct prediction—to yield better predictions than a single model?

  • Condorcet Jury Theorem strikes again! Majority rule yields good predictions (see the simulation sketch after this list) when:

    • Number of models is large

    • \(P(\text{correct}) > 0.5\)

    • Predictions are made independently

  • As always, independence is the trickiest criterion to satisfy.

    • If you train 500 classification trees on the same dataset, they won’t be independent at all.

    • In fact, they’ll be identical!

    • The problem is that they’re making predictions on the basis of the exact same set of information.
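
A minimal simulation sketch of the theorem in action (the 500 voters and 55% individual accuracy are illustrative assumptions, not numbers from the lecture):

    # How often does a majority vote get the right answer if each of 500
    # independent "voters" is individually correct only 55% of the time?
    set.seed(3220)

    n_models  <- 500      # number of independent models voting
    p_correct <- 0.55     # each model's individual accuracy (assumed)
    n_trials  <- 10000    # number of simulated predictions

    # For each prediction, count the correct votes and check for a majority
    correct_votes <- rbinom(n_trials, size = n_models, prob = p_correct)
    mean(correct_votes > n_models / 2)   # roughly 0.99 -- far better than 0.55

The independence condition is doing the real work here: if all 500 voters made the same mistakes, the majority would be no more accurate than any single voter.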

Random Forest

  • Random forest uses a clever trick to train a large number of independent trees.

  • Each classification tree is trained on a random sample of the data (drawn with replacement), and at each split it can only use a random selection of the predictor variables (sketched in code below).

    • This is why it’s called “random” forest.

  • The result is a forest of classification trees, each of which is flawed in some way.

  • But as long as each individual tree is more likely than not to get the right answer, the forest as a whole will be wise.
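
A short sketch of those two sources of randomness in R (the movies data frame and its columns are hypothetical placeholders, not the course dataset):

    # Two sources of randomness behind each tree in the forest
    # ("movies" and its columns are hypothetical placeholders)
    predictors <- c("genre", "director", "runtime", "audience_count")
    n <- nrow(movies)

    # 1. Each tree is grown on a bootstrap sample: rows drawn at random with
    #    replacement, so some movies appear twice and others not at all
    boot_rows <- sample(n, size = n, replace = TRUE)
    tree_data <- movies[boot_rows, ]

    # 2. At each split, the tree may only consider a random handful of the
    #    predictors (commonly about the square root of the total number)
    candidate_vars <- sample(predictors, size = floor(sqrt(length(predictors))))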

The Rotten Tomatoes Random Forest

  • Predictor variables included: release date, genre, content rating, director, actors, distributor, runtime, and audience count.

  • I built 500 classification trees, each trained on a random subset of the data.

  • I asked each tree to predict the probability that a movie would fall into one of the three rating categories (Certified Fresh, Fresh, or Rotten).

  • Following best practices, I hid 20% of the data from the model as a test set, to assess how well it performs at out-of-sample prediction (a short code sketch of this workflow follows).
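
A minimal R sketch of that workflow, assuming a data frame called movies with a tomatometer outcome column (the object, column, and predictor names are hypothetical, not the actual course code):

    # Minimal sketch of the workflow described above
    # ("movies", "tomatometer", and the predictors are hypothetical names)
    library(randomForest)

    movies$tomatometer <- factor(movies$tomatometer)  # Certified Fresh / Fresh / Rotten

    # Hold out 20% of the movies as a test set
    set.seed(3220)
    test_rows <- sample(nrow(movies), size = round(0.2 * nrow(movies)))
    train <- movies[-test_rows, ]
    test  <- movies[test_rows, ]

    # Grow 500 trees, each on a bootstrap sample with random predictor subsets
    rf <- randomForest(tomatometer ~ genre + director + runtime + audience_count,
                       data = train, ntree = 500)

    # Predicted probability of each rating category for the held-out movies
    pred_probs <- predict(rf, newdata = test, type = "prob")
    head(pred_probs)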

Out-of-Sample Prediction

But What About Roofman???

According to the Rotten Tomatoes Random Forest model, Roofman had a 69% chance of earning a Certified Fresh rating, and a 14% chance of being just regular Fresh.

Advantages of Random Forest

  • Individual classification trees are almost certainly flawed in some way (either underfit or overfit).

  • But taking the average judgment of a large number of trees, trained on slightly different sets of data and predictor variables, will tend to cancel out their idiosyncratic errors.

  • Nearly every state-of-the-art machine learning technique today incorporates this insight.

  • Ensemble models almost always outperform a single complicated model.

Limitations

  • Requires a lot of data.

  • Not very transparent.

    • Unlike a linear model, whose coefficients make it clear why it predicts what it does, a random forest is too complicated for us to get a good sense of what’s going on inside.

    • I don’t really have a clear sense of why the model gave Roofman such a high probability of critical acclaim.

  • Every machine learning approach rests on a crucial assumption:

    • “The future will be similar to the past”

    • For a machine learning model to make good predictions, we must assume that the patterns we observe in historical data will continue to hold true going forward.

    • Machine learning struggles to extrapolate beyond its training data.

    • If you encounter a truly unprecedented situation unlike anything in the training data, be skeptical of model-based predictions.

Next Time

  • Before class next time, follow the instructions here to download R and RStudio on your computer.

  • Bring it to class, and we’ll start learning some basics of working with datasets and fitting machine learning models!

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.