Text As Data

Overview

This site is a companion to Grimmer, Stewart, and Roberts (2021), an excellent book on how to think about text as data. The authors deliberately omit code when describing their examples.¹ Hence this R code supplement, which was developed during my Summer 2022 graduate-level Text As Data course at the University of Georgia. All the code and data needed to replicate the results on this site are available at the GitHub link on the upper right.

The site is divided into three sections, corresponding to the three stages of any text-as-data workflow:

Harvest the Text: How to carefully choose which texts to include in your corpus, and how to get them from a messy format like HTML or PDF into a plain-text data frame.
Tidy the Text: How to represent large amounts of text quantitatively, and what choices you need to make during the preprocessing stage.
Model the Text: How to build a model to meet your objective, be it prediction, classification, causal inference, or exploration.

For each stage in the workflow, a number of R packages can help, covering web scraping (rvest), optical character recognition (tesseract), tidying (tidytext), topic modeling (topicmodels), sentiment analysis (sentimentr), and many other tasks. On this site, we will walk through several tutorials for these packages, motivated by political science applications, with links to more detailed documentation for those interested in exploring further.
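To give a flavor of the Tidy stage, here is a minimal sketch that tokenizes a tiny made-up corpus with tidytext and counts the words that remain after dropping stop words. The corpus, the object names (`corpus`, `word_counts`), and the decision to remove stop words are assumptions made for illustration; they are not taken from the book or from the tutorials on this site.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus: one row per document (invented for illustration)
corpus <- tibble(
  doc_id = c("speech_1", "speech_2"),
  text = c(
    "The committee will vote on the budget bill tomorrow.",
    "Senators debated the budget and the tax bill."
  )
)

word_counts <- corpus %>%
  unnest_tokens(word, text) %>%          # one lowercased token per row
  anti_join(stop_words, by = "word") %>% # drop common English stop words
  count(doc_id, word, sort = TRUE)       # word frequency within each document

word_counts
```

Whether to remove stop words at all is one of the preprocessing choices that depends on your research question, so treat that line as optional.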

Grimmer, Justin, Brandon M. Stewart, and Margaret E. Roberts. 2021. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, NJ: Princeton University Press.

¹ Wisely, in my view, as books with code can quickly become dated.
