Text As Data: Federalist Paper Authorship

A great entry point for modeling text as data is the now classic problem of guessing who wrote 15 disputed Federalist Papers (Mosteller and Wallace 1964). To do so, we’ll look at the distribution of words that Hamilton, Madison, and Jay tend to write and try to detect their stylistic patterns in the disputed texts. We’ll use two approaches, outlined in Grimmer, Stewart, and Roberts (2021) Chapters 6 and 7: the multinomial model and the vector space model.

Get The Data

We’ll start by scraping the Federalist Paper corpus from Project Gutenberg.

library(tidyverse)
library(tidytext)
library(rvest)

# read the raw HTML
page <- read_html('https://www.gutenberg.org/cache/epub/18/pg18-images.html')

# get all the chapters
chapters <- html_elements(page, '.chapter')

# get just the text from each chapter
text <- html_text2(chapters)

d <- tibble(text)

# get rid of the slightly different version of Federalist 70
d <- d |>
  filter(str_detect(text, 'slightly different version', negate = TRUE))

# create a column for the title and attributed author
d <- d |>
  mutate(author = text |>
           str_extract('HAMILTON AND MADISON|HAMILTON OR MADISON|HAMILTON|MADISON|JAY') |>
           str_to_title(),
         title = str_extract(text, 'No. [A-Z].*'))

Multinomial Model

To motivate this model, suppose that each author owns a literal bag of words. When they are writing, they randomly draw each word from the bag, replacing it when they’re done writing it down. Evidently, Alexander Hamilton’s word bag has a lot of the word “upon”.

tidy_federalist <- d |> 
  # tokenize to the word level
  unnest_tokens(input = 'text',
                output = 'word')

# keep only a selection of stop words
interesting_words <- c('although', 'always', 
                     'commonly', 'consequently',
                     'considerable', 'heretofore', 
                     'upon', 'whilst')

tidy_federalist <- filter(tidy_federalist,
                          word %in% interesting_words)

library(wordcloud2)

tidy_federalist |> 
  filter(author == 'Hamilton') |> 
  count(word) |> 
  wordcloud2()

By comparison, James Madison’s bag of words has relatively more “whilst”.

tidy_federalist |> 
  filter(author == 'Madison') |> 
  count(word)

# A tibble: 7 × 2
  word             n
  <chr>        <int>
1 although         8
2 always           7
3 consequently    13
4 considerable     5
5 heretofore       1
6 upon             7
7 whilst          12

Obviously this model leaves out a lot of detail about how authors actually write. It ignores word order, syntax, meaning, context, and intent. But for certain questions, such a simplified model may nevertheless prove useful. Let’s consider Federalist Paper No. 18, counting the frequency of “upon”, “whilst”, and the other interesting stop words listed above.¹

tidy_federalist |> 
  filter(title == 'No. XVIII.') |> 
  count(word)

# A tibble: 4 × 2
  word             n
  <chr>        <int>
1 always           1
2 considerable     1
3 upon             1
4 whilst           1

What’s the likelihood that this set of word counts would have been generated by random draws from Hamilton’s bag of words, compared to Madison’s or Jay’s? (What’s the chance that Hamilton would have written a paper with so few “upon”s?) To estimate, we’ll first compute the word vectors for each author and for the disputed paper.

# get the frequencies of the interesting words in each author's corpus
bags_of_words <- tidy_federalist |>
  filter(author %in% c('Hamilton', 'Madison', 'Jay')) |>
  # convert these to factors so count() doesn't drop the zero counts
  mutate(author = factor(author),
         word = factor(word)) |> 
  count(author, word, .drop = FALSE) |>
  # sort the words in alphabetical order
  arrange(author, word)

# pull the vectors for Hamilton, Madison, and Jay
hamilton_vector <- bags_of_words |>
  filter(author == 'Hamilton') |>
  pull(n) |>
  # add names to make the vector more readable
  set_names(interesting_words)

hamilton_vector

    although       always     commonly consequently considerable 
           1           62           23            4           46 
  heretofore         upon       whilst 
          13          374            1

madison_vector <- bags_of_words |>
  filter(author == 'Madison') |>
  pull(n) |>
  # add names to make the vector more readable
  set_names(interesting_words)

madison_vector

    although       always     commonly consequently considerable 
           8            7            0           13            5 
  heretofore         upon       whilst 
           1            7           12

jay_vector <- bags_of_words |>
  filter(author == 'Jay') |>
  pull(n) |>
  # add names to make the vector more readable
  set_names(interesting_words)

jay_vector

    although       always     commonly consequently considerable 
           5            8            1            4            1 
  heretofore         upon       whilst 
           1            1            0

# now get the vector for Federalist No. 18
fed18_vector <- tidy_federalist |>
  mutate(word = factor(word)) |> 
  filter(title == 'No. XVIII.') |>
  count(word, .drop = FALSE) |>
  pull(n) |>
  # add names to make the vector more readable
  set_names(interesting_words)

fed18_vector

    although       always     commonly consequently considerable 
           0            1            0            0            1 
  heretofore         upon       whilst 
           0            1            1

Next we’ll use the dmultinom() function to estimate the likelihood that the fed18_vector would have been drawn from each of the authors’ bags of words.

dmultinom(x = fed18_vector,
          prob = hamilton_vector)

[1] 0.0003395527

dmultinom(x = fed18_vector,
          prob = madison_vector)

[1] 0.008942421

dmultinom(x = fed18_vector,
          prob = jay_vector)

[1] 0

The computation above makes a very strong assumption about Jay: because he never uses the word “whilst” in any of his Federalist papers, we assume that he would never ever use the word whilst in another paper. We can do better by regularizing our estimates, adding a small positive number to each vector (Laplace smoothing) to encode the possibility that Jay might someday use the word “whilst”, even if we’ve never seen him do it.

hamilton_likelihood <- dmultinom(x = fed18_vector,
                                 prob = hamilton_vector + 0.1)

madison_likelihood <- dmultinom(x = fed18_vector,
                                prob = madison_vector + 0.1)

jay_likelihood <- dmultinom(x = fed18_vector,
                            prob = jay_vector + 0.1)

# likelihood ratios
madison_likelihood / hamilton_likelihood

[1] 23.90813

madison_likelihood / jay_likelihood

[1] 85.56551

Since this paper is roughly 24 times more likely to have been generated from Madison’s bag of words over Hamilton’s (and roughly 86 times more likely than Jay’s), we can conclude with some degree of confidence that he was the author.

Vector Space Model

Another way to model the bag of words is to think of each set of word counts as a multidimensional vector. The angle between two such vectors gives us a sense of the two documents’ similarity to one another. If the angle is zero, then both documents have the exact same mix of words (though maybe one document is longer than the other). If the angle is 90 degrees, then the two documents are orthogonal – as different a mix of words as they possibly could be.

Cosine similarity captures this idea, because the cosine of 90 degrees is zero, and the cosine of 0 degrees is 1. Let’s compute the cosine similarity between Madison, Hamilton, and Jay’s known writings with the term vector from the disputed Federalist No. 18.

# define cosine similarity
cosine_similarity <- function(x1, x2){
  sum(x1*x2) / sqrt(sum(x1^2)) / sqrt(sum(x2^2))
}

# get the cosine similarity for Federalist 18 with all the authors
cosine_similarity(fed18_vector, hamilton_vector)

[1] 0.630843

cosine_similarity(fed18_vector, madison_vector)

[1] 0.6924889

cosine_similarity(fed18_vector, jay_vector)

[1] 0.4789131

This yields a similar result to what we found with the multinomial model. The cosine similarity between the disputed paper and Madison’s other papers is largest, suggesting he is the author.

Validation, Validation, Validation

Whenever you develop a method to measure or predict some quantity of interest, it is imperative that you first assess its performance against a known benchmark before applying it to new data. This process is called validation, and Grimmer, Stewart, and Roberts (2021) repeatedly emphasize how central it is to the text-as-data workflow. So the procedure we just created predicts that Madison wrote Federalist No. 18. How much should we trust that prediction?

To gain confidence in our approach, let’s see what it predicts for papers where authorship is not in dispute. If it correctly predicts the authors of the known texts, then we can be more certain it’s doing a good job with the unknown texts.

For the validation test, we’ll repeat the same steps as before, except we hold out the validation set when training the model. If we’re trying to predict whether Madison wrote Federalist 10, then it would be cheating to include Federalist 10 in the vector of things we know about Madison’s writing style.

# get the Federalist 10 word count vector
fed10_vector <- tidy_federalist |>
  mutate(word = factor(word)) |> 
  filter(title == 'No. X.') |>
  count(word, .drop = FALSE) |>
  pull(n) |>
  # add names to make the vector more readable
  set_names(interesting_words)

fed10_vector

    although       always     commonly consequently considerable 
           0            2            0            1            0 
  heretofore         upon       whilst 
           0            0            0

# Recompute Madison's word count vector, omitting Federalist 10
madison_vector <- tidy_federalist |> 
  mutate(word = factor(word)) |> 
  filter(title != 'No. X.',
         author == 'Madison') |> 
  count(word, .drop = FALSE) |> 
  pull(n) |> 
  set_names(interesting_words)

madison_vector

    although       always     commonly consequently considerable 
           8            5            0           12            5 
  heretofore         upon       whilst 
           1            7           12

dmultinom(x = fed10_vector,
          prob = hamilton_vector)

[1] 0.0003206053

dmultinom(x = fed10_vector,
          prob = madison_vector)

[1] 0.0072

dmultinom(x = fed10_vector,
          prob = jay_vector)

[1] 0.08292841

Already we can see a problem. The words we chose before may do a good job distinguishing between Hamilton and Madison (it’s inconceivable that Hamilton would have written an entire paper without a single use of the word “upon”, so the model correctly rules him out), but it struggles with distinguishing Jay’s style from Madison’s style, especially for a paper that only contains 3 of the chosen stop words. Because the word “always” appears the most frequently, the model incorrectly predicts that Jay wrote Federalist No. 10, not Madison.

Let’s see if we can find a set of discriminating words that do a better job distinguishing between all three authors.² To do so, I’ll start with the list of stop words from Mosteller and Wallace (1964) and add in the list of stop words from the tidytext package.

mw1964_words <- c("a", "all", "also", "an", "and", "any", "are", "as", "at", "be", "been", "but", "by", "can", "do", "down",
                  "even", "every", "for", "from", "had", "has", "have", "her", "his", "if", "in", "into", "is", "it", "its",
                  "may", "more", "must", "my", "no", "not", "now", "of", "on", "one", "only", "or", "our", "shall", "should",
                  "so", "some", "such", "than", "that", "the", "their", "then", "there", "things", "this", "to", "up", "upon",
                  "was", "were", "what", "when", "which", "who", "will", "with", "would", "your")

all_stopwords <- union(mw1964_words, get_stopwords()$word)

This yields a list of 186 potential words. Next, we’ll count up how frequently each author used each word, keeping only the words that one author used three times more frequently than another auhor (the number three is arbitrary, but as we’ll see it yields a manageable list of distinctive words).

tidy_federalist <- d |>
  # tokenize to the word level
  unnest_tokens(input = 'text',
                output = 'word') |>
  filter(word %in% all_stopwords) |>
  mutate(author = factor(author),
         word = factor(word))

frequency_table <- tidy_federalist |>
  filter(author %in% c('Hamilton', 'Jay', 'Madison')) |>
  count(author, word, .drop = FALSE) |>
  pivot_wider(names_from = 'author',
              values_from = 'n') |>
  # normalize by each author's number of words
  mutate(Hamilton = Hamilton / sum(Hamilton),
         Jay = Jay / sum(Jay),
         Madison = Madison / sum(Madison)) |>
  mutate(Hamilton_Jay = Hamilton / Jay,
         Hamilton_Madison = Hamilton / Madison,
         Jay_Madison = Jay / Madison) |>
  # just keep the words that one author uses 3 times more often than another
  filter(Hamilton_Jay > 3 |
           Hamilton_Jay < 0.33333 |
           Hamilton_Madison > 3 |
           Hamilton_Madison < 0.33333 |
           Jay_Madison > 3 |
           Jay_Madison < 0.33333)

This leaves us with 38 words to include in our multinomial model / document vectors.

interesting_words <- factor(frequency_table$word)

interesting_words

 [1] about      above      again      also       before     below     
 [7] did        down       during     every      further    having    
[13] here       hers       herself    himself    his        how       
[19] itself     me         my         myself     off        ought     
[25] our        ours       ourselves  over       she        theirs    
[31] there      through    up         upon       when       where     
[37] while      whom       yourselves
39 Levels: about above again also before below did down ... yourselves

Let’s see how the modified list performs on the validation test.

# keep only the new list of words
tidy_federalist <-
  tidy_federalist |>
  filter(word %in% interesting_words) |> 
  mutate(word = factor(word))

# compute the new Federalist 10 vector
fed10_vector <- tidy_federalist |>
  filter(title == 'No. X.') |>
  count(word, .drop = FALSE) |>
  pull(n) |>
  # add names to make the vector more readable
  set_names(interesting_words)

fed10_vector

     about      above      again       also     before      below 
         0          0          2          0          0          0 
       did       down     during      every    further     having 
         0          0          0          3          0          1 
      here       hers    herself    himself        his        how 
         1          0          0          1          8          0 
    itself         me         my     myself        off      ought 
         2          1          0          0          0          2 
       our       ours  ourselves       over        she     theirs 
         8          0          0          6          0          0 
     there    through         up       upon       when      where 
         6          2          1          0          2          2 
     while       whom yourselves 
         0          0          0

# get word count vectors for each author
madison_vector <- tidy_federalist |> 
  filter(title != 'No. X.',
         author == 'Madison') |> 
  count(word, .drop = FALSE) |> 
  pull(n) |> 
  set_names(interesting_words)

madison_vector

     about      above      again       also     before      below 
         1          2         11         31         13          2 
       did       down     during      every    further     having 
         3          3         11         91         18          7 
      here       hers    herself    himself        his        how 
        23          1          2          6         49         13 
    itself         me         my     myself        off      ought 
        27          3          4          4          1         59 
       our       ours  ourselves       over        she     theirs 
        41          0          3         37          3          2 
     there    through         up       upon       when      where 
        29          6          2          7         20         35 
     while       whom yourselves 
         0          9          0

hamilton_vector <- tidy_federalist |> 
  filter(author == 'Hamilton') |> 
  count(word, .drop = FALSE) |> 
  pull(n) |> 
  set_names(interesting_words)

hamilton_vector

     about      above      again       also     before      below 
        20          5          5         36         31          2 
       did       down     during      every    further     having 
        12         14         17        179         33         32 
      here       hers    herself    himself        his        how 
        33          0          5         34        239         53 
    itself         me         my     myself        off      ought 
        81         15         39         15          3        159 
       our       ours  ourselves       over        she     theirs 
       191          7         21         65         16          1 
     there    through         up       upon       when      where 
       379         27         34        374        110         86 
     while       whom yourselves 
        36         34          0

jay_vector <- tidy_federalist |> 
  filter(author == 'Jay') |> 
  count(word, .drop = FALSE) |> 
  pull(n) |> 
  set_names(interesting_words)

jay_vector

     about      above      again       also     before      below 
         0          2          1         11          0          0 
       did       down     during      every    further     having 
         6          0          0          5          0          5 
      here       hers    herself    himself        his        how 
         1          0          0          0          5          9 
    itself         me         my     myself        off      ought 
         1          3          3          0          1          2 
       our       ours  ourselves       over        she     theirs 
        38          0          3          2          4          1 
     there    through         up       upon       when      where 
        10          0          0          1         15          2 
     while       whom yourselves 
         2         10          1

dmultinom(x = fed10_vector,
          prob = madison_vector + 0.1) /
dmultinom(x = fed10_vector,
          prob = hamilton_vector + 0.1)

[1] 76.72475

dmultinom(x = fed10_vector,
          prob = madison_vector + 0.1) /
dmultinom(x = fed10_vector,
          prob = jay_vector + 0.1)

[1] 1.164301e+12

cosine_similarity(madison_vector, fed10_vector)

[1] 0.7437931

cosine_similarity(hamilton_vector, fed10_vector)

[1] 0.7210126

cosine_similarity(jay_vector, fed10_vector)

[1] 0.6583766

This is much nicer. Both methods now correctly identify Madison as the author of Federalist 10. A more thorough model-development loop might iterate through validation and refinement a few more times, holding out a different validation set each time. But this is the basic workflow.

Practice Problems

Conduct more validation tests on undisputed Federalist Papers to see if the list of words we came up with in the last section does a good job.
See if you can create a model that accurately classifies whether a State of the Union speech was delivered in the 19th or 20th century.

Grimmer, Justin, Brandon M. Stewart, and Margaret E. Roberts. 2021. Text as Data: A New Framework for Machine Learning and the Social Sciences. S.l.: Princeton University Press.

Imai, Kosuke. 2017. Quantitative Social Science: An Introduction. Princeton: Princeton University Press.

Mosteller, Frederick, and David L. Wallace. 1964. Inference and Disputed Authorship: The Federalist. Addison-Wesley.

This list was inspired by Chapter 5 of Imai (2017). More on how to identify discriminating words in a later section.↩︎
See Grimmer, Stewart, and Roberts (2021) Chapter 11 for more on this process.↩︎

Federalist Paper Authorship

Get The Data

Multinomial Model

Vector Space Model

Validation, Validation, Validation

Practice Problems

References