Modeling the bag of words.
A great entry point for modeling text as data is the now classic problem of guessing who wrote 15 disputed Federalist Papers (Mosteller and Wallace 1964). To do so, we’ll look at the distribution of words that Hamilton, Madison, and Jay tend to write and try to detect their stylistic patterns in the disputed texts. We’ll use two approaches, outlined in Grimmer, Stewart, and Roberts (2021) Chapters 6 and 7: the multinomial model and the vector space model.
We’ll start by scraping the Federalist Paper corpus from Project Gutenberg.
library(tidyverse)
library(tidytext)
library(rvest)
# read the raw HTML
page <- read_html('https://www.gutenberg.org/cache/epub/18/pg18-images.html')
# get all the chapters
chapters <- html_elements(page, '.chapter')
# get just the text from each chapter
text <- html_text2(chapters)
d <- tibble(text)
# get rid of the slightly different version of Federalist 70
d <- d |>
filter(str_detect(text, 'slightly different version', negate = TRUE))
# create a column for the title and attributed author
d <- d |>
mutate(author = text |>
str_extract('HAMILTON AND MADISON|HAMILTON OR MADISON|HAMILTON|MADISON|JAY') |>
str_to_title(),
title = str_extract(text, 'No. [A-Z].*'))
To motivate this model, suppose that each author owns a literal bag of words. When they are writing, they randomly draw each word from the bag, replacing it when they’re done writing it down. Evidently, Alexander Hamilton’s word bag has a lot of the word “upon”.
tidy_federalist <- d |>
# tokenize to the word level
unnest_tokens(input = 'text',
output = 'word')
# keep only a selection of stop words
interesting_words <- c('although', 'always',
'commonly', 'consequently',
'considerable', 'heretofore',
'upon', 'whilst')
tidy_federalist <- filter(tidy_federalist,
word %in% interesting_words)
library(wordcloud2)
tidy_federalist |>
filter(author == 'Hamilton') |>
count(word) |>
wordcloud2()
By comparison, James Madison’s bag of words has relatively more “whilst”.
# A tibble: 7 × 2
word n
<chr> <int>
1 although 8
2 always 7
3 consequently 13
4 considerable 5
5 heretofore 1
6 upon 7
7 whilst 12
Obviously this model leaves out a lot of detail about how authors actually write. It ignores word order, syntax, meaning, context, and intent. But for certain questions, such a simplified model may nevertheless prove useful. Let’s consider Federalist Paper No. 18, counting the frequency of “upon”, “whilst”, and the other interesting stop words listed above.1
# A tibble: 4 × 2
word n
<chr> <int>
1 always 1
2 considerable 1
3 upon 1
4 whilst 1
What’s the likelihood that this set of word counts would have been generated by random draws from Hamilton’s bag of words, compared to Madison’s or Jay’s? (What’s the chance that Hamilton would have written a paper with so few “upon”s?) To estimate, we’ll first compute the word vectors for each author and for the disputed paper.
# get the frequencies of the interesting words in each author's corpus
bags_of_words <- tidy_federalist |>
filter(author %in% c('Hamilton', 'Madison', 'Jay')) |>
# convert these to factors so count() doesn't drop the zero counts
mutate(author = factor(author),
word = factor(word)) |>
count(author, word, .drop = FALSE) |>
# sort the words in alphabetical order
arrange(author, word)
# pull the vectors for Hamilton, Madison, and Jay
hamilton_vector <- bags_of_words |>
filter(author == 'Hamilton') |>
pull(n) |>
# add names to make the vector more readable
set_names(interesting_words)
hamilton_vector
although always commonly consequently considerable
1 62 23 4 46
heretofore upon whilst
13 374 1
madison_vector <- bags_of_words |>
filter(author == 'Madison') |>
pull(n) |>
# add names to make the vector more readable
set_names(interesting_words)
madison_vector
although always commonly consequently considerable
8 7 0 13 5
heretofore upon whilst
1 7 12
jay_vector <- bags_of_words |>
filter(author == 'Jay') |>
pull(n) |>
# add names to make the vector more readable
set_names(interesting_words)
jay_vector
although always commonly consequently considerable
5 8 1 4 1
heretofore upon whilst
1 1 0
# now get the vector for Federalist No. 18
fed18_vector <- tidy_federalist |>
mutate(word = factor(word)) |>
filter(title == 'No. XVIII.') |>
count(word, .drop = FALSE) |>
pull(n) |>
# add names to make the vector more readable
set_names(interesting_words)
fed18_vector
although always commonly consequently considerable
0 1 0 0 1
heretofore upon whilst
0 1 1
Next we’ll use the dmultinom()
function to estimate the likelihood that the fed18_vector
would have been drawn from each of the authors’ bags of words.
dmultinom(x = fed18_vector,
prob = hamilton_vector)
[1] 0.0003395527
dmultinom(x = fed18_vector,
prob = madison_vector)
[1] 0.008942421
dmultinom(x = fed18_vector,
prob = jay_vector)
[1] 0
The computation above makes a very strong assumption about Jay: because he never uses the word “whilst” in any of his Federalist papers, we assume that he would never ever use the word whilst in another paper. We can do better by regularizing our estimates, adding a small positive number to each vector (Laplace smoothing) to encode the possibility that Jay might someday use the word “whilst”, even if we’ve never seen him do it.
hamilton_likelihood <- dmultinom(x = fed18_vector,
prob = hamilton_vector + 0.1)
madison_likelihood <- dmultinom(x = fed18_vector,
prob = madison_vector + 0.1)
jay_likelihood <- dmultinom(x = fed18_vector,
prob = jay_vector + 0.1)
# likelihood ratios
madison_likelihood / hamilton_likelihood
[1] 23.90813
madison_likelihood / jay_likelihood
[1] 85.56551
Since this paper is roughly 24 times more likely to have been generated from Madison’s bag of words over Hamilton’s (and roughly 86 times more likely than Jay’s), we can conclude with some degree of confidence that he was the author.
Another way to model the bag of words is to think of each set of word counts as a multidimensional vector. The angle between two such vectors gives us a sense of the two documents’ similarity to one another. If the angle is zero, then both documents have the exact same mix of words (though maybe one document is longer than the other). If the angle is 90 degrees, then the two documents are orthogonal – as different a mix of words as they possibly could be.
Cosine similarity captures this idea, because the cosine of 90 degrees is zero, and the cosine of 0 degrees is 1. Let’s compute the cosine similarity between Madison, Hamilton, and Jay’s known writings with the term vector from the disputed Federalist No. 18.
# define cosine similarity
cosine_similarity <- function(x1, x2){
sum(x1*x2) / sqrt(sum(x1^2)) / sqrt(sum(x2^2))
}
# get the cosine similarity for Federalist 18 with all the authors
cosine_similarity(fed18_vector, hamilton_vector)
[1] 0.630843
cosine_similarity(fed18_vector, madison_vector)
[1] 0.6924889
cosine_similarity(fed18_vector, jay_vector)
[1] 0.4789131
This yields a similar result to what we found with the multinomial model. The cosine similarity between the disputed paper and Madison’s other papers is largest, suggesting he is the author.
Whenever you develop a method to measure or predict some quantity of interest, it is imperative that you first assess its performance against a known benchmark before applying it to new data. This process is called validation, and Grimmer, Stewart, and Roberts (2021) repeatedly emphasize how central it is to the text-as-data workflow. So the procedure we just created predicts that Madison wrote Federalist No. 18. How much should we trust that prediction?
To gain confidence in our approach, let’s see what it predicts for papers where authorship is not in dispute. If it correctly predicts the authors of the known texts, then we can be more certain it’s doing a good job with the unknown texts.
For the validation test, we’ll repeat the same steps as before, except we hold out the validation set when training the model. If we’re trying to predict whether Madison wrote Federalist 10, then it would be cheating to include Federalist 10 in the vector of things we know about Madison’s writing style.
# get the Federalist 10 word count vector
fed10_vector <- tidy_federalist |>
mutate(word = factor(word)) |>
filter(title == 'No. X.') |>
count(word, .drop = FALSE) |>
pull(n) |>
# add names to make the vector more readable
set_names(interesting_words)
fed10_vector
although always commonly consequently considerable
0 2 0 1 0
heretofore upon whilst
0 0 0
# Recompute Madison's word count vector, omitting Federalist 10
madison_vector <- tidy_federalist |>
mutate(word = factor(word)) |>
filter(title != 'No. X.',
author == 'Madison') |>
count(word, .drop = FALSE) |>
pull(n) |>
set_names(interesting_words)
madison_vector
although always commonly consequently considerable
8 5 0 12 5
heretofore upon whilst
1 7 12
dmultinom(x = fed10_vector,
prob = hamilton_vector)
[1] 0.0003206053
dmultinom(x = fed10_vector,
prob = madison_vector)
[1] 0.0072
dmultinom(x = fed10_vector,
prob = jay_vector)
[1] 0.08292841
Already we can see a problem. The words we chose before may do a good job distinguishing between Hamilton and Madison (it’s inconceivable that Hamilton would have written an entire paper without a single use of the word “upon”, so the model correctly rules him out), but it struggles with distinguishing Jay’s style from Madison’s style, especially for a paper that only contains 3 of the chosen stop words. Because the word “always” appears the most frequently, the model incorrectly predicts that Jay wrote Federalist No. 10, not Madison.
Let’s see if we can find a set of discriminating words that do a better job distinguishing between all three authors.2 To do so, I’ll start with the list of stop words from Mosteller and Wallace (1964) and add in the list of stop words from the tidytext
package.
mw1964_words <- c("a", "all", "also", "an", "and", "any", "are", "as", "at", "be", "been", "but", "by", "can", "do", "down",
"even", "every", "for", "from", "had", "has", "have", "her", "his", "if", "in", "into", "is", "it", "its",
"may", "more", "must", "my", "no", "not", "now", "of", "on", "one", "only", "or", "our", "shall", "should",
"so", "some", "such", "than", "that", "the", "their", "then", "there", "things", "this", "to", "up", "upon",
"was", "were", "what", "when", "which", "who", "will", "with", "would", "your")
all_stopwords <- union(mw1964_words, get_stopwords()$word)
This yields a list of 186 potential words. Next, we’ll count up how frequently each author used each word, keeping only the words that one author used three times more frequently than another auhor (the number three is arbitrary, but as we’ll see it yields a manageable list of distinctive words).
tidy_federalist <- d |>
# tokenize to the word level
unnest_tokens(input = 'text',
output = 'word') |>
filter(word %in% all_stopwords) |>
mutate(author = factor(author),
word = factor(word))
frequency_table <- tidy_federalist |>
filter(author %in% c('Hamilton', 'Jay', 'Madison')) |>
count(author, word, .drop = FALSE) |>
pivot_wider(names_from = 'author',
values_from = 'n') |>
# normalize by each author's number of words
mutate(Hamilton = Hamilton / sum(Hamilton),
Jay = Jay / sum(Jay),
Madison = Madison / sum(Madison)) |>
mutate(Hamilton_Jay = Hamilton / Jay,
Hamilton_Madison = Hamilton / Madison,
Jay_Madison = Jay / Madison) |>
# just keep the words that one author uses 3 times more often than another
filter(Hamilton_Jay > 3 |
Hamilton_Jay < 0.33333 |
Hamilton_Madison > 3 |
Hamilton_Madison < 0.33333 |
Jay_Madison > 3 |
Jay_Madison < 0.33333)
This leaves us with 38 words to include in our multinomial model / document vectors.
interesting_words <- factor(frequency_table$word)
interesting_words
[1] about above again also before below
[7] did down during every further having
[13] here hers herself himself his how
[19] itself me my myself off ought
[25] our ours ourselves over she theirs
[31] there through up upon when where
[37] while whom yourselves
39 Levels: about above again also before below did down ... yourselves
Let’s see how the modified list performs on the validation test.
# keep only the new list of words
tidy_federalist <-
tidy_federalist |>
filter(word %in% interesting_words) |>
mutate(word = factor(word))
# compute the new Federalist 10 vector
fed10_vector <- tidy_federalist |>
filter(title == 'No. X.') |>
count(word, .drop = FALSE) |>
pull(n) |>
# add names to make the vector more readable
set_names(interesting_words)
fed10_vector
about above again also before below
0 0 2 0 0 0
did down during every further having
0 0 0 3 0 1
here hers herself himself his how
1 0 0 1 8 0
itself me my myself off ought
2 1 0 0 0 2
our ours ourselves over she theirs
8 0 0 6 0 0
there through up upon when where
6 2 1 0 2 2
while whom yourselves
0 0 0
# get word count vectors for each author
madison_vector <- tidy_federalist |>
filter(title != 'No. X.',
author == 'Madison') |>
count(word, .drop = FALSE) |>
pull(n) |>
set_names(interesting_words)
madison_vector
about above again also before below
1 2 11 31 13 2
did down during every further having
3 3 11 91 18 7
here hers herself himself his how
23 1 2 6 49 13
itself me my myself off ought
27 3 4 4 1 59
our ours ourselves over she theirs
41 0 3 37 3 2
there through up upon when where
29 6 2 7 20 35
while whom yourselves
0 9 0
hamilton_vector <- tidy_federalist |>
filter(author == 'Hamilton') |>
count(word, .drop = FALSE) |>
pull(n) |>
set_names(interesting_words)
hamilton_vector
about above again also before below
20 5 5 36 31 2
did down during every further having
12 14 17 179 33 32
here hers herself himself his how
33 0 5 34 239 53
itself me my myself off ought
81 15 39 15 3 159
our ours ourselves over she theirs
191 7 21 65 16 1
there through up upon when where
379 27 34 374 110 86
while whom yourselves
36 34 0
jay_vector <- tidy_federalist |>
filter(author == 'Jay') |>
count(word, .drop = FALSE) |>
pull(n) |>
set_names(interesting_words)
jay_vector
about above again also before below
0 2 1 11 0 0
did down during every further having
6 0 0 5 0 5
here hers herself himself his how
1 0 0 0 5 9
itself me my myself off ought
1 3 3 0 1 2
our ours ourselves over she theirs
38 0 3 2 4 1
there through up upon when where
10 0 0 1 15 2
while whom yourselves
2 10 1
dmultinom(x = fed10_vector,
prob = madison_vector + 0.1) /
dmultinom(x = fed10_vector,
prob = hamilton_vector + 0.1)
[1] 76.72475
dmultinom(x = fed10_vector,
prob = madison_vector + 0.1) /
dmultinom(x = fed10_vector,
prob = jay_vector + 0.1)
[1] 1.164301e+12
cosine_similarity(madison_vector, fed10_vector)
[1] 0.7437931
cosine_similarity(hamilton_vector, fed10_vector)
[1] 0.7210126
cosine_similarity(jay_vector, fed10_vector)
[1] 0.6583766
This is much nicer. Both methods now correctly identify Madison as the author of Federalist 10. A more thorough model-development loop might iterate through validation and refinement a few more times, holding out a different validation set each time. But this is the basic workflow.