Clustering

For when we don’t really know what we’re looking for in our data and just want the computer to tell us what it sees.

Broadly speaking, we can divide the approaches for modeling text data into two camps: supervised learning and unsupervised learning. Supervised learning approaches tend to be the most familiar to social scientists – there is some outcome we’d like to predict, so we fit a function of observable covariates to try to predict it. In the context of text as data, this means we have a set of labeled documents, and we fit a model to see how well we can predict the labels (e.g. predicting the authorship of the Federalist Papers).

Unsupervised learning, by comparison, is less about prediction and more about discovery. You start with a set of unlabeled documents, and ask the computer to see if it can find a sensible way to organize them. Are there patterns of language that distinguish one set of documents from others? What words can help identify a cluster of documents, by appearing within them more frequently than one would expect by chance? These sorts of approaches, which include both clustering and topic models, require a healthy dose of human judgment to derive meaningful insights, and they often serve as the first stage of a research agenda that moves from discovery to explanation, prediction, and inference.

K-means Clustering

Chapter 12 of Grimmer, Stewart, and Roberts (2021) introduces the dataset of Congressional press releases that Grimmer (2013) explores in his study of representational style. Using a k-means clustering model, he developed a set of categories to describe the ways that members of Congress communicate with their constituents, discovering categories that were previously understudied by political scientists. The full dataset is available here, and I’ve included the press releases from Senator Lautenberg on the course repository. Let’s load and tidy the data, representing each press release as a bag of word stems.

library(tidyverse)
library(tidytext)
library(SnowballC)

load('data/press-releases/lautenberg-press-releases.RData')

tidy_press_releases <- df |>
  # remove the preamble common to each press release
  mutate(text = str_replace_all(text,
                                pattern = '     Senator Frank R  Lautenberg                                                                                                                      Press Release        of        Senator Lautenberg                                                                                ',
                                replacement = '')) |>
  # tokenize to the word level
  unnest_tokens(input = 'text',
                output = 'word') |>
  # remove stop words
  anti_join(get_stopwords()) |>
  # remove numerals
  filter(str_detect(word, '[0-9]', negate = TRUE)) |>
  # create word stems
  mutate(word_stem = wordStem(word)) |>
  filter(word_stem != '') |> 
  # count up bag of word stems
  count(id, word_stem) |> 
  # compute term frequency
  bind_tf_idf(term = 'word_stem',
              document = 'id',
              n = 'n') |>
  filter(!is.na(tf_idf))

tidy_press_releases
# A tibble: 81,504 × 6
      id word_stem     n      tf   idf  tf_idf
   <int> <chr>     <int>   <dbl> <dbl>   <dbl>
 1     1 account       2 0.00654 1.86  0.0121 
 2     1 also          2 0.00654 0.891 0.00582
 3     1 america       2 0.00654 1.80  0.0118 
 4     1 american      1 0.00327 1.05  0.00344
 5     1 answer        1 0.00327 3.69  0.0120 
 6     1 apologi       1 0.00327 5.63  0.0184 
 7     1 appropri      1 0.00327 1.60  0.00522
 8     1 april         2 0.00654 2.18  0.0143 
 9     1 ask           1 0.00327 2.13  0.00698
10     1 assault       2 0.00654 3.93  0.0257 
# ℹ 81,494 more rows

Next, we’ll convert that tidy dataframe into a document-term matrix.

# create document-term matrix
lautenberg_dtm <- cast_dtm(data = tidy_press_releases,
                           document = 'id',
                           term = 'word_stem',
                           value = 'tf')
lautenberg_dtm
<<DocumentTermMatrix (documents: 558, terms: 7073)>>
Non-/sparse entries: 81504/3865230
Sparsity           : 98%
Maximal term length: 33
Weighting          : term frequency (tf)

The k-means clustering algorithm searches for a set of \(k\) centroids that yield the smallest sum of squared distances between the observations and their nearest centroid. If each document is represented by a vector of term frequencies, then k-means produces \(k\) sets of documents that have the most similar usages of words.1
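To build intuition before running it on real data, here is a minimal from-scratch sketch of the algorithm (Lloyd's iterations) on simulated two-dimensional data; the variable names are my own, purely illustrative:

```r
# a bare-bones k-means (Lloyd's algorithm) on simulated 2-D data
set.seed(123)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

k <- 2
centroids <- toy[sample(nrow(toy), k), ]  # random initial centroids

for(iter in 1:20){
  # assignment step: each point goes to its nearest centroid
  d <- as.matrix(dist(rbind(centroids, toy)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  # update step: each centroid moves to the mean of its assigned points
  # (a real implementation would also guard against empty clusters)
  centroids <- apply(toy, 2, function(col) tapply(col, assignment, mean))
}

table(assignment)  # should split the two simulated blobs roughly in half
```

Each pass alternates an assignment step and an update step until the clusters stop changing; kmeans() does the same thing, but with better initialization and multiple random restarts (the nstart argument).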

set.seed(42)

km <- kmeans(x = lautenberg_dtm,
             centers = 4,
             nstart = 100)

table(km$cluster)

  1   2   3   4 
158  53 264  83 

Making sense of this algorithm’s output is tricky. Sure, we simplified the problem a bit. We started with 558 documents, each represented by a 7,073-dimensional vector. Now we have 4 document clusters, each represented by a 7,073-dimensional vector.

So…what do we do with those?

One of the most common ways to interpret the k-means clusters is to generate a list of the most distinctive words from each cluster. Then we can look at which words show up more frequently in one cluster than any other, and use that information to assign labels to the clusters.

# function to find the words that are most overrepresented in the cluster mean for a given cluster
get_top_words <- function(centers, cluster_of_interest, n = 10){
  (centers[cluster_of_interest,] - colMeans(centers[-cluster_of_interest,])) |>
    sort(decreasing = TRUE) |>
    head(n)
}
get_top_words(km$centers, 1)
        new      jersei    menendez        fund     project 
0.020521488 0.017404097 0.007306990 0.006406194 0.004548713 
    million       feder          nj     program         sen 
0.003967707 0.003579091 0.003422422 0.003390473 0.003174855 

It looks like that first cluster contains words related to New Jersey-specific projects. Maybe we’ll call this category “credit claiming”.

get_top_words(km$centers, 2)
      secur      chemic    homeland        port     protect 
0.047101748 0.017238562 0.012628968 0.009966087 0.005874105 
       risk         law          dh        bill  lautenberg 
0.005328492 0.005129255 0.004088366 0.003876055 0.003604233 

The second cluster contains words related to security.

get_top_words(km$centers, 3)
     legisl         sen    american        bill         epa 
0.003556391 0.002868324 0.002606283 0.002589137 0.002394477 
        act      famili      victim        year      amtrak 
0.002078147 0.001866299 0.001717110 0.001707263 0.001613547 

The third cluster has words related to various pieces of legislation and Senate business.

get_top_words(km$centers, 4)
     presid        bush   statement       senat     comment 
0.019325460 0.015638002 0.007088761 0.004435791 0.004353666 
       unit       elect      follow        issu         tax 
0.004192093 0.004036246 0.003940064 0.003897682 0.003500055 

And the final cluster looks like the “partisan taunting” category discussed in the book.

Validation, Validation, Validation

To validate our manually assigned cluster labels, we want to go back to the text and check whether they do a good job summarizing the documents. If not, we should modify our cluster labels or try a different value for \(k\).

cluster_assignments <- tibble(id = km$cluster |> 
                                names() |> 
                                as.numeric(),
                              cluster = km$cluster)

df <- df |>
  left_join(cluster_assignments,
            by = 'id')

If we pull a random document from Cluster 1, it should be related to New Jersey in some way.

print_text <- function(text){
  cat(str_wrap(text), sep = '\n')
}

df |>
  filter(cluster == 1) |>
  slice_sample(n = 1) |>
  pull(text) |> 
  print_text()
Senator Frank R Lautenberg Press Release of Senator Lautenberg Lautenberg
Menendez Announce 21 Million for Improvements in Screening Areas at Newark
Liberty Airport Contact Alex Formuzis 202 224 7340 Thursday August 3 2006
WASHINGTON D C Air travelers who depart from Newark Liberty International
Airport will benefit from improvements in the security screening area thanks
to almost 21 million in federal grants announced today by U S Senators Frank
Lautenberg D NJ and Robert Menendez D NJ The grants will be used to widen
terminal connecting areas creating more space for passengers waiting to
pass through security checkpoints These improvements will make it even more
convenient to fly from Newark Liberty Airport said Senator Lautenberg By
improving the airport we protect our economy and our quality of life Newark
Liberty International Airport is a huge hub of activity for New Jersey and
the nation It provides jobs for New Jerseyans acts as a means for American and
international businessmen and women to work throughout the region and allows
visitors to come enjoy our great state said Senator Menendez Senator Lautenberg
and I fought tirelessly for these funds to improve the safety and efficiency of
this thriving center of travel The funds were awarded by the Federal Aviation
Administration to the Port Authority of New York and New Jersey which operates
the airport Questions or Comments

If we pull a random document from Cluster 2, it should be about security.

df |>
  filter(cluster == 2) |>
  slice_sample(n = 1) |>
  pull(text) |> 
  print_text()
Senator Frank R Lautenberg Press Release of Senator Lautenberg Lautenberg
Pallone Menendez Blast Proposal to Block New Jersey s Chemical Security
Regulations Contact Alex Formuzis 202 224 7340 Thursday February 8 2007
WASHINGTON D C U S Sens Frank R Lautenberg D NJ and Robert Menendez D NJ and U S
Rep Frank Pallone Jr D NJ today blasted the U S Department of Homeland Security
DHS for proposing a federal rule that would preempt New Jersey s existing
chemical security regulations The three lawmakers submitted extensive comments
as part of the public comment period for a DHS proposed regulation to develop
temporary federal regulations to help secure chemical facilities A copy of that
letter is attached The regulation was developed in response to a legislative
provision in the Fiscal Year 2007 Homeland Security Appropriations Act passed
in 2006 Although that provision did not give the Department the right to preempt
state or local laws on the subject the Department s recent proposal assumed
such authority As representatives of the citizens of New Jersey we simply cannot
accept a proposed regulatory scheme that requires our constituents to rely
upon the best efforts of private companies and this Administration to ensure
their safety from terrorist attacks on chemical facilities in their communities
the three New Jersey lawmakers wrote In 2005 New Jersey implemented Chemical
Security Sector Best Practices requiring all chemical facilities in the state to
comply with security standards conduct an assessment of their vulnerability to
terrorist attacks develop prevention preparedness and response plans to minimize
such attacks and review whether it would be practical to use safer materials
or processes New Jersey took steps to improve its security after 9 11 and it
may need to take additional steps in the future Lautenberg Pallone and Menendez
continued in their public comment letter We strongly oppose any efforts by
DHS and the rest of this Administration to prevent it from doing so Lautenberg
and Pallone have introduced comprehensive chemical security bills in previous
Congresses and announced their intention to do so during this Congress Questions
or Comments

If we pull a random document from Cluster 3, it should be about legislation and/or Senate business.

df |>
  filter(cluster == 3) |>
  slice_sample(n = 1) |>
  pull(text) |> 
  print_text()
Senator Frank R Lautenberg Press Release of Senator Lautenberg Lautenberg
Specter Introduce Bill To Give Justice To Victims of State Sponsored Terrorism
Measure Would Empower Victims To Pursue Assets of Countries Like Iran That
Sponsor Terror Contact Press Office 202 224 3224 Thursday August 2 2007
WASHINGTON D C Sen Frank R Lautenberg D NJ and Sen Arlen Specter R PA today
led a strong bipartisan coalition of Senators introducing legislation to give
victims of state sponsored terrorism their day in court Far too many Americans
have suffered at the hands of terrorism My bill would allow victims of state
sponsored terror to have their day in court It would let victims sue countries
and hold those countries accountable said Sen Lautenberg I am pleased to
cosponsor this legislation which gives the victims of terrorism and their
families the ability to seek legal redress said Sen Specter This bill reaffirms
that the United States will not tolerate state sponsored terrorism The bill
would allow victims of state sponsored terror to sue countries that promote
terrorism The measure would allow victims to seize hidden commercial assets for
compensation This legislation is important to the families of the victims of the
1983 Marine Barracks bombing in Beirut Lebanon It will hold the government of
Iran accountable for the murder of 241 men in this bombing one of whom was my
brother Captain Vincent L Smith United States Marine Corps The injustice of this
over the long years has been a heavy burden the Iranian government has literally
been getting away with murder for almost 24 years The passage of this bill will
bring justice by holding the criminals accountable for their crime And I believe
it will mitigate future terrorism This bill is a huge statement of support for
victims of terrorism and a powerful way to fight terrorism without the use of
military force said Lynn Derbyshire who serves as the national spokesperson for
The Beirut Families The legislation the Justice for Victims of State Sponsored
Terrorism Act is based on a 1996 amendment to the Foreign Sovereign Immunities
Act known as the Flatow Amendment which enabled American victims of terrorism to
go after state sponsors of terrorism in court The billwould reaffirm the rights
of plaintiffs to sue state sponsors of terrorism allow the seizure of hidden
commercial assets belonging to terrorist states so victims of terrorism can be
justly compensated limit the number of appeals that a terrorist state can pursue
in U S courts and provide foreign nationals working for the U S government these
same benefits if they are victimized in a terrorist attack during their official
duties The measure has an impressive bipartisan list of original cosponsors
including Senators Robert Menendez D NJ Trent Lott R MS Joseph Biden D DE John
Cornyn R TX Hillary Clinton D NY Lindsey Graham R SC Diane Feinstein D CA Joseph
Lieberman I CT Charles Schumer D NY Norm Coleman R MN Robert Casey D PA Susan
Collins R ME and Ted Stevens R AK Questions or Comments

And a random document from Cluster 4 should contain some form of “partisan taunting”.

df |>
  filter(cluster == 4) |>
  slice_sample(n = 1) |>
  pull(text) |> 
  print_text()
Senator Frank R Lautenberg Press Release of Senator Lautenberg Lautenberg
Outraged Over Bush Admin Decision to Let Libya Off the Hook for Pan Am 103
Terrorist Bombing Contact Alex Formuzis 202 224 7340 Wednesday June 28 2006
WASHINGTON D C United States Senator Frank R Lautenberg D NJ who has led the
fight on behalf of the families of victims of Pan Am 103 today issued the
following statement in response to the Bush Administration s decision to let
the Libyan government off the hook Today the Bush Administration put other
interests ahead of American victims of terrorism I am very disappointed that
the Administration chose to renew its relationship with Qadhafi before making
sure he fulfilled his promises to American victims of his terror said Senator
Lautenberg Under the original agreement between the Libyan government and the
families of the victims of Pan Am 103 each family was to receive 10 million from
the Libyan government to be paid out in three installments 4 million when the U
N lifted its sanctions 4 million when the U S lifted its trade sanctions and the
final 2 million when Libya was taken off the U S terrorist list which officially
happens today On May 15 Secretary of State Condoleezza Rice announced that the
administration would renew diplomatic relations with Libya at which point a
45 day review began That review period ends today and Libya will be formally
removed from the U S State Department s list of state sponsors of terrorism
The Libyan Government and the Bush administration appear to have agreed that
the final payment does not need to be paid to the families Earlier this month
the Senate approved a Resolution by Senators Lautenberg and Lindsey Graham R SC
urging the Bush administration not to establish diplomatic relations with Libya
until it fulfills its responsibilities to the families of the Pan Am 103 victims
In August 2003 the Libyan government took responsibility for the bombing of
Pan Am flight 103 over Lockerbie Scotland on December 21st 1988 that killed 270
people Of the 189 Americans who died as a result of the bombing 38 were from New
Jersey Questions or Comments

Clustering with Word Embeddings

In that last clustering exercise, we represented each document as a vector of word counts. This can lead us astray when the documents are very short but our vocabulary is very large. For example, Senator Lautenberg has a number of press releases that I would classify as being about the environment, but the bag of words representation has no idea that words like “preservation”, “mercury”, and “rivers” might belong in the same category.

When documents are as brief as these press releases, one could potentially represent them with the average word embedding of the words in each document. The k-means clustering algorithm then works in the same way, except that it is assigning clusters within a vector space corresponding to the meaning of the documents (in a sense) rather than a vector space that’s just counting up the frequency of words.

Let’s give it a try, shall we? First, load the pretrained GloVe word embeddings (the glove.6B vectors, read here from a local file).

# get the glove embedding vectors
glove <- read_table("data/glove.6B/glove.6B.100d.txt", 
                    col_names = FALSE)

# convert to a matrix (it will make the computation easier)
vocab <- glove$X1

glove <- glove |>
  select(X2:X101) |>
  as.matrix()

rownames(glove) <- vocab

glove[100:120,1:3]
               X2        X3        X4
u.s.     0.323960  0.598100  1.237800
so      -0.395510  0.546600  0.503150
them    -0.101310  0.109410  0.240650
what    -0.151800  0.384090  0.893400
him      0.042409 -0.521950  0.403890
united   0.217330  0.561160  0.630620
during  -0.278910 -0.229740 -0.474540
before   0.362810 -0.185590  0.461190
may      0.082528 -0.075290  0.014696
since    0.422610  0.309450  0.218540
many    -0.329140  0.828870 -0.141820
while    0.094157  0.464570  0.453500
where    0.051044  0.598240  0.311950
states   0.138150  0.451660  0.938580
because  0.067634  0.415950  0.584510
now     -0.014495  0.591070  0.704690
city     0.265720  0.034857  0.490550
made    -0.198200 -0.284050  0.145840
like    -0.268700  0.817080  0.698960
between  0.082441 -0.040760  0.525170
did      0.304490 -0.196280  0.202250
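Before using these vectors, it’s worth a quick sanity check that the embedding space captures meaning. Cosine similarity between word vectors is the standard check; this small helper function is my own, not from any package we’ve loaded:

```r
# cosine similarity between two word vectors in the glove matrix
cosine_sim <- function(word1, word2, embeddings = glove){
  v1 <- embeddings[word1, ]
  v2 <- embeddings[word2, ]
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}

# related words should score noticeably higher than unrelated ones
cosine_sim('river', 'water')
cosine_sim('river', 'medicare')
```

If the embeddings loaded correctly, the first pair should score well above the second.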

Next, we’ll tokenize the press releases, removing stop words and keeping only the words available in the GloVe lexicon.

# get all the press releases and count up the words for each
tidy_press_releases <- df |>
  # get rid of the earlier cluster assignments
  select(-cluster, -date) |>
  mutate(text = str_replace_all(text,
                                pattern = '     Senator Frank R  Lautenberg                                                                                                                      Press Release        of        Senator Lautenberg                                                                                ',
                                replacement = '')) |>
  # tokenize to the word level
  unnest_tokens(input = 'text',
                output = 'word') |>
  # remove stop words
  anti_join(get_stopwords()) |>
  # remove numerals
  filter(str_detect(word, '[0-9]', negate = TRUE)) |>
  # remove the words that aren't in the glove lexicon
  filter(word %in% vocab)

Next, compute the average word embedding for each document. I’m going to do it with a for loop, because I can’t think of a more elegant solution.

# create an empty matrix called document embeddings
document_embeddings <- matrix(nrow = nrow(df),
                              ncol = 100)

for(i in 1:nrow(df)){

  list_of_words <- tidy_press_releases |>
    filter(id == i) |>
    pull(word)

  document_embeddings[i,] <- colMeans(glove[list_of_words,])

}

glimpse(document_embeddings)
 num [1:558, 1:100] 0.0489 0.0433 0.0581 -0.0696 -0.0526 ...
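As an aside, the loop can be replaced by a single matrix product: cast the tidy tokens to a sparse document-word count matrix, multiply by the embedding matrix, and divide each row by its word count. A sketch, assuming every remaining token is in the GloVe vocabulary (we filtered on that above):

```r
# vectorized alternative to the for loop
word_counts <- tidy_press_releases |>
  count(id, word) |>
  cast_sparse(row = id, column = word, value = n)

# (documents x words) %*% (words x 100), then divide each row by its word count
document_embeddings_alt <- as.matrix(word_counts %*% glove[colnames(word_counts), ]) /
  rowSums(word_counts)
```

Note that the rows of this matrix are indexed by document id (as character rownames), so the ordering may differ from the loop’s result.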

Now we have a matrix that represents each press release as a 100-dimensional vector. Using k-means, we will identify 6 clusters that best describe those document vectors.

km <- kmeans(x = document_embeddings,
             centers = 6,
             nstart = 100)

# merge the cluster assignments back with the documents
cluster_assignments <- tibble(id = 1:558,
                              cluster = km$cluster)

df <- df |>
  # remove the earlier cluster assignment from bag of words
  select(-cluster) |>
  left_join(cluster_assignments,
            by = 'id')

table(df$cluster)

  1   2   3   4   5   6 
 74 101  50  81 158  94 

You’ll note, if you’re playing along at home, that this k-means call ran much faster than when we estimated k-means on the 7,073-dimensional bag-of-words vectors. Let’s look at the most over-represented words in each cluster.

# a function to get the most over-represented words by cluster
get_top_words <- function(tidy_press_releases, cluster_of_interest){
  tidy_press_releases |>
    count(id, word) |>
    left_join(cluster_assignments, by = 'id') |>
    mutate(in_cluster = if_else(cluster == cluster_of_interest,
                                'within_cluster', 'outside_cluster')) |>
    # count the words in each cluster
    group_by(in_cluster, word) |>
    summarize(n = sum(n)) |>
    pivot_wider(names_from = 'in_cluster',
                values_from = 'n',
                values_fill = 0) |>
    # compute word shares
    mutate(within_cluster = within_cluster / sum(within_cluster),
           outside_cluster = outside_cluster / sum(outside_cluster)) |>
    mutate(delta = within_cluster - outside_cluster) |>
    arrange(-delta) |>
    head(10) |>
    pull(word)
}

get_top_words(tidy_press_releases, 1)
 [1] "new"      "jersey"   "river"    "million"  "county"   "project" 
 [7] "menendez" "funds"    "center"   "projects"
get_top_words(tidy_press_releases, 2)
 [1] "epa"           "facilities"    "environmental" "waste"        
 [5] "buildings"     "water"         "site"          "chemicals"    
 [9] "global"        "act"          
get_top_words(tidy_press_releases, 3)
 [1] "judge"    "court"    "honor"    "nfl"      "iraq"     "senator" 
 [7] "alito"    "national" "de"       "memorial"
get_top_words(tidy_press_releases, 4)
 [1] "new"            "rail"           "amtrak"        
 [4] "security"       "transportation" "passenger"     
 [7] "jersey"         "airport"        "faa"           
[10] "port"          
get_top_words(tidy_press_releases, 5)
 [1] "s"              "president"      "bush"          
 [4] "security"       "terrorism"      "administration"
 [7] "victims"        "u"              "lautenberg"    
[10] "d"             
get_top_words(tidy_press_releases, 6)
 [1] "health"    "children"  "care"      "programs"  "medicare" 
 [6] "education" "military"  "drug"      "coverage"  "budget"   

These seem like pretty cohesive categories! And now there’s a cluster for press releases about the environment, consistent with my earlier hunch.

Practice Problems

  1. Try changing the value of \(k\) and see if the cluster assignments seem to improve. Look for cluster labels that are exclusive (topics aren’t overlapping; words that are supposed to distinguish one cluster don’t frequently appear in other clusters) and cohesive (it’s easy to identify a unique topic for each cluster).

  2. When we estimated k-means with document embeddings, the top words in one cluster included ‘judge’, ‘court’, and ‘alito’, which seems pretty cohesive. But they also included ‘nfl’ and ‘iraq’, which seem out-of-place. Investigate some of the documents included in this cluster. Is there something linking them together, or is this just a poor fit?
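For the first problem, a common heuristic for choosing \(k\) is an elbow plot: re-run k-means over a range of \(k\), plot the total within-cluster sum of squares, and look for the point where the curve flattens. A sketch using the document embeddings from above:

```r
# elbow plot: total within-cluster sum of squares across values of k
set.seed(42)
elbow <- tibble(k = 2:10) |>
  mutate(withinss = map_dbl(k, function(k){
    kmeans(document_embeddings, centers = k, nstart = 25)$tot.withinss
  }))

ggplot(elbow, aes(x = k, y = withinss)) +
  geom_line() +
  geom_point() +
  labs(x = 'Number of clusters (k)',
       y = 'Total within-cluster sum of squares')
```

The within-cluster sum of squares always decreases as \(k\) grows, so we look for diminishing returns rather than a minimum, and then validate the candidate \(k\) against the text as above.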

References

Grimmer, Justin. 2013. Representational Style in Congress: What Legislators Say and Why It Matters. New York: Cambridge University Press.

Grimmer, Justin, Brandon M. Stewart, and Margaret E. Roberts. 2021. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, NJ: Princeton University Press.

  1. Allison Horst has a delightful illustrated explanation of how the algorithm works on this page.
