Word Embeddings

How to quantify what a word means.

Often, we’d like not only to count how frequently words appear, but also to get a sense of what they mean. In a bag of words representation, we treat words like “president” and “executive” as separate indices in a word count vector, implicitly assuming that their meanings are completely unrelated. But the statistical models we use to understand text data will perform better if words with similar meanings have similar representations. That is the purpose of the word embeddings approach, which represents each word as a vector and encodes the fact that “president” and “executive” have some overlapping meaning by placing their vectors close together.
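To make the contrast concrete, here is a minimal sketch with invented numbers. The one-hot vectors mimic how a bag of words indexes a three-word vocabulary. The two-dimensional `toy_embeddings` matrix is hypothetical and is not taken from any trained model.

```r
# Hypothetical three-word vocabulary
toy_vocab <- c("president", "executive", "banana")

# Bag of words: each word is its own index (a one-hot vector)
one_hot <- diag(length(toy_vocab))
rownames(one_hot) <- toy_vocab

# Made-up 2-dimensional embeddings (illustrative values only)
toy_embeddings <- rbind(
  president = c(0.9, 0.1),
  executive = c(0.8, 0.3),
  banana    = c(-0.2, 0.9)
)

# The dot product of two different one-hot vectors is always zero,
# so this representation carries no notion of similarity
bow_overlap <- sum(one_hot["president", ] * one_hot["executive", ])

# Euclidean distances between the toy embedding vectors
d_pres_exec <- sqrt(sum(
  (toy_embeddings["president", ] - toy_embeddings["executive", ])^2
))
d_pres_banana <- sqrt(sum(
  (toy_embeddings["president", ] - toy_embeddings["banana", ])^2
))
```

In this toy setup, “president” sits much closer to “executive” than to “banana”, while the one-hot vectors treat all three words as equally unrelated.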

In this exercise, we’ll work with a set of pre-trained word embeddings called GloVe (Global Vectors for Word Representation). These off-the-shelf embeddings tend to capture meaning well, even for applications specific to political science (Rodriguez and Spirling 2021).1

For expository purposes, let’s download the 100-dimensional word embeddings here. That link downloads a zipped folder called ‘glove.6B/’. Unzip it and make sure it is in your ‘data/’ folder.

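# read_table() and the select() call below come from the tidyverse (assumed installed)
library(tidyverse)
glove <- read_table("data/glove.6B/glove.6B.100d.txt", 
                    col_names = FALSE)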

Converting this dataframe to a matrix object makes mathematical operations much faster.

vocab <- glove$X1

glove <- glove |>
  select(X2:X101) |>
  as.matrix()

rownames(glove) <- vocab

It’s difficult to visualize and interpret a 100-dimensional vector space, but we can explore which words have similar meaning by looking at their cosine similarity.
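For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The symbols $x$, $y$, and $d$ (the embedding dimension, here 100) are notation introduced for this sketch:

```latex
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
           = \frac{\sum_{i=1}^{d} x_i y_i}
                  {\sqrt{\sum_{i=1}^{d} x_i^2}\,\sqrt{\sum_{i=1}^{d} y_i^2}}
```

The value ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), and it depends only on the angle between the vectors, not on their lengths.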

# define cosine similarity
cosine_similarity <- function(x1, x2){
  sum(x1*x2) / sqrt(sum(x1^2)) / sqrt(sum(x2^2))
}

cosine_similarity(glove['president',], 
                  glove['executive',])
[1] 0.6637274
cosine_similarity(glove['president',], 
                  glove['legislative',])
[1] 0.4405054
cosine_similarity(glove['president',], 
                  glove['judicial',])
[1] 0.363681

By looking at a word’s “nearest neighbors” in GloVe-space, we can get a sense of its meaning. The following function, adapted from Emil Hvitfeldt and Julia Silge, performs that computation.

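library(text2vec)  # provides sim2()

nearest_glove_neighbors <- function(word, n = 100){
  # cosine similarity between every row of glove and the query word,
  # sorted from most to least similar, keeping the top n
  sim2(x = glove,
       y = glove[word, , drop = FALSE],
       method = 'cosine',
       norm = 'l2')[,1] |>
    sort(decreasing = TRUE) |>
    head(n)
}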
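To see what this function computes without the GloVe file, here is a minimal base-R sketch on a tiny invented embedding matrix. The matrix `toy` and the helper `nearest_toy_neighbors()` are hypothetical names used for illustration. The idea is to rescale each row to unit length so that a matrix product yields cosine similarities, which is assumed to match what the `sim2()` call does with `method = 'cosine'`.

```r
# Made-up 3-dimensional embeddings for four words (illustrative values only)
toy <- rbind(
  democracy = c(1, 1, 0),
  freedom   = c(1, 0.8, 0.1),
  regime    = c(0.5, 1, 0.5),
  banana    = c(0, 0.1, 1)
)

nearest_toy_neighbors <- function(word, emb, n = 3) {
  # Rescale every row to unit (L2) length
  unit <- emb / sqrt(rowSums(emb^2))
  # Dot products of unit vectors are cosine similarities
  sims <- (unit %*% unit[word, ])[, 1]
  # Rank from most to least similar and keep the top n
  head(sort(sims, decreasing = TRUE), n)
}

neighbors <- nearest_toy_neighbors("democracy", toy)
print(round(neighbors, 3))
```

As in the GloVe output, the query word is its own nearest neighbor with similarity 1, so the first entry can be dropped when listing neighbors.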

What words are most closely associated with the word “democracy”?

nearest_glove_neighbors('democracy')
      democracy         freedom           unity    independence 
      1.0000000       0.7387126       0.7104656       0.7001239 
      political        movement      opposition           peace 
      0.6768800       0.6690903       0.6657763       0.6654979 
       freedoms        peaceful       pluralism      revolution 
      0.6632308       0.6627046       0.6584346       0.6490769 
democratization  reconciliation          regime          reform 
      0.6479734       0.6455120       0.6426720       0.6425315 
     solidarity       stability       socialist         reforms 
      0.6411311       0.6383692       0.6335430       0.6330405 
   dictatorship      democratic      multiparty       socialism 
      0.6287793       0.6263217       0.6260686       0.6223621 
     leadership      prosperity        politics          rights 
      0.6219702       0.6176357       0.6171205       0.6147376 
       struggle   establishment       communist             pro 
      0.6096227       0.6075584       0.6072464       0.6015733 
          human        equality   globalization      repression 
      0.6000897       0.5962141       0.5942674       0.5911100 
           rule           party       elections       communism 
      0.5904003       0.5897764       0.5890823       0.5866827 
        country           junta        progress      government 
      0.5859989       0.5846966       0.5846514       0.5792781 
     governance         support         leaders        economic 
      0.5790971       0.5778673       0.5744629       0.5720127 
 constitutional        dialogue      secularism          social 
      0.5718961       0.5716203       0.5705830       0.5704782 
     dissidents         society    constitution            hope 
      0.5679337       0.5672875       0.5638954       0.5630879 
        respect       activists         restore          nation 
      0.5617298       0.5615236       0.5610132       0.5600489 
       activism          ideals       coalition   revolutionary 
      0.5597208       0.5589497       0.5566700       0.5547408 
         leader           civic         myanmar         secular 
      0.5542812       0.5517091       0.5508700       0.5501198 
      undermine        autonomy        protests          ruling 
      0.5498861       0.5495106       0.5479063       0.5467623 
        dignity      principles       integrity         process 
      0.5460128       0.5457307       0.5447343       0.5434307 
        promote      legitimacy      transition        activist 
      0.5433193       0.5414689       0.5408329       0.5402462 
            suu        citizens     integration         dissent 
      0.5398066       0.5395628       0.5394340       0.5394053 
        efforts            free       apartheid      supporters 
      0.5377405       0.5369167       0.5349902       0.5346813 
           pro-        monarchy          ensure      capitalism 
      0.5346307       0.5335712       0.5335646       0.5330306 
    sovereignty           power      capitalist       crackdown 
      0.5324788       0.5313657       0.5309806       0.5306148 
        decades          agenda        uprising             kyi 
      0.5303713       0.5301271       0.5286190       0.5285096 

Practice Problems

  1. Explore some of the stereotypes reflected in the GloVe embeddings. How close is the word “professor” to female names compared to male names? Hispanic names?
  2. What about words that are ambiguous without context, like “bill” or “share”? What are their nearest neighbors?
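One possible way to approach the first problem is to average a target word’s cosine similarity over a list of names. The sketch below uses a tiny invented matrix so that it runs on its own. `mean_similarity()`, `toy`, and the placeholder row names are hypothetical; with the real data you would pass the `glove` matrix and name lists of your choosing.

```r
# Same cosine similarity definition as earlier in the exercise
cosine_similarity <- function(x1, x2) {
  sum(x1 * x2) / sqrt(sum(x1^2)) / sqrt(sum(x2^2))
}

# Average similarity between one target word and a set of comparison words
mean_similarity <- function(target, words, emb) {
  # Keep only words that appear in the embedding vocabulary
  words <- intersect(words, rownames(emb))
  sims <- sapply(words, function(w) cosine_similarity(emb[target, ], emb[w, ]))
  mean(sims)
}

# Invented embeddings with placeholder names (illustrative values only)
toy <- rbind(
  target  = c(1, 0),
  name_a1 = c(0.9, 0.1),
  name_a2 = c(0.8, 0.3),
  name_b1 = c(0.2, 0.9),
  name_b2 = c(0.1, 1)
)

group_a <- mean_similarity("target", c("name_a1", "name_a2", "not_in_vocab"), toy)
group_b <- mean_similarity("target", c("name_b1", "name_b2"), toy)
```

The glove.6B vocabulary is uncased, so names should be written in lowercase when looking them up in the `glove` matrix.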

Further Reading

Grimmer, Justin, Brandon M. Stewart, and Margaret E. Roberts. 2021. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, NJ: Princeton University Press.
Rodriguez, Pedro L., and Arthur Spirling. 2021. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” The Journal of Politics, May, 000–000. https://doi.org/10.1086/715162.

  1. See code/03_word-embeddings/federalist-embeddings.R if you are interested in how to train word embeddings on your own corpus.
