How to quantify what a word means.
Often, we’d like not only to count how frequently words appear, but also to get a sense of what the words mean. In a bag-of-words representation, we treat words like “president” and “executive” as separate indices in a word count vector, implicitly assuming that their meanings are entirely unrelated. But the statistical models we use to understand text data will perform better if words with similar meanings have similar representations. That is the purpose of the word embeddings approach, which represents each word as a vector and encodes the fact that “president” and “executive” overlap in meaning by placing their vectors close together.
In this exercise, we’ll work with a set of pre-trained word embeddings called GloVe. These off-the-shelf word embeddings tend to do a pretty good job at capturing meaning, even for political science specific applications (Rodriguez and Spirling 2021).1
For expository purposes, let’s download the 100-dimensional word embeddings here. That link downloads a zipped folder called ‘glove.6B/’. Unzip it and make sure it is in your ‘data/’ folder.
library(readr)  # provides read_table()

glove <- read_table("data/glove.6B/glove.6B.100d.txt",
                    col_names = FALSE)
Converting this data frame to a matrix object, with the words as row names, makes mathematical operations much faster and lets us look up a word’s vector by name.
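One way that conversion could look is sketched below. It assumes the first column of the data frame holds the word and the remaining columns hold the embedding dimensions; the helper name `embeddings_to_matrix` and the toy data are hypothetical, made up purely for illustration.

```r
# Hypothetical helper: turn a data frame of embeddings into a numeric matrix
# whose row names are the words (assumes column 1 holds the word).
embeddings_to_matrix <- function(df) {
  words <- df[[1]]
  mat <- as.matrix(df[, -1])
  rownames(mat) <- words
  colnames(mat) <- NULL
  mat
}

# Toy stand-in for the GloVe data frame (made-up numbers, 3 dimensions)
toy_df <- data.frame(X1 = c("president", "executive"),
                     X2 = c(0.1, 0.2),
                     X3 = c(0.3, 0.1),
                     X4 = c(-0.5, -0.4),
                     stringsAsFactors = FALSE)

toy_matrix <- embeddings_to_matrix(toy_df)
toy_matrix["president", ]  # look up a word's vector by row name

# With the real data, the same call would be:
# glove <- embeddings_to_matrix(glove)
```

Storing the words as row names is what makes indexing such as `glove['president',]` possible in the code that follows.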
It’s difficult to visualize and interpret a 100-dimensional vector space, but we can explore which words have similar meaning by looking at their cosine similarity.
# define cosine similarity
cosine_similarity <- function(x1, x2){
  sum(x1*x2) / sqrt(sum(x1^2)) / sqrt(sum(x2^2))
}
cosine_similarity(glove['president',],
glove['executive',])
[1] 0.6637274
cosine_similarity(glove['president',],
glove['legislative',])
[1] 0.4405054
cosine_similarity(glove['president',],
glove['judicial',])
[1] 0.363681
By looking at a word’s “nearest neighbors” in GloVe-space, we can get a sense of its meaning. The following function, adapted from Emil Hvitfeldt and Julia Silge, performs that computation.
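A minimal sketch of how such a function could be written in base R is given here. It is not necessarily the code that produced the output that follows: the argument names, the default of 100 neighbors, and the toy matrix are assumptions, and `embeddings` is assumed to be a numeric matrix with words as row names.

```r
# A sketch, not the original function: cosine similarity between one word
# and every row of an embedding matrix, sorted from most to least similar.
# The argument names and the default n = 100 are assumptions.
nearest_glove_neighbors <- function(word, embeddings = glove, n = 100) {
  target <- embeddings[word, ]
  # dot product of every row with the target vector
  dot_products <- as.vector(embeddings %*% target)
  # Euclidean norm of every row, and of the target vector
  row_norms <- sqrt(rowSums(embeddings^2))
  target_norm <- sqrt(sum(target^2))
  similarities <- dot_products / (row_norms * target_norm)
  names(similarities) <- rownames(embeddings)
  # keep the n most similar words (the word itself comes first)
  head(sort(similarities, decreasing = TRUE), n)
}

# Toy check with made-up 2-dimensional vectors
toy_embeddings <- rbind(democracy = c(1, 0),
                        freedom   = c(2, 0.5),
                        banana    = c(0, 3))
toy_neighbors <- nearest_glove_neighbors("democracy",
                                         embeddings = toy_embeddings,
                                         n = 2)
toy_neighbors
```

Computing all similarities with one matrix product, rather than looping over rows with `cosine_similarity()`, is a design choice made for speed on a vocabulary of this size.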
What words are most closely associated with the word “democracy”?
nearest_glove_neighbors('democracy')
democracy freedom unity independence
1.0000000 0.7387126 0.7104656 0.7001239
political movement opposition peace
0.6768800 0.6690903 0.6657763 0.6654979
freedoms peaceful pluralism revolution
0.6632308 0.6627046 0.6584346 0.6490769
democratization reconciliation regime reform
0.6479734 0.6455120 0.6426720 0.6425315
solidarity stability socialist reforms
0.6411311 0.6383692 0.6335430 0.6330405
dictatorship democratic multiparty socialism
0.6287793 0.6263217 0.6260686 0.6223621
leadership prosperity politics rights
0.6219702 0.6176357 0.6171205 0.6147376
struggle establishment communist pro
0.6096227 0.6075584 0.6072464 0.6015733
human equality globalization repression
0.6000897 0.5962141 0.5942674 0.5911100
rule party elections communism
0.5904003 0.5897764 0.5890823 0.5866827
country junta progress government
0.5859989 0.5846966 0.5846514 0.5792781
governance support leaders economic
0.5790971 0.5778673 0.5744629 0.5720127
constitutional dialogue secularism social
0.5718961 0.5716203 0.5705830 0.5704782
dissidents society constitution hope
0.5679337 0.5672875 0.5638954 0.5630879
respect activists restore nation
0.5617298 0.5615236 0.5610132 0.5600489
activism ideals coalition revolutionary
0.5597208 0.5589497 0.5566700 0.5547408
leader civic myanmar secular
0.5542812 0.5517091 0.5508700 0.5501198
undermine autonomy protests ruling
0.5498861 0.5495106 0.5479063 0.5467623
dignity principles integrity process
0.5460128 0.5457307 0.5447343 0.5434307
promote legitimacy transition activist
0.5433193 0.5414689 0.5408329 0.5402462
suu citizens integration dissent
0.5398066 0.5395628 0.5394340 0.5394053
efforts free apartheid supporters
0.5377405 0.5369167 0.5349902 0.5346813
pro- monarchy ensure capitalism
0.5346307 0.5335712 0.5335646 0.5330306
sovereignty power capitalist crackdown
0.5324788 0.5313657 0.5309806 0.5306148
decades agenda uprising kyi
0.5303713 0.5301271 0.5286190 0.5285096
Grimmer, Stewart, and Roberts (2021), Chapter 8.
Hvitfeldt & Silge, Chapter 5.
See code/03_word-embeddings/federalist-embeddings.R if you are interested in how to train word embeddings on your own corpus.