Text Embeddings

How to quantify what text means.

Often, we’d like to not only count the frequency of words, but also get a sense of what the words mean. In a bag of words representation, we treat words like “president” and “executive” as separate indices in a word count vector, implicitly assuming that their meanings are completely unrelated. But the statistical models we use to understand text data will perform better if words with similar meanings have similar representations. That is the purpose of the embeddings approach, which represents each word (or document) as a vector, encoding the fact that “president” and “executive” have some overlapping meaning by placing their vectors close together.

In this exercise, we’ll work with a set of pretrained text embeddings from OpenAI. These off-the-shelf embeddings tend to do a pretty good job of capturing meaning, even for applications specific to political science (Rodriguez and Spirling 2021).1

The fuzzylink R package contains a convenience function called get_embeddings() which we will use for this exercise. You will need an account with OpenAI and an API key. Once you have your API key, you should save it to your R environment with the following line of code (pasting in your key):

fuzzylink::openai_api_key('<YOUR API KEY GOES HERE>', install = TRUE)

For more information on setup, see the package documentation.
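After installing the key, you can confirm that it is visible to your R session. This sketch assumes fuzzylink stores the key in the OPENAI_API_KEY environment variable; check the package documentation if the check comes back empty:

```r
# Returns TRUE if an API key was installed, FALSE otherwise
Sys.getenv('OPENAI_API_KEY') != ''
```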

Retrieving Embeddings

To retrieve text embeddings, create a vector of words you want to embed, and pass it to the get_embeddings() function. The result is a matrix where each row is a 256-dimensional vector representing a word’s meaning.

words <- c('president', 'executive', 'legislative',
           'judicial', 'banana')

embeddings <- get_embeddings(words)

embeddings['president',]
  [1] -0.0099220990  0.0766712200 -0.0199635620  0.0252652970
  [5]  0.0609152280  0.0248276300 -0.0137964430  0.1803186200
  [9]  0.0088428540  0.0738860740  0.0068534603  0.0810081000
 [13] -0.1431567500 -0.0838728250 -0.1040054860  0.0196253660
 [17] -0.0331631900  0.0466313800 -0.0827587700 -0.0937402200
 [21] -0.0416976850 -0.0347944900 -0.0314125230 -0.0205902220
 [25]  0.0786208300  0.0137168680  0.0044910560 -0.0268966000
 [29] -0.0183223130  0.0173176700 -0.0522016850  0.1741117100
 [33] -0.0166313300 -0.0074801194  0.0234947370  0.0376393240
 [37]  0.0292042960 -0.0199337230  0.0753980100 -0.0785014600
 [41] -0.0606765000 -0.0111406040 -0.0729709500  0.0175464500
 [45]  0.0152089130  0.0991513650 -0.0432494130 -0.0176658130
 [49]  0.0628648300  0.0574934700 -0.0049162884  0.0615518300
 [53]  0.0687136500  0.0512069870 -0.0481433200  0.0196949950
 [57]  0.0754378000 -0.0744033160 -0.0464722300 -0.1178516700
 [61] -0.0036754042 -0.0280305540 -0.0993901000  0.0649338000
 [65]  0.0283289630 -0.0582892260  0.0547481070  0.0954908900
 [69] -0.0713794400  0.1036871800  0.0419364120 -0.0612733180
 [73] -0.0300398410  0.0394695660  0.0252851900  0.0343767180
 [77] -0.0451194420 -0.0220225840 -0.0052271313  0.0238926150
 [81]  0.0718966800  0.0091512100  0.1890719500  0.0022442844
 [85] -0.0673608600 -0.0406632000 -0.0243103880 -0.2250401800
 [89] -0.0202520250 -0.0209881010  0.0949338500 -0.0085444450
 [93] -0.1005041500  0.0334218100  0.1233423950 -0.0276326740
 [97]  0.0009455836 -0.0925465800 -0.0023002361 -0.1351991700
[101] -0.1847748600  0.0175762900 -0.0495756830  0.0844298600
[105]  0.0367440950 -0.0153581170 -0.0169496310 -0.0167208520
[109] -0.0108421940 -0.0057443734 -0.0378382600 -0.0249469930
[113]  0.0616711970  0.0419762020 -0.0389921100 -0.0018986274
[117] -0.0524802000  0.0634218600 -0.0489390800 -0.0292639770
[121] -0.0885677900 -0.0781035900 -0.0700266500 -0.0677587400
[125]  0.0272347960 -0.0200829260  0.0352719460  0.0755571650
[129] -0.0342772500 -0.0891248300 -0.1087004540  0.0732892500
[133] -0.0523210470  0.0311539000 -0.0269363880  0.0644563400
[137] -0.0368833540  0.0295225980 -0.0098773380  0.0288263110
[141] -0.0845890100  0.0555040760 -0.0737667100 -0.0824404660
[145]  0.1198410600  0.0111008160  0.0958091840 -0.0849868900
[149]  0.1369498400 -0.0233156900 -0.0167606400 -0.0441645350
[153]  0.0096883460  0.0936606450 -0.0250464640 -0.0005632470
[157] -0.0266976600  0.0017257988 -0.0398276560 -0.0407029900
[161] -0.0054956990  0.0649338000  0.0006247939  0.0366645200
[165]  0.0014920451 -0.0197646250  0.0461937150 -0.0389722180
[169] -0.0204509650  0.0354311000 -0.0101210390 -0.0370027160
[173] -0.0724934900 -0.0792574360 -0.0487799270  0.0452785940
[177]  0.1032097340 -0.0380372030  0.0333223380  0.0388926400
[181]  0.0386738070  0.0104841030  0.0401857460 -0.0229973890
[185] -0.0582494400  0.0511274100  0.0436870800 -0.0426525960
[189] -0.0559019560 -0.1176129360  0.0191578590 -0.0106830430
[193] -0.0413395950 -0.0680372600  0.0723741350 -0.0612733180
[197]  0.0842707100 -0.0682759800 -0.0617507730 -0.0781831600
[201]  0.0608754400  0.0144131560  0.0340783100  0.0523210470
[205] -0.1024139750 -0.0218634340  0.0672812800  0.0266180840
[209] -0.1051195500  0.0763529200  0.1281169400 -0.0834749500
[213]  0.1232628150  0.0028125050 -0.0533157440 -0.0012222336
[217] -0.0557030140  0.0116180570  0.0169197920 -0.0606367100
[221]  0.0989126400 -0.0796553100 -0.0540717130  0.0386141280
[225]  0.0492573830 -0.0727322250 -0.0706632500  0.0788197700
[229] -0.0206300100  0.0524404120 -0.0196054730  0.0709019800
[233]  0.0312334760  0.0537534100 -0.0996288200  0.0432892000
[237]  0.0310942200  0.0936606450  0.0048068720  0.0605173500
[241]  0.0012707250  0.0154376930 -0.0077835020  0.0484616230
[245]  0.0603979830  0.0221817360 -0.1143503340 -0.0328249930
[249] -0.0284483270 -0.0685147100 -0.0258024330  0.0026359463
[253] -0.0003332234 -0.0169794730  0.0649338000 -0.0030984802
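Each word gets its own row in the matrix, so continuing the example above, a quick sanity check on the object’s shape should show five rows (one per word) and 256 columns (one per embedding dimension):

```r
# Continuing the example above
dim(embeddings)       # 5 256
rownames(embeddings)  # "president" "executive" "legislative" "judicial" "banana"
```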

It’s difficult to visualize and interpret a 256-dimensional vector space, but we can explore which words have similar meanings by looking at their cosine similarity. The get_similarity_matrix() function returns the cosine similarity for each pair of vectors in an embedding matrix (or for a subset of those vectors, if you prefer).

get_similarity_matrix(embeddings)
            president executive legislative  judicial    banana
president   0.9999999 0.6483657   0.4321773 0.4690661 0.3034118
executive   0.6483657 1.0000001   0.5697680 0.5728509 0.2424244
legislative 0.4321773 0.5697680   1.0000000 0.6276265 0.1904764
judicial    0.4690661 0.5728509   0.6276265 1.0000000 0.1964824
banana      0.3034118 0.2424244   0.1904764 0.1964824 1.0000000
get_similarity_matrix(embeddings,
                      'president',
                      c('executive', 'banana'))
          executive    banana
president 0.6483657 0.3034118
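If you’d like to see what get_similarity_matrix() is computing under the hood, cosine similarity is just the dot product of two vectors divided by the product of their lengths (Euclidean norms). A minimal base R sketch (the function name cosine_similarity here is our own, not part of fuzzylink):

```r
# Cosine similarity: dot product divided by the product of vector norms
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Identical vectors have similarity 1; orthogonal vectors have similarity 0
cosine_similarity(c(1, 2, 3), c(1, 2, 3))  # 1
cosine_similarity(c(1, 0), c(0, 1))        # 0
```

Because the measure depends only on the angle between vectors (not their magnitudes), it works well for comparing embeddings of words that appear with very different frequencies.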

This is how large language models like ChatGPT encode meaning, and it seems to work pretty well: pairs of words with similar meanings have higher cosine similarity scores. In the next module we’ll explore how to use these embedding representations to fit models for discovery, prediction, measurement, and inference.

Practice Problems

  1. Explore some of the stereotypes reflected in the OpenAI embeddings. How close is the word “professor” to female names compared to male names? Hispanic names?
  2. What about words that are ambiguous without context, like “bill” or “share”? What are their nearest neighbors?
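One way to approach the nearest-neighbor question is to embed the ambiguous word alongside a larger vocabulary, then sort the corresponding row of the similarity matrix. A sketch (this requires your OpenAI API key, and the small vocabulary below is purely illustrative; in practice you would compare against a much longer word list):

```r
library(fuzzylink)

# An illustrative vocabulary for probing the meaning of "bill"
vocab <- c('bill', 'invoice', 'legislation', 'beak', 'duck',
           'senator', 'receipt', 'payment')

embeddings <- get_embeddings(vocab)
sims <- get_similarity_matrix(embeddings, 'bill', vocab)

# Sort to find the nearest neighbors; the word itself
# will appear first with similarity near 1
sort(sims[1, ], decreasing = TRUE)
```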

Further Reading

Grimmer, Justin, Brandon M. Stewart, and Margaret E. Roberts. 2021. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton: Princeton University Press.
Rodriguez, Pedro L., and Arthur Spirling. 2021. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” The Journal of Politics, May, 000–000. https://doi.org/10.1086/715162.

  1. See code/03_word-embeddings/federalist-embeddings.R if you are interested in how to train word embeddings on your own corpus.↩︎
