A dumb auto-complete model – trained on the entire Internet – can accomplish a remarkable number of tasks if you ask it nicely.
Note (7/30/2024): I have developed a new R package (promptr) with improved functionality for many of the tasks I perform with the text2data package on this page. All the code here should still work, but I encourage you to check out the new package!
Large language models (LLMs) have transformed the way computer scientists approach natural language processing over the past five years. These models, trained to predict the next word in a sequence, make use of a massive quantity of text data from the Internet and digitized books. In this tutorial, I will hand-wave over exactly how LLMs perform this task, focusing instead on how to adapt such models for use in social science applications. If you are interested in what’s going on under the hood, I highly recommend the following two blog posts for not-too-technical introductions:
In a nutshell, LLMs represent words using embeddings, and are trained to predict the most likely next word in a sequence using a model architecture called the transformer (Vaswani et al. 2017). A powerful feature of this type of model is that the embeddings representing the input sequence are iteratively updated based on which words appear nearby (a process called self-attention). This allows an LLM to flexibly represent words based on their context, which is useful in cases where the same word can mean different things in different contexts. For example, an LLM will represent the word “bill” differently depending on whether it appears in the phrase “sign the bill”, “foot the bill”, or “Hillary and Bill”.
What’s remarkable is that if we have a model that is sufficiently good at predicting the next word in a sequence, we can adapt it to perform all sorts of text-as-data tasks, by creating a prompt that converts our desired task into a next-word prediction problem (Ornstein, Blasingame, and Truscott 2022). To show you how, we will use the text2data R package to create LLM prompts and submit them to OpenAI’s GPT-3.¹ Let’s start by getting the R package set up.
The text2data Package
The package is currently available as a GitHub repository. To install, first make sure you have the devtools package available, then you can install the package with the following line of code:
devtools::install_github('joeornstein/text2data')
If you have not yet installed Python and the reticulate package on your computer, follow the instructions for doing so here. Once that step is complete, you can set up OpenAI’s Python module using the setup_openai() function, which will allow you to submit prompts to the GPT-3 API.
library(text2data)
setup_openai()
Next, you will need an account with OpenAI. You can sign up for one here, after which you will need to generate an API key here. I recommend adding this API key as a variable in your operating system environment called OPENAI_API_KEY; that way you won’t risk leaking it by hard-coding it into your R scripts. The text2data package will automatically look for your API key under that variable name, and will prompt you to enter the API key manually if it can’t find one there. If you’re unfamiliar with setting Environment Variables in your operating system, here are some helpful instructions. Note that you may need to restart your computer after completing this step.
When the setup is complete, we can begin.
The workhorse function of the text2data package is complete_prompt(). This function submits a sequence of words (the prompt) to the OpenAI API, and returns a dataframe with the five highest-probability next-word predictions and their associated probabilities.²
library(text2data)
complete_prompt(prompt = 'My favorite food is')
response prob
1 pizza 0.08614531
2 a 0.03913327
3 sushi 0.03705063
4 chicken 0.01834140
5 pasta 0.01819866
If you prefer the model to auto-regressively generate sequences of text instead of outputting the next-word probabilities, set the max_tokens argument greater than 1. The function will return a character object with the most likely completion at each point in the sequence.
complete_prompt(prompt = 'My favorite food is',
max_tokens = 6)
[1] " pizza. I love pizza."
If we want GPT-3 to perform a classification task, we can structure the prompt so that the best next-word prediction is the classification we want (a process called adaptation). Consider, for example, the following prompt.
prompt <- 'Decide whether the following statement is happy, sad, or neither.\n\nText: I feel happy.\nClassification:'
cat(prompt)
Decide whether the following statement is happy, sad, or neither.
Text: I feel happy.
Classification:
complete_prompt(prompt)
response prob
1 happy 0.25048140
2 Happy 0.14383937
3 Sad 0.09507202
4 sad 0.05949457
5 Neither 0.04669781
The result is a probability vector with the most likely completions. The correct classification – happy – is assigned the highest probability, though note that some post-processing is necessary to combine the “happy” and “Happy” responses.
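For example, one way to handle that post-processing is to fold the responses to a common case before summing (a sketch using dplyr and stringr, assuming the response and prob columns shown above):
library(dplyr)
library(stringr)
complete_prompt(prompt) |>
  # collapse case variants like 'happy' and 'Happy' into a single category
  mutate(response = str_to_lower(str_trim(response))) |>
  group_by(response) |>
  summarize(prob = sum(prob))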
A few-shot prompt includes several completed examples in the text of the prompt to demonstrate the desired output. Prompts structured this way tend to perform significantly better than zero-shot prompts with no examples (Brown et al. 2020).
prompt <- 'Decide whether the following statement is happy, sad, or neither.\n\nText: What should we do today?.\nClassification: Neither\n\nText: My puppy is so cute today.\nClassification: Happy\n\nText: The news is bumming me out.\nClassification: Sad\n\nText: I feel happy.\nClassification:'
cat(prompt)
Decide whether the following statement is happy, sad, or neither.
Text: What should we do today?.
Classification: Neither
Text: My puppy is so cute today.
Classification: Happy
Text: The news is bumming me out.
Classification: Sad
Text: I feel happy.
Classification:
complete_prompt(prompt)
response prob
1 Happy 0.868470778
2 Neither 0.071847533
3 Sad 0.014258582
4 None 0.004738859
5 Both 0.004486662
Manually typing prompts with multiple few-shot examples can be tedious and error-prone, particularly when performing the sort of context-specific prompting we recommend in our paper (Ornstein, Blasingame, and Truscott 2022). The format_prompt() function is a useful tool to aid in that process.
The function is designed with classification problems in mind. If you input the text you would like to classify along with a set of instructions, the default prompt template looks like this:
library(text2data)
format_prompt(text = 'I am feeling happy today.',
instructions = 'Decide whether this statement is happy, sad, or neither.')
Decide whether this statement is happy, sad, or neither.
Text: I am feeling happy today.
Classification:
You can customize the template using glue syntax, with placeholders for the text you want to classify ({text}) and the desired label ({label}).
format_prompt(text = 'I am feeling happy today.',
instructions = 'Decide whether this statement is happy or sad.',
template = 'Statement: {text}\nSentiment: {label}')
Decide whether this statement is happy or sad.
Statement: I am feeling happy today.
Sentiment:
This is particularly useful when including few-shot examples in the prompt. If you input these examples as a tidy dataframe with the columns text and label, the format_prompt() function will paste them into the prompt according to the template. To illustrate, let’s classify the sentiment of a set of tweets about the Supreme Court of the United States, a dataset which is included with the text2data package.
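Before formatting the prompt, it can be useful to inspect those labeled examples. A quick peek (using the text, label, and case columns described above):
library(dplyr)
scotus_tweets_examples |>
  filter(case == 'masterpiece') |>
  select(text, label)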
We can format our few-shot prompt template using format_prompt(), leaving the {TWEET} placeholder for the text we want to classify:
library(tidyverse)
library(glue)
prompt <- format_prompt(text = '{TWEET}',
instructions = 'Classify the sentiment of these tweets as Positive, Neutral, or Negative.',
examples = scotus_tweets_examples |>
filter(case == 'masterpiece'),
template = 'Tweet: {text}\nSentiment: {label}')
prompt
Classify the sentiment of these tweets as Positive, Neutral, or Negative.
Tweet: Thank you Supreme Court I take pride in your decision!!!!✝️ #SCOTUS
Sentiment: Positive
Tweet: Supreme Court rules in favor of Colorado baker! This day is getting better by the minute!
Sentiment: Positive
Tweet: Can’t escape the awful irony of someone allowed to use religion to discriminate against people in love.
Not my Jesus.
#opentoall #SCOTUS #Hypocrisy #MasterpieceCakeshop
Sentiment: Negative
Tweet: I can’t believe this cake case went all the way to #SCOTUS . Can someone let me know what cake was ultimately served at the wedding? Are they married and living happily ever after?
Sentiment: Neutral
Tweet: Supreme Court rules in favor of baker who would not make wedding cake for gay couple
Sentiment: Neutral
Tweet: #SCOTUS set a dangerous precedent today. Although the Court limited the scope to which a business owner could deny services to patrons, the legal argument has been legitimized that one's subjective religious convictions trump (no pun intended) #humanrights. #LGBTQRights
Sentiment: Negative
Tweet: {TWEET}
Sentiment:
Using the glue() function, we can insert tweets into that template.
TWEET <- scotus_tweets$text[42]
TWEET
[1] "This Supreme Court ruling highlights why there needs to be term limits for scotus, healthcare needs to uncoupled from employment, Christianity brings nothing but evil but, most importantly it shows you why voting at every level matters"
glue(prompt)
Classify the sentiment of these tweets as Positive, Neutral, or Negative.
Tweet: Thank you Supreme Court I take pride in your decision!!!!✝️ #SCOTUS
Sentiment: Positive
Tweet: Supreme Court rules in favor of Colorado baker! This day is getting better by the minute!
Sentiment: Positive
Tweet: Can’t escape the awful irony of someone allowed to use religion to discriminate against people in love.
Not my Jesus.
#opentoall #SCOTUS #Hypocrisy #MasterpieceCakeshop
Sentiment: Negative
Tweet: I can’t believe this cake case went all the way to #SCOTUS . Can someone let me know what cake was ultimately served at the wedding? Are they married and living happily ever after?
Sentiment: Neutral
Tweet: Supreme Court rules in favor of baker who would not make wedding cake for gay couple
Sentiment: Neutral
Tweet: #SCOTUS set a dangerous precedent today. Although the Court limited the scope to which a business owner could deny services to patrons, the legal argument has been legitimized that one's subjective religious convictions trump (no pun intended) #humanrights. #LGBTQRights
Sentiment: Negative
Tweet: This Supreme Court ruling highlights why there needs to be term limits for scotus, healthcare needs to uncoupled from employment, Christianity brings nothing but evil but, most importantly it shows you why voting at every level matters
Sentiment:
We can then pipe that prompt into complete_prompt() to output the desired classification:
prompt |>
glue() |>
complete_prompt()
response prob
1 Negative 0.7229223011
2 Neutral 0.1358330800
3 Positive 0.1276033570
4 Mixed 0.0048710128
5 0.0007889828
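If all you need is the single most likely label rather than the full probability vector, a short sketch:
prompt |>
  glue() |>
  complete_prompt() |>
  slice_max(prob, n = 1) |> # keep only the highest-probability response
  pull(response)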
As we saw in the sentiment analysis tutorial, conventional methods for classifying sentiment perform pretty poorly on text from social media. Does this approach perform any better? To find out, let’s complete the prompt for each of the tweets in the scotus_tweets dataframe related to the Masterpiece Cakeshop ruling, create a measure of sentiment, and compare it against the human coders.
First, we create our prompt template with format_prompt(), this time with a bit more detail in the instructions. For few-shot examples, we’ll use the ‘masterpiece’ tweets in scotus_tweets_examples. These are six tweets that were unanimously coded by three human annotators as Positive, Negative, or Neutral (two per category).
prompt <- format_prompt(text = '{TWEET}',
instructions = 'Read these tweets posted the day after the US Supreme Court ruled in
favor of a baker who refused to bake a wedding cake for a same-sex couple.
For each tweet, decide whether its sentiment is Positive, Neutral, or Negative.',
examples = scotus_tweets_examples |>
filter(case == 'masterpiece'),
template = 'Tweet: {text}\nSentiment: {label}')
cat(prompt)
Read these tweets posted the day after the US Supreme Court ruled in favor of a baker who refused to bake a wedding cake for a same-sex couple. For each tweet, decide whether its sentiment is Positive, Neutral, or Negative.
Tweet: Thank you Supreme Court I take pride in your decision!!!!✝️ #SCOTUS
Sentiment: Positive
Tweet: Supreme Court rules in favor of Colorado baker! This day is getting better by the minute!
Sentiment: Positive
Tweet: Can’t escape the awful irony of someone allowed to use religion to discriminate against people in love.
Not my Jesus.
#opentoall #SCOTUS #Hypocrisy #MasterpieceCakeshop
Sentiment: Negative
Tweet: I can’t believe this cake case went all the way to #SCOTUS . Can someone let me know what cake was ultimately served at the wedding? Are they married and living happily ever after?
Sentiment: Neutral
Tweet: Supreme Court rules in favor of baker who would not make wedding cake for gay couple
Sentiment: Neutral
Tweet: #SCOTUS set a dangerous precedent today. Although the Court limited the scope to which a business owner could deny services to patrons, the legal argument has been legitimized that one's subjective religious convictions trump (no pun intended) #humanrights. #LGBTQRights
Sentiment: Negative
Tweet: {TWEET}
Sentiment:
Next, we’ll create a function that takes the dataframe of next-word predictions from GPT-3 and converts it into a measure of sentiment. For this tutorial, our measure will be \(P(\text{positive}) - P(\text{negative})\).
gpt3_sentiment_score <- function(df){
p_positive <- df |>
# put responses in all caps
mutate(response = str_to_upper(response)) |>
# keep and sum probabilities assigned to "POS"
filter(str_detect(response, 'POS')) |>
pull(prob) |>
sum()
p_negative <- df |>
# put responses in all caps
mutate(response = str_to_upper(response)) |>
# keep and sum probabilities assigned to "NEG"
filter(str_detect(response, 'NEG')) |>
pull(prob) |>
sum()
return(p_positive - p_negative)
}
We can add this function to the end of the pipeline we developed before.
prompt |>
glue() |>
complete_prompt() |>
gpt3_sentiment_score()
[1] -0.7517173
With these elements in place, let’s loop through the scotus_tweets dataset and estimate sentiment scores for each tweet referencing Masterpiece Cakeshop. Be mindful before you run this code. At current prices, OpenAI will charge approximately 0.07 cents per prompt, for a total of $0.65 if you classify all 945 tweets.
scotus_tweets$gpt3_sentiment <- NA
for(i in 1:nrow(scotus_tweets)){
# skip the non-masterpiece tweets
if(scotus_tweets$case[i] != 'masterpiece'){
next
}
# get the tweet text
TWEET <- scotus_tweets$text[i]
# format the prompt, send it to GPT-3, and construct the sentiment measure
scotus_tweets$gpt3_sentiment[i] <- prompt |>
glue() |>
complete_prompt() |>
gpt3_sentiment_score()
}
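If you prefer to avoid the explicit loop, the same computation can be written with purrr; this sketch only sends the Masterpiece Cakeshop tweets to the API:
library(purrr)
masterpiece <- which(scotus_tweets$case == 'masterpiece')
scotus_tweets$gpt3_sentiment[masterpiece] <-
  map_dbl(scotus_tweets$text[masterpiece],
          # glue() fills the {TWEET} placeholder from the function argument
          \(TWEET) prompt |> glue() |> complete_prompt() |> gpt3_sentiment_score())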
Plotting those sentiment scores against the average of the hand-coded scores, we can see that this measure is much better than the one from the dictionary method we tried before. The correlation between the two scores is 0.75.
ggplot(data = scotus_tweets,
mapping = aes(x = (expert1 + expert2 + expert3) / 3,
y = gpt3_sentiment)) +
geom_jitter(width = 0.1) +
labs(x = 'Hand-Coded Sentiment Score',
y = 'GPT-3 Sentiment Score') +
theme_minimal()
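To reproduce that correlation yourself (a sketch using the same expert columns as the plot above):
scotus_tweets |>
  filter(case == 'masterpiece') |>
  summarize(correlation = cor((expert1 + expert2 + expert3) / 3,
                              gpt3_sentiment,
                              use = 'complete.obs'))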
All in all, few-shot prompting an LLM is a powerful method for classifying documents without the need for the extensive fine-tuning or large training datasets that supervised learning methods require.
In the OCR tutorial, we worked with a newspaper clipping about the sinking of the Titanic.
Let’s focus for now on the first column.
first_column <- image_crop(image,
geometry = '336x660+0+0')
first_column
Though a human can easily read and interpret this text, converting the image to plain text is a non-trivial problem in the field of computer vision. Note that the text is slightly tilted in places, and some lines are squished or cut off. A smudge obscures some letters in the lower left corner.
How well does OCR capture the text from that image?
Titanic Sank at 2:20 A. M. Monday.
In the White Star offices the hope
was held out all day that the Parisian
ond the Virginian had taken off some
-{ the Titanic's passengers, and efforts
were made to get into communication
with these liners, Until such commu-
nication was established the White
Star officials reused tu recusiiee ce
possibility that there were none of the
Titanic’s passengers aboard them.
But by nightfall came the message
from Capt. Haddock of the Olympic to
Cape Race,’ Newfoundland, telling of
the foundering of the Titanic and of
the rescue of 655 of her passengers by
the Cunarder Carpathia, which, the
Wireless message waid, reached the pos!-
tion of the Titanic at daybreak. All
they found there, however, was lif¢-
boats and wreckage. The biggest ship
in the world had sunk at 2:20 o'clock
yesterday morning.
Mr, Franklin admitted late last ulght
thet the Parisian and the Virginian,
though they were among the first to
answer the Titanic’s calls for help,
could not have reached the scene before
10 o'clock yesterday morning, seven
and a half hours after the big Titanic
buried her nese beneath the waves and
Pitched downward out of sight. The
Carpathia, so the wireless dispatch
m Capt. Haddock to Cape Race an-.
jounced, reached the scene of the Ti-,
‘sanic's foundering at daybreak, several
As is common with OCR, the result is generally quite good, but not perfect. Notice, for example, the phrases “reused tu recusiiee ce”, “lif¢-boats”, and “Ti-,’sanic’s”. Look closely at the letters in the image above to see how the OCR algorithm may have made those mistakes. When you consider the image letter-by-letter, ignoring the semantic context in which those letters are placed, it is easy to mistake some letters for others. But when humans read a passage like this, they’re not reading it letter-by-letter. Instead, fluent readers learn to recognize whole words at a time, using their knowledge of the language to “predict” what the constituent letters of a word must be, even when they are difficult to make out on the page. In the same way, we can make use of the fact that GPT-3 is very good at predicting words in sequences to impute what the phrase “reused tu recusiiee ce” is likely to have been, given the context.
To do so, we can prompt GPT-3 like so:
prompt <- glue('Create a copy of the following passage, correcting any OCR errors.\n---\nOriginal Passage:\n\n{text}\n---\nCorrected Passage:\n\n')
cat(prompt)
Create a copy of the following passage, correcting any OCR errors.
---
Original Passage:
Titanic Sank at 2:20 A. M. Monday.
In the White Star offices the hope
was held out all day that the Parisian
ond the Virginian had taken off some
-{ the Titanic's passengers, and efforts
were made to get into communication
with these liners, Until such commu-
nication was established the White
Star officials reused tu recusiiee ce
possibility that there were none of the
Titanic’s passengers aboard them.
But by nightfall came the message
from Capt. Haddock of the Olympic to
Cape Race,’ Newfoundland, telling of
the foundering of the Titanic and of
the rescue of 655 of her passengers by
the Cunarder Carpathia, which, the
Wireless message waid, reached the pos!-
tion of the Titanic at daybreak. All
they found there, however, was lif¢-
boats and wreckage. The biggest ship
in the world had sunk at 2:20 o'clock
yesterday morning.
Mr, Franklin admitted late last ulght
thet the Parisian and the Virginian,
though they were among the first to
answer the Titanic’s calls for help,
could not have reached the scene before
10 o'clock yesterday morning, seven
and a half hours after the big Titanic
buried her nese beneath the waves and
Pitched downward out of sight. The
Carpathia, so the wireless dispatch
m Capt. Haddock to Cape Race an-.
jounced, reached the scene of the Ti-,
‘sanic's foundering at daybreak, several
---
Corrected Passage:
cleaned_text <- complete_prompt(
prompt = prompt,
max_tokens = nchar(text),
model = 'gpt-3.5-turbo')
cat(cleaned_text)
Titanic sank at 2:20 A.M. Monday.
In the White Star offices the hope
was held out all day that the Parisian
and the Virginian had taken off some
of the Titanic's passengers, and efforts
were made to get into communication
with these liners. Until such commu-
nication was established the White
Star officials refused to concede the
possibility that there were none of the
Titanic's passengers aboard them.
But by nightfall came the message
from Capt. Haddock of the Olympic to
Cape Race, Newfoundland, telling of
the foundering of the Titanic and of
the rescue of 655 of her passengers by
the Cunarder Carpathia, which, the
wireless message said, reached the post-
tion of the Titanic at daybreak. All
they found there, however, was life-
boats and wreckage. The biggest ship
in the world had sunk at 2:20 o'clock
yesterday morning.
Mr. Franklin admitted late last night
that the Parisian and the Virginian,
though they were among the first to
answer the Titanic's calls for help,
could not have reached the scene before
10 o'clock yesterday morning, seven
and a half hours after the big Titanic
buried her nose beneath the waves and
pitched downward out of sight. The
Carpathia, so the wireless dispatch
from Capt. Haddock to Cape Race an-
nounced, reached the scene of the Ti-
tanic's foundering at daybreak, several
The resulting text is a nearly perfect transcription, despite the garbled inputs! The text2data::clean_ocr() function performs these steps in a single function call.
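A minimal sketch of that call, assuming clean_ocr() takes the raw OCR text as its only required argument:
cleaned_text <- clean_ocr(text)
cat(cleaned_text)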
Titanic sank at 2:20 A.M. Monday.
In the White Star offices the hope
was held out all day that the Parisian
and the Virginian had taken off some
of the Titanic's passengers, and efforts
were made to get into communication
with these liners. Until such commu-
nication was established the White
Star officials refused to concede the
possibility that there were none of the
Titanic's passengers aboard them.
But by nightfall came the message
from Capt. Haddock of the Olympic to
Cape Race, Newfoundland, telling of
the foundering of the Titanic and of
the rescue of 655 of her passengers by
the Cunarder Carpathia, which, the
wireless message said, reached the post-
tion of the Titanic at daybreak. All
they found there, however, was life-
boats and wreckage. The biggest ship
in the world had sunk at 2:20 o'clock
yesterday morning.
Mr. Franklin admitted late last night
that the Parisian and the Virginian,
though they were among the first to
answer the Titanic's calls for help,
could not have reached the scene before
10 o'clock yesterday morning, seven
and a half hours after the big Titanic
buried her nose beneath the waves and
pitched downward out of sight. The
Carpathia, so the wireless dispatch
from Capt. Haddock to Cape Race an-
nounced, reached the scene of the Ti-
tanic's foundering at daybreak, several
One of the most labor-intensive tasks in a research workflow is converting unstructured text into structured datasets. Traditionally, this task could only be delegated to human research assistants, but with LLMs a truly automated workflow is feasible. Consider the following prompt:
Create a data table from the following passage.
---
The little dog laughed to see such fun, and the dish ran away with the spoon.
---
Data Table:
Character | What They Did
---|---
When we submit this prompt to ChatGPT, it yields a data table delimited by vertical bars, which can then be read into a dataframe.
prompt <- 'Create a data table from the following passage.\n---\nThe little dog laughed to see such fun, and the dish ran away with the spoon.\n---\nData Table:\nCharacter | What They Did \n---|---\n'
response <- complete_prompt(prompt,
max_tokens = 100,
model = 'gpt-3.5-turbo')
response
[1] "Little dog | Laughed \nDish | Ran away with the spoon"
df <- read_delim(response,
delim = '|',
col_names = FALSE)
df
# A tibble: 2 × 2
X1 X2
<chr> <chr>
1 "Little dog " " Laughed "
2 "Dish " " Ran away with the spoon"
Formatting this sort of prompt by hand can be tedious and error-prone, so the parse_text() function puts all those steps together in a more convenient interface.
parse_text(instructions = 'Create a data table from the following passage.',
text = 'The little dog laughed to see such fun, and the dish ran away with the spoon.',
col_names = c('Character', 'What They Did'))
# A tibble: 2 × 2
Character `What They Did`
<chr> <chr>
1 Little dog Laughed
2 Dish Ran away with the spoon
As with classification, more detailed instructions typically yield better completions.
instructions <- 'Create a data table based on the following passage. The table should include
information on (1) the names of the characters, (2) their hair color, and (3) their shoe sizes.
Use NA for missing information.'
text <- "Jack and Jill went up the hill to fetch a pail of water. Jack fell down and broke his
crown, revealing his stunning golden locks. Investigators on the scene were only able to recover
Jack's shoes (size 10) and Jill's shoes (size 8)."
parse_text(text = text,
instructions = instructions,
col_names = c('Name of Character', 'Hair Color', 'Shoe Size'))
# A tibble: 2 × 3
`Name of Character` `Hair Color` `Shoe Size`
<chr> <chr> <chr>
1 Jack Golden 10
2 Jill NA 8
Take a sample of the Senate press releases from the clustering tutorial. Format a few-shot GPT-3 prompt that returns a list of topics. Are the topic labels sensible? Are they similar to what we came up with using unsupervised methods?
Ask GPT-3 to summarize the remarks on pages 4-5 of the PDF that you imported in the OCR practice problems.
1. Like many others, I have qualms about using proprietary, closed-source models for scientific research (Spirling 2023). But as of summer 2023 the ease-of-use and capabilities of OpenAI’s models are sufficiently beyond those of similar open-source models that it makes sense to start here. I will update both this page and the R package when I have identified a suitable set of open-source alternatives.
2. Note that the default model used by complete_prompt() is “davinci-002”, the 175-billion parameter base GPT-3 model. You can prompt different model variants using the model argument.