A dumb auto-complete model – trained on the entire Internet – can accomplish a remarkable number of tasks if you ask it nicely.
Large language models (LLMs) have transformed the way computer scientists approach natural language processing over the past decade. These models, trained to predict the next word in a sequence, make use of a massive quantity of text data from the Internet and digitized books. In this tutorial, I will hand-wave over exactly how LLMs perform this task, focusing instead on how to adapt such models for use in social science applications. If you are interested in what’s going on under the hood, I highly recommend the following two blog posts for not-too-technical introductions:
In a nutshell, LLMs represent words using embeddings, and are trained to predict the most likely next word in a sequence using a model architecture called the transformer (Vaswani et al. 2017). A powerful feature of this type of model is that the embeddings representing the input sequence are iteratively updated based on which words appear nearby (a process called self-attention). This allows an LLM to flexibly represent words based on their context, which is useful in cases where the same word can mean different things in different contexts. For example, an LLM will represent the word “bill” differently depending on whether it appears in the phrase “sign the bill”, “foot the bill”, or “Hillary and Bill”.
What’s remarkable is that if we have a model that is sufficiently good at predicting the next word in a sequence, we can adapt it to perform all sorts of text-as-data tasks, by creating a prompt that converts our desired task into a next-word prediction problem (Ornstein, Blasingame, and Truscott 2025). To show you how, we will use my promptr R package to create LLM prompts and submit them to OpenAI’s language models.1 Let’s start by getting the R package set up.
The promptr Package

The package is currently available on CRAN. You can install it with the following line of code:
install.packages('promptr')
Next, you will need an account with OpenAI. You can sign up for one here, after which you will need to generate an API key here. I recommend adding this API key as a variable in your operating system environment called OPENAI_API_KEY; that way, you won’t risk leaking it by hard-coding it into your R scripts. The promptr package will automatically look for your API key under that variable name, and will prompt you to enter the API key manually if it can’t find one there. The easiest way to do so is by copy-pasting your API key into the following line of code:
promptr::openai_api_key('<YOUR KEY GOES HERE>', install = TRUE)
Once you’ve done this, you may need to restart your R session. After that, we can begin.
The workhorse function of the promptr package is complete_prompt(). This function submits a sequence of words (the prompt) to the OpenAI API, and returns a dataframe with the five highest-probability next-word predictions and their associated probabilities.2
library(promptr)
complete_prompt(prompt = 'My favorite food is')
token probability
1 pizza 0.21228811
2 sushi 0.04108451
3 pasta 0.03349724
4 0.02689665
5 a 0.02677079
If you prefer the model to auto-regressively generate sequences of text instead of outputting the next-word probabilities, set the max_tokens argument greater than 1. The function will return a character object with the most likely completion at each point in the sequence.
complete_prompt(prompt = 'My favorite food is',
                max_tokens = 6)
[1] " pizza. I love the combination"
If we want the LLM to perform a classification task, we can structure the prompt so that the best next-word prediction is the classification we want (a process called adaptation). Consider, for example, the following prompt.
prompt <- 'Decide whether the following statement is happy, sad, or neither.\n\nText: I feel happy.\nClassification:'
cat(prompt)
Decide whether the following statement is happy, sad, or neither.
Text: I feel happy.
Classification:
complete_prompt(prompt)
token probability
1 Happy 0.43092548
2 happy 0.36317128
3 0.12109015
4 Happy 0.03856193
5 happy 0.01665111
The result is a dataframe with the most likely completions and their associated probabilities. The correct classification – happy – is assigned the highest probability, though note that some post-processing is necessary to combine the “happy” and “Happy” responses (which differ only in capitalization and leading whitespace).
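For example, one way to pool those variants is a quick dplyr summary. This is a minimal sketch, assuming the prompt object defined above (dplyr and stringr are not yet loaded at this point in the tutorial):

library(dplyr)
library(stringr)

# pool probabilities across case and whitespace variants of each token
complete_prompt(prompt) |>
  mutate(token = str_to_upper(str_trim(token))) |>
  group_by(token) |>
  summarise(probability = sum(probability))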
A few-shot prompt includes several completed examples in the text of the prompt to demonstrate the desired output. Prompts structured this way tend to perform significantly better than zero-shot prompts with no examples (Brown et al. 2020).
prompt <- 'Decide whether the following statement is happy, sad, or neither.\n\nText: What should we do today?\nClassification: Neither\n\nText: My puppy is so cute today.\nClassification: Happy\n\nText: The news is bumming me out.\nClassification: Sad\n\nText: I feel happy.\nClassification:'
cat(prompt)
Decide whether the following statement is happy, sad, or neither.
Text: What should we do today?
Classification: Neither
Text: My puppy is so cute today.
Classification: Happy
Text: The news is bumming me out.
Classification: Sad
Text: I feel happy.
Classification:
complete_prompt(prompt)
token probability
1 Happy 0.9878519141
2 Happy 0.0089504721
3 Neither 0.0015561093
4 happy 0.0003720904
5 Happiness 0.0002116472
Manually typing prompts with multiple few-shot examples can be tedious and error-prone, particularly when performing the sort of context-specific prompting we recommend in our paper (Ornstein, Blasingame, and Truscott 2025). The format_prompt() function is a useful tool to aid in that process.
The function is designed with classification problems in mind. If you input the text you would like to classify along with a set of instructions, the default prompt template looks like this:
library(promptr)
format_prompt(text = 'I am feeling happy today.',
              instructions = 'Decide whether this statement is happy, sad, or neither.')

Decide whether this statement is happy, sad, or neither.
Text: I am feeling happy today.
Classification:
You can customize the template using glue syntax, with placeholders for the text you want to classify ({text}) and the desired label ({label}).
format_prompt(text = 'I am feeling happy today.',
              instructions = 'Decide whether this statement is happy or sad.',
              template = 'Statement: {text}\nSentiment: {label}')

Decide whether this statement is happy or sad.
Statement: I am feeling happy today.
Sentiment:
This is particularly useful when including few-shot examples in the prompt. If you input these examples as a tidy dataframe with the columns text and label, the format_prompt() function will paste them into the prompt according to the template. To illustrate, let’s classify the sentiment of a set of tweets about the Supreme Court of the United States, a dataset included with the promptr package.
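For instance, a tiny examples dataframe with the required structure might look like this (purely illustrative, reusing the statements from the prompt above):

library(tibble)

# few-shot examples need a 'text' column and a 'label' column
examples <- tibble(
  text = c('My puppy is so cute today.',
           'The news is bumming me out.',
           'What should we do today?'),
  label = c('Happy', 'Sad', 'Neither')
)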
We can format our few-shot prompt template using format_prompt(), leaving the {TWEET} placeholder for the text we want to classify:
library(tidyverse)
library(glue)
prompt <- format_prompt(text = '{TWEET}',
                        instructions = 'Classify the sentiment of these tweets as Positive, Neutral, or Negative.',
                        examples = scotus_tweets_examples |>
                          filter(case == 'masterpiece'),
                        template = 'Tweet: {text}\nSentiment: {label}')
prompt
Classify the sentiment of these tweets as Positive, Neutral, or Negative.
Tweet: Thank you Supreme Court I take pride in your decision!!!!✝️ #SCOTUS
Sentiment: Positive
Tweet: Supreme Court rules in favor of Colorado baker! This day is getting better by the minute!
Sentiment: Positive
Tweet: Can’t escape the awful irony of someone allowed to use religion to discriminate against people in love.
Not my Jesus.
#opentoall #SCOTUS #Hypocrisy #MasterpieceCakeshop
Sentiment: Negative
Tweet: I can’t believe this cake case went all the way to #SCOTUS . Can someone let me know what cake was ultimately served at the wedding? Are they married and living happily ever after?
Sentiment: Neutral
Tweet: Supreme Court rules in favor of baker who would not make wedding cake for gay couple
Sentiment: Neutral
Tweet: #SCOTUS set a dangerous precedent today. Although the Court limited the scope to which a business owner could deny services to patrons, the legal argument has been legitimized that one's subjective religious convictions trump (no pun intended) #humanrights. #LGBTQRights
Sentiment: Negative
Tweet: {TWEET}
Sentiment:
Using the glue() function, we can insert tweets into that template.
TWEET <- scotus_tweets$text[42]
TWEET
[1] "This Supreme Court ruling highlights why there needs to be term limits for scotus, healthcare needs to uncoupled from employment, Christianity brings nothing but evil but, most importantly it shows you why voting at every level matters"
glue(prompt)
Classify the sentiment of these tweets as Positive, Neutral, or Negative.
Tweet: Thank you Supreme Court I take pride in your decision!!!!✝️ #SCOTUS
Sentiment: Positive
Tweet: Supreme Court rules in favor of Colorado baker! This day is getting better by the minute!
Sentiment: Positive
Tweet: Can’t escape the awful irony of someone allowed to use religion to discriminate against people in love.
Not my Jesus.
#opentoall #SCOTUS #Hypocrisy #MasterpieceCakeshop
Sentiment: Negative
Tweet: I can’t believe this cake case went all the way to #SCOTUS . Can someone let me know what cake was ultimately served at the wedding? Are they married and living happily ever after?
Sentiment: Neutral
Tweet: Supreme Court rules in favor of baker who would not make wedding cake for gay couple
Sentiment: Neutral
Tweet: #SCOTUS set a dangerous precedent today. Although the Court limited the scope to which a business owner could deny services to patrons, the legal argument has been legitimized that one's subjective religious convictions trump (no pun intended) #humanrights. #LGBTQRights
Sentiment: Negative
Tweet: This Supreme Court ruling highlights why there needs to be term limits for scotus, healthcare needs to uncoupled from employment, Christianity brings nothing but evil but, most importantly it shows you why voting at every level matters
Sentiment:
We can then pipe that prompt into complete_prompt() to output the desired classification:
prompt |>
  glue() |>
  complete_prompt()
token probability
1 Negative 0.9034858269
2 Neutral 0.0913403445
3 Positive 0.0022436775
4 Negative 0.0020078616
5 Neutral 0.0005211055
As we saw in the sentiment analysis tutorial, conventional methods for classifying sentiment perform pretty poorly on text from social media. Does this approach perform any better? To find out, let’s complete the prompt for each of the tweets in the scotus_tweets dataframe related to the Masterpiece Cakeshop ruling, create a measure of sentiment, and compare it against the human coders.
First, create our prompt template with format_prompt(), this time with a bit more detail in the instructions. For few-shot examples, we’ll use the ‘masterpiece’ tweets in scotus_tweets_examples. These are six tweets that were unanimously coded by three human annotators as Positive, Negative, or Neutral (two per category).
prompt <- format_prompt(text = '{TWEET}',
                        instructions = 'Read these tweets posted the day after the US Supreme Court ruled in favor of a baker who refused to bake a wedding cake for a same-sex couple. For each tweet, decide whether its sentiment is Positive, Neutral, or Negative.',
                        examples = scotus_tweets_examples |>
                          filter(case == 'masterpiece'),
                        template = 'Tweet: {text}\nSentiment: {label}')
cat(prompt)
Read these tweets posted the day after the US Supreme Court ruled in favor of a baker who refused to bake a wedding cake for a same-sex couple. For each tweet, decide whether its sentiment is Positive, Neutral, or Negative.
Tweet: Thank you Supreme Court I take pride in your decision!!!!✝️ #SCOTUS
Sentiment: Positive
Tweet: Supreme Court rules in favor of Colorado baker! This day is getting better by the minute!
Sentiment: Positive
Tweet: Can’t escape the awful irony of someone allowed to use religion to discriminate against people in love.
Not my Jesus.
#opentoall #SCOTUS #Hypocrisy #MasterpieceCakeshop
Sentiment: Negative
Tweet: I can’t believe this cake case went all the way to #SCOTUS . Can someone let me know what cake was ultimately served at the wedding? Are they married and living happily ever after?
Sentiment: Neutral
Tweet: Supreme Court rules in favor of baker who would not make wedding cake for gay couple
Sentiment: Neutral
Tweet: #SCOTUS set a dangerous precedent today. Although the Court limited the scope to which a business owner could deny services to patrons, the legal argument has been legitimized that one's subjective religious convictions trump (no pun intended) #humanrights. #LGBTQRights
Sentiment: Negative
Tweet: {TWEET}
Sentiment:
Next, we’ll create a function that takes the dataframe of next-word predictions from GPT-3 and converts it into a measure of sentiment. For this tutorial, our measure will be \(P(\text{positive}) - P(\text{negative})\).
sentiment_score <- function(df){
  
  p_positive <- df |>
    # put responses in all caps
    mutate(response = str_to_upper(token)) |>
    # keep and sum probabilities assigned to "POS"
    filter(str_detect(response, 'POS')) |>
    pull(probability) |>
    sum()
  
  p_negative <- df |>
    # put responses in all caps
    mutate(response = str_to_upper(token)) |>
    # keep and sum probabilities assigned to "NEG"
    filter(str_detect(response, 'NEG')) |>
    pull(probability) |>
    sum()
  
  return(p_positive - p_negative)
}
We can add this function to the end of the pipeline we developed before.
prompt |>
  glue() |>
  complete_prompt() |>
  sentiment_score()
[1] -0.9355278
With these elements in place, let’s loop through the scotus_tweets dataset and estimate sentiment scores for each tweet referencing Masterpiece Cakeshop. Be mindful before you run this code. At current prices, OpenAI will charge $1.50 per million input tokens, which works out to approximately 1/30 of a cent per prompt, for a total of roughly $0.30 if you classify all 945 tweets.
Let’s create a formatted few-shot prompt for each tweet about the Masterpiece Cakeshop decision, and return the sentiment score for each.
masterpiece_tweets <- scotus_tweets |>
  filter(case == 'masterpiece')

instructions <- 'Read these tweets posted the day after the US Supreme Court ruled in favor of a baker who refused to bake a wedding cake for a same-sex couple (Masterpiece Cakeshop, 2018). For each tweet, decide whether its sentiment is Positive, Neutral, or Negative.'

masterpiece_examples <- scotus_tweets_examples |>
  filter(case == 'masterpiece')

masterpiece_tweets$prompt <- format_prompt(text = masterpiece_tweets$text,
                                           instructions = instructions,
                                           examples = masterpiece_examples)
masterpiece_tweets$prompt[3]
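Before submitting anything, it is worth sanity-checking that cost estimate. Here is a rough sketch, assuming the common heuristic of roughly four characters per token:

# approximate total input tokens across all prompts
total_tokens <- sum(nchar(masterpiece_tweets$prompt)) / 4

# estimated cost in dollars at $1.50 per million input tokens
total_tokens * 1.50 / 1e6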
Now submit the entire list of prompts to complete_prompt():
masterpiece_tweets$out <- complete_prompt(masterpiece_tweets$prompt)
The estimated probability distribution for each completion is now a list of dataframes in the out column. We can compute our sentiment score for each one.
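One way to do so is to map sentiment_score() over that list column. This is a minimal sketch using purrr’s map_dbl(), which loads with the tidyverse:

# apply sentiment_score() to each dataframe in the 'out' list column
masterpiece_tweets$score <- map_dbl(masterpiece_tweets$out, sentiment_score)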
Plotting those sentiment scores against the average of the hand-coded scores, we can see that this measure is much better than the one from the dictionary method we tried before. The correlation between the two scores is 0.79.
ggplot(data = masterpiece_tweets,
       mapping = aes(x = (expert1 + expert2 + expert3) / 3,
                     y = score)) +
  geom_jitter(width = 0.1) +
  labs(x = 'Hand-Coded Sentiment Score',
       y = 'GPT-3.5 Sentiment Score') +
  theme_minimal()
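To verify the correlation reported above, here is a quick sketch (use = 'complete.obs' simply drops any tweets without a score):

with(masterpiece_tweets,
     cor((expert1 + expert2 + expert3) / 3, score, use = 'complete.obs'))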
All in all, few-shot prompting an LLM is a powerful method for classifying documents, without the extensive fine-tuning or large training datasets that supervised learning methods require.
In the OCR tutorial, we worked with a newspaper clipping about the sinking of the Titanic.
Let’s focus for now on the first column.
library(magick)

# 'image' is the scanned newspaper clipping loaded in the OCR tutorial
first_column <- image_crop(image,
                           geometry = '336x660+0+0')
first_column
Though a human can easily read and interpret this text, converting the image to plain text is a non-trivial problem in the field of computer vision. Note that the text is slightly tilted in places, and some lines are squished or cut off. A smudge obscures some letters in the lower left corner.
How well does OCR capture the text from that image?
Titanic Sank at 2:20 A. M. Monday.
In the White Star offices the hope
was held out all day that the Parisian
ond the Virginian had taken off some
-{ the Titanic's passengers, and efforts
were made to get into communication
with these liners, Until such commu-
nication was established the White
Star officials reused tu recusiiee ce
possibility that there were none of the
Titanic’s passengers aboard them.
But by nightfall came the message
from Capt. Haddock of the Olympic to
Cape Race,’ Newfoundland, telling of
the foundering of the Titanic and of
the rescue of 655 of her passengers by
the Cunarder Carpathia, which, the
Wireless message waid, reached the pos!-
tion of the Titanic at daybreak. All
they found there, however, was lif¢-
boats and wreckage. The biggest ship
in the world had sunk at 2:20 o'clock
yesterday morning.
Mr, Franklin admitted late last ulght
thet the Parisian and the Virginian,
though they were among the first to
answer the Titanic’s calls for help,
could not have reached the scene before
10 o'clock yesterday morning, seven
and a half hours after the big Titanic
buried her nese beneath the waves and
Pitched downward out of sight. The
Carpathia, so the wireless dispatch
m Capt. Haddock to Cape Race an-.
jounced, reached the scene of the Ti-,
‘sanic's foundering at daybreak, several
As is common with OCR, the result is generally quite good, but not perfect. Notice, for example, the phrases “reused tu recusiiee ce”, “lif¢-boats”, and “Ti-,’sanic’s”. Look closely at the letters in the image above to see how the OCR algorithm may have made those mistakes. When you consider the image letter-by-letter, ignoring the semantic context in which those letters are placed, it is easy to mistake some letters for others. But when humans read a passage like this, they’re not reading it letter-by-letter. Instead, fluent readers learn to recognize whole words at a time, using their knowledge of the language to “predict” what the constituent letters of a word must be, even when they are difficult to make out on the page. In the same way, we can make use of the fact that LLMs are very good at predicting words in sequences to impute what the phrase “reused tu recusiiee ce” is likely to have been, given the context.
To do so, we can prompt the LLM like so:
prompt <- glue('Correct OCR errors in the following passage.\n---\nOriginal Passage:\n\n{text}\n---\nCorrected Passage:\n\n')
cat(prompt)
Correct OCR errors in the following passage.
---
Original Passage:
Titanic Sank at 2:20 A. M. Monday.
In the White Star offices the hope
was held out all day that the Parisian
ond the Virginian had taken off some
-{ the Titanic's passengers, and efforts
were made to get into communication
with these liners, Until such commu-
nication was established the White
Star officials reused tu recusiiee ce
possibility that there were none of the
Titanic’s passengers aboard them.
But by nightfall came the message
from Capt. Haddock of the Olympic to
Cape Race,’ Newfoundland, telling of
the foundering of the Titanic and of
the rescue of 655 of her passengers by
the Cunarder Carpathia, which, the
Wireless message waid, reached the pos!-
tion of the Titanic at daybreak. All
they found there, however, was lif¢-
boats and wreckage. The biggest ship
in the world had sunk at 2:20 o'clock
yesterday morning.
Mr, Franklin admitted late last ulght
thet the Parisian and the Virginian,
though they were among the first to
answer the Titanic’s calls for help,
could not have reached the scene before
10 o'clock yesterday morning, seven
and a half hours after the big Titanic
buried her nese beneath the waves and
Pitched downward out of sight. The
Carpathia, so the wireless dispatch
m Capt. Haddock to Cape Race an-.
jounced, reached the scene of the Ti-,
‘sanic's foundering at daybreak, several
---
Corrected Passage:
cleaned_text <- complete_prompt(
  prompt = prompt,
  max_tokens = nchar(text))
cat(cleaned_text)
Titanic sank at 2:20 A.M. Monday.
In the White Star offices, the hope
was held out all day that the Parisian
and the Virginian had taken off some
of the Titanic's passengers, and efforts
were made to get into communication
with these liners. Until such communication was established, the White
Star officials refused to acknowledge the possibility that there were none of the Titanic's passengers aboard them.
But by nightfall, the message came
from Capt. Haddock of the Olympic to
Cape Race, Newfoundland, telling of
the foundering of the Titanic and of
the rescue of 655 of her passengers by
the Cunarder Carpathia, which, the
wireless message said, reached the position of the Titanic at daybreak. All
they found there, however, were lifeboats and wreckage. The biggest ship in the world had sunk at 2:20 o'clock yesterday morning.
Mr. Franklin admitted late last night
that the Parisian and the Virginian,
though they were among the first to
answer the Titanic's calls for help,
could not have reached the scene before
10 o'clock yesterday morning, seven
and a half hours after the big Titanic
buried her nose beneath the waves and
pitched downward out of sight. The
Carpathia, according to the wireless dispatch from Capt. Haddock to Cape Race,
reached the scene of the Titanic's foundering at daybreak, several hours after the sinking.
The resulting text is a nearly perfect transcription, despite the garbled inputs!
One of the most labor-intensive tasks in a research workflow involves converting unstructured text to structured datasets. Traditionally, this is a task that has only been suitable for human research assistants, but with LLMs, a truly automated workflow is feasible. Consider the following prompt:
Create a data table from the following passage.
---
The little dog laughed to see such fun, and the dish ran away with the spoon.
---
Data Table:
Character | What They Did
---|---
When we submit this prompt to ChatGPT, it yields a data table delimited by vertical bars, which can then be read into a dataframe.
prompt <- 'Create a data table from the following passage.\n---\nThe little dog laughed to see such fun, and the dish ran away with the spoon.\n---\nData Table:\nCharacter | What They Did \n---|---\n'
response <- complete_prompt(prompt,
                            max_tokens = 100)
response
[1] "Little Dog | Laughed \nDish | Ran away \nSpoon | Ran away"
df <- read_delim(response,
                 delim = '|',
                 col_names = FALSE)
df
# A tibble: 3 × 2
X1 X2
<chr> <chr>
1 "Little Dog " " Laughed "
2 "Dish " " Ran away "
3 "Spoon " " Ran away"
Take a sample of the Senate press releases from the clustering tutorial. Format a few-shot ChatGPT prompt that returns a list of topics. Are the topic labels sensible? Are they similar to what we came up with using unsupervised methods?
Ask ChatGPT to summarize the remarks on pages 4-5 of the PDF that you imported in the OCR practice problems.
Like many others, I have qualms about using proprietary, closed-source models for scientific research (Spirling 2023). But as of summer 2025, the ease of use and capabilities of OpenAI’s models are sufficiently beyond those of comparable open-source models that it makes sense to start here.↩︎
Note that the default model used by complete_prompt() is “gpt-3.5-turbo-instruct”, essentially the model underlying the original ChatGPT. You can prompt different model variants using the model argument.↩︎