Application Programming Interfaces (APIs)

When webscraping is too difficult and/or impolite

In the webscraping tutorial, we harvested text directly from the HTML code of a webpage. This can be pretty laborious, so fortunately, some websites provide an easier path to collecting their data, called an Application Programming Interface (API). Generally, when an API is available, using it is the easiest and most polite way to harvest your text data. On this page, we’ll introduce ourselves to APIs by collecting bill text from the US Congress and video transcripts from YouTube.

Congress API

Suppose we wanted to collect the full text of bills introduced in the US Congress. As a running example, let’s try to load the text of “The Tax Cuts and Jobs Act” (HR 1, 115th Congress), which you can find here. Although we can access that page through our web browser, if I try to read the HTML in R

library(tidyverse)
library(rvest)

page <- read_html("https://www.congress.gov/bill/115th-congress/house-bill/1/text")

…I get a foreboding “HTTP Error 403 - Forbidden” message. It turns out the US Federal Government does not take kindly to bots scraping their webpages.

Fortunately, the good people at congress.gov provide an API that we can use instead. An API endpoint is, in essence, a special web address that returns data instead of a webpage. For congress.gov, these web addresses all begin with https://api.congress.gov/v3/. But to access them, we’ll need an API key, a unique password that identifies each user. You can sign up for one at the top of the documentation page.

Once you have your API key, you will include it as part of the web address you use to access information. Per the API documentation, you can retrieve the text of bills with a web address in the following format:

https://api.congress.gov/v3/bill/{congress}/{billType}/{billNumber}/text?api_key={INSERT_KEY}

Wherever you see curly braces {}, replace them with values for the bill you want. In our example, those values will be {congress} = 115, {billType} = hr, and {billNumber} = 1. Rather than accessing this web address through the browser, we will ask R to read the data from the API endpoint directly.

Step 1: Keep Your API Key Safe

You will be tempted to copy-paste your API key directly into your R script. Avoid this temptation. It’s best practice to keep things like passwords and API keys saved in a separate location, not hard-coded into your scripts. That way, if you share your code (like I’m doing now), you don’t accidentally reveal your secrets. I’m keeping my API key in a text file called congress-api-key.txt. The first step is to read it into memory.

# read the key and strip any trailing newline or spaces
api_key <- read_file('congress-api-key.txt') |> str_trim()
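A practical wrinkle with key files: most text editors end a file with a newline, and that invisible character can sneak into the web address. Here is a minimal sketch of the problem using base R only. The file and the key `demo-key-123` are made up for illustration; nothing here touches the real API.

```r
# Write a placeholder key to a temporary file (writeLines adds a trailing newline)
key_file <- tempfile(fileext = ".txt")
writeLines("demo-key-123", key_file)

# Read the file exactly as stored, newline included
raw_key <- readChar(key_file, nchars = file.info(key_file)$size)

# trimws() removes leading and trailing whitespace, including newlines
clean_key <- trimws(raw_key)

nchar(raw_key) > nchar(clean_key)
clean_key
```

If the two lengths differ, the raw key carried extra whitespace, so trimming before building the URL is a cheap safeguard.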

Step 2: Get the data from the API

Next, we’ll use the glue package to format web addresses:

library(glue)

congress <- 115
billType <- 'hr'
billNumber <- 1

# glue() replaces each {name} in curly braces with the value of that variable
glue('/bill/{congress}/{billType}/{billNumber}/text')
/bill/115/hr/1/text
url <- glue('https://api.congress.gov/v3/bill/{congress}/{billType}/{billNumber}/text?api_key={api_key}')

Once we’ve formatted the web address, we can use the httr package to get data from the API.

library(httr)
d <- GET(url)
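Before moving on, it can be worth checking whether the request actually succeeded. With httr, the numeric HTTP status of a response can be read with status_code(d). The helper below is a hypothetical sketch (describe_status() is my own name, not part of httr) that turns a status code into a readable message, using the general meanings of HTTP codes rather than anything specific to congress.gov.

```r
# Hypothetical helper: translate an HTTP status code into a short message
describe_status <- function(code) {
  if (code >= 200 && code < 300) {
    "success"
  } else if (code == 403) {
    "forbidden: the key may be missing or invalid"
  } else if (code == 429) {
    "too many requests: pause before retrying"
  } else {
    "request failed"
  }
}

# With a real response you could call describe_status(status_code(d))
describe_status(200)
describe_status(403)
```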

Step 3: Convert the content from JSON to an R object

The object d is a response object whose content field holds JSON data stored as raw bytes. If you don’t know what JSON is, not to worry! We can just use the jsonlite package to convert it into an R object.

library(jsonlite)

# take the content attribute from d
content <- d$content |>
  # convert the raw bytes to a character string
  rawToChar() |>
  # convert the JSON format to an R object
  fromJSON()

Now we have an R object (a list) called content. Sadly, this is not yet the content we want. The wrinkle here is that there are 8 versions of the text of this bill, beginning with the first version that was introduced in the House and ending with the final version that was signed into law. The content$textVersions object tells us where to find the full text of each version.

content$textVersions
                  date
1                 <NA>
2 2017-12-20T05:00:00Z
3 2017-12-14T05:00:00Z
4 2017-11-28T05:00:00Z
5 2017-11-16T05:00:00Z
6 2017-11-13T05:00:00Z
7 2017-11-02T04:00:00Z
8 2017-12-22T05:00:00Z
                                                                                                                                                                                                                            formats
1            Formatted Text, PDF, Formatted XML, https://www.congress.gov/115/bills/hr1/BILLS-115hr1enr.htm, https://www.congress.gov/115/bills/hr1/BILLS-115hr1enr.pdf, https://www.congress.gov/115/bills/hr1/BILLS-115hr1enr.xml
2         Formatted Text, PDF, Formatted XML, https://www.congress.gov/115/bills/hr1/BILLS-115hr1eas2.htm, https://www.congress.gov/115/bills/hr1/BILLS-115hr1eas2.pdf, https://www.congress.gov/115/bills/hr1/BILLS-115hr1eas2.xml
3            Formatted Text, PDF, Formatted XML, https://www.congress.gov/115/bills/hr1/BILLS-115hr1eas.htm, https://www.congress.gov/115/bills/hr1/BILLS-115hr1eas.pdf, https://www.congress.gov/115/bills/hr1/BILLS-115hr1eas.xml
4            Formatted Text, PDF, Formatted XML, https://www.congress.gov/115/bills/hr1/BILLS-115hr1pcs.htm, https://www.congress.gov/115/bills/hr1/BILLS-115hr1pcs.pdf, https://www.congress.gov/115/bills/hr1/BILLS-115hr1pcs.xml
5               Formatted Text, PDF, Formatted XML, https://www.congress.gov/115/bills/hr1/BILLS-115hr1eh.htm, https://www.congress.gov/115/bills/hr1/BILLS-115hr1eh.pdf, https://www.congress.gov/115/bills/hr1/BILLS-115hr1eh.xml
6               Formatted Text, PDF, Formatted XML, https://www.congress.gov/115/bills/hr1/BILLS-115hr1rh.htm, https://www.congress.gov/115/bills/hr1/BILLS-115hr1rh.pdf, https://www.congress.gov/115/bills/hr1/BILLS-115hr1rh.xml
7               Formatted Text, PDF, Formatted XML, https://www.congress.gov/115/bills/hr1/BILLS-115hr1ih.htm, https://www.congress.gov/115/bills/hr1/BILLS-115hr1ih.pdf, https://www.congress.gov/115/bills/hr1/BILLS-115hr1ih.xml
8 Formatted Text, PDF, Formatted XML, https://www.congress.gov/115/plaws/publ97/PLAW-115publ97.htm, https://www.congress.gov/115/plaws/publ97/PLAW-115publ97.pdf, https://www.congress.gov/115/plaws/publ97/PLAW-115publ97_uslm.xml
                        type
1              Enrolled Bill
2 Engrossed Amendment Senate
3 Engrossed Amendment Senate
4  Placed on Calendar Senate
5         Engrossed in House
6          Reported in House
7        Introduced in House
8                 Public Law
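To see why content$textVersions comes back as a data frame, it may help to run the same conversion on a tiny, made-up JSON string. The sketch below is not real API output: the three records are a simplified stand-in, and the names mock_json and mock_content are mine. It also shows one way to turn the date strings into proper date-times, assuming they follow the pattern seen in the table above.

```r
library(jsonlite)

# A simplified, made-up stand-in for the JSON an API might return
mock_json <- '{"textVersions": [
  {"type": "Enrolled Bill",       "date": null},
  {"type": "Introduced in House", "date": "2017-11-02T04:00:00Z"},
  {"type": "Public Law",          "date": "2017-12-22T05:00:00Z"}
]}'

# Mimic the response: raw bytes in, R list out
mock_content <- charToRaw(mock_json) |>
  rawToChar() |>
  fromJSON()

# An array of JSON records becomes a data frame; null becomes NA
versions <- mock_content$textVersions

# Parse the timestamps so they can be compared as dates rather than text
versions$parsed <- as.POSIXct(versions$date,
                              format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# which.max() skips the NA and picks the latest dated version
latest_type <- versions$type[which.max(as.numeric(versions$parsed))]
latest_type
```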

Step 4: Get The Bill Text

Let’s get the text for the latest version of the bill (the version that became law).

mostRecent <- content$textVersions |> 
  # keep the row with the most recent date
  slice_max(date, n = 1) |> 
  # pull the formats column
  pull(formats)

mostRecent
[[1]]
            type
1 Formatted Text
2            PDF
3  Formatted XML
                                                                url
1      https://www.congress.gov/115/plaws/publ97/PLAW-115publ97.htm
2      https://www.congress.gov/115/plaws/publ97/PLAW-115publ97.pdf
3 https://www.congress.gov/115/plaws/publ97/PLAW-115publ97_uslm.xml
# get the URL for the most recent Formatted Text
textURL <- mostRecent[[1]] |> 
  filter(type == 'Formatted Text') |> 
  pull(url)

# read the text
text <- read_file(textURL)

Finally! The object text contains the entire text of the bill. Printing this in its entirety would be too long, but here’s a snippet:

text |> 
  substr(1,1000) |> 
  cat()
<html><body><pre>
[115th Congress Public Law 97]
[From the U.S. Government Publishing Office]



[[Page 2053]]

                                     

                                     

                                     

                                     

                                     

    NOTE: Public Law 115-97 is re-printed to remove the editorial 
description contained on the original half title page of the printed 
Slip law.

&lt;star&gt; (Star Print)

[[Page 131 STAT. 2054]]

Public Law 115-97
115th Congress

                                 An Act


 
    To provide for reconciliation pursuant to titles II and V of the 
 concurrent resolution on the budget for fiscal year 2018. &lt;&lt;NOTE: Dec. 
                         22, 2017 -  [H.R. 1]&gt;&gt; 

    Be it enacted by the Senate and House of Representatives of the 
United States of America in Congress assembled,

                                 TITLE I

SECTION 11000. SHORT TITLE, ETC.

    (a) Amend
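Notice that the snippet still carries some HTML residue: wrapper tags like <html><body><pre> and escaped characters like &lt; and &gt;. If you want plain text, one rough approach is to strip the wrappers and decode the escapes. The function below is a hypothetical sketch using base R (clean_bill_text() is my own name), and it only handles the handful of tags and entities visible in the snippet above.

```r
# Hypothetical helper: remove wrapper tags and decode a few HTML entities
clean_bill_text <- function(txt) {
  # drop the opening and closing html, body, and pre tags
  txt <- gsub("</?(html|body|pre)>", "", txt)
  # decode escaped angle brackets
  txt <- gsub("&lt;", "<", txt, fixed = TRUE)
  txt <- gsub("&gt;", ">", txt, fixed = TRUE)
  # decode ampersands last so "&amp;lt;" is not decoded twice
  txt <- gsub("&amp;", "&", txt, fixed = TRUE)
  trimws(txt)
}

sample_text <- "<html><body><pre>\n&lt;star&gt; (Star Print)\n</pre></body></html>"
cleaned <- clean_bill_text(sample_text)
cleaned
```

The tags are removed before the entities are decoded, so text such as <star> that was escaped in the original survives the cleaning.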

Putting Steps 1 through 4 into a function

Steps 1 through 4 make up the entire workflow for retrieving the text of a bill from the congress.gov API. If we ever plan to use that workflow again, it would be wise to encode it as a function. That way we’re not copy-pasting large blocks of code every time we want to get the text of a new bill. Let’s create a new function called get_bill_text(). This function will take as inputs the congress, billType and billNumber of a bill, complete Steps 1-4, and return the full text of the bill we want.

get_bill_text <- function(congress, billType, billNumber){
  
  # Step 1: Get the API Key
  api_key <- read_file('congress-api-key.txt') |> str_trim()
  
  # Step 2: Read the data from the API
  url <- glue('https://api.congress.gov/v3/bill/{congress}/{billType}/{billNumber}/text?api_key={api_key}')
  
  d <- GET(url)
  
  # Step 3: Convert the JSON object to an R list
  content <- d$content |>
    rawToChar() |>
    fromJSON()
  
  # Step 4: Get the most recent bill text
  mostRecent <- content$textVersions |>
    slice_max(date, n = 1) |>
    pull(formats)

  textURL <- mostRecent[[1]] |> 
    filter(type == 'Formatted Text') |> 
    pull(url)

  text <- read_file(textURL)
  
  # Return the text object
  return(text)
}

Does the function work? If so, it should get the same text object we created before:

text == get_bill_text(congress = 115, billType = 'hr', billNumber = 1)
[1] TRUE

Play with the function a bit, changing the inputs to make sure that it returns the correct text when you ask for a different bill. And see the practice problems at the end of the page to get more practice working with the congress.gov API.
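As you experiment, you will eventually ask for a bill that does not exist, and the function will stop with an error. If you would rather get a missing value back, one option is to wrap the call in tryCatch(). The sketch below uses a stand-in function, fake_fetch(), so that it runs without an API key; both fake_fetch() and safely_get_text() are hypothetical names of my own.

```r
# Stand-in for get_bill_text(): pretends only bill number 1 exists
fake_fetch <- function(congress, billType, billNumber) {
  if (billNumber != 1) stop("bill not found")
  "text of the bill"
}

# Run any fetching function; return NA instead of stopping on an error
safely_get_text <- function(fetch_fun, ...) {
  tryCatch(
    fetch_fun(...),
    error = function(e) {
      message("Could not get text: ", conditionMessage(e))
      NA_character_
    }
  )
}

ok  <- safely_get_text(fake_fetch, congress = 115, billType = 'hr', billNumber = 1)
bad <- safely_get_text(fake_fetch, congress = 115, billType = 'hr', billNumber = 99999)
```

In real use you would pass get_bill_text in place of fake_fetch.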

YouTube API

In this next tutorial, we’ll see how to retrieve YouTube video transcripts using its API and the youtube-transcript-api Python module. This Python module allows us to avoid all the tedious work we did formatting web addresses for the congress.gov API. We just need a way to run Python modules from R. Fortunately, the reticulate package does just that.

How to Drive Python from RStudio

For R users, the reticulate package is a convenient way to run Python code and return outputs as R objects. The setup will be a little bit different depending on whether you’re working with a PC or Mac.

Windows PC Setup

If you’re on a Windows machine, first install the reticulate package through the R console, then install a version of Python, using the following two lines of code:

install.packages('reticulate')
reticulate::install_miniconda()

Mac Setup

If you are a Mac user, follow the instructions for Windows PCs above, then create a Python “virtual environment” using the following command in the R console.

reticulate::conda_create('myenv')

Next, restart your R session. Then enter the following command to use the virtual environment you just created.

reticulate::use_condaenv('myenv')

Install the Python package

Regardless of your operating system, you’ll now want to install the youtube-transcript-api Python module, using the following function from reticulate.

reticulate::py_install('youtube-transcript-api')

Setup is done! Let’s play.

Step 1: Get the video ID

Suppose we want the transcript of this YouTube video:

You can click the CC button in the bottom right of the video to see YouTube’s auto-generated transcript. Note that the quality of these transcripts can vary significantly, depending on the quality of the audio.

To get the transcript using R, we’ll need the video ID, which you can find at the end of the video’s URL, after v= (here, mLyOj_QD4a4):

https://www.youtube.com/watch?v=mLyOj_QD4a4
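If you have a lot of URLs, copying IDs by hand gets old quickly. Here is a small sketch that pulls the ID out with a regular expression. It assumes the URL uses the watch?v= pattern shown above (other YouTube link styles are not handled), and extract_video_id() is a hypothetical name of my own.

```r
# Hypothetical helper: capture whatever follows "v=" up to the next "&" or "#"
extract_video_id <- function(url) {
  # if the pattern is not found, sub() returns the input unchanged
  sub(".*[?&]v=([^&#]+).*", "\\1", url)
}

id_plain  <- extract_video_id("https://www.youtube.com/watch?v=mLyOj_QD4a4")
id_params <- extract_video_id("https://www.youtube.com/watch?v=mLyOj_QD4a4&t=10s")
id_plain
```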

Step 2: Create a Python object

Next, we’ll import the Python module, creating an object called youtubecaption.

youtubecaption <- reticulate::import('youtube_transcript_api')

Step 3: Get the transcript

We can then use the Python module to get the transcript of the video we want.

d <- youtubecaption$YouTubeTranscriptApi$get_transcript('mLyOj_QD4a4')

Notice that the object d is a list object with 50 entries. Each element contains a snippet of text from the transcript, along with its timestamp and duration.

d[[1]]
$text
[1] "okay guys"

$start
[1] 3.52

$duration
[1] 2.56
d[[2]]
$text
[1] "uh these eggs have given us a lot of"

$start
[1] 4.56

$duration
[1] 3.52
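Because every element has the same three fields, the list converts neatly into a data frame, which is handy if you care about when things were said. The sketch below rebuilds the first two entries by hand as d_demo (a stand-in for d, so that it runs without Python) and adds an end time for each snippet.

```r
# Stand-in for the first two elements of d
d_demo <- list(
  list(text = "okay guys", start = 3.52, duration = 2.56),
  list(text = "uh these eggs have given us a lot of", start = 4.56, duration = 3.52)
)

# Pull each field out of every element, one column at a time
transcript_df <- data.frame(
  start    = vapply(d_demo, function(x) x$start, numeric(1)),
  duration = vapply(d_demo, function(x) x$duration, numeric(1)),
  text     = vapply(d_demo, function(x) x$text, character(1))
)

# The end time of each snippet is its start plus its duration
transcript_df$end <- transcript_df$start + transcript_df$duration
transcript_df
```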

Step 4 (optional): Paste together the transcript parts

If we would rather have the entire transcript in a single character object, we can loop through the list and paste all the text snippets together, like so:

# start with an empty character object
transcript <- ''

# for each element in the list, paste it onto the existing transcript
for(i in seq_along(d)){
  transcript <- paste(transcript, d[[i]]$text)
}

transcript
[1] " okay guys uh these eggs have given us a lot of trouble in the past uh does anybody need anything off this guy or can we bypass him i think leroy needs something from this guy oh he needs his devout shoulders doesn't isn't he paladin yeah but that'll help him heal better i'll have more manner christ okay uh well what we'll do i'll run in first uh gather up all the eggs we can kind of just you know blast and all down with aoe um i will use intimidating shout to kind of scatter them so we don't have to fight a whole bunch of them at once uh when my shout's done uh i'll need anthony to come in and drop his shout too uh so we can keep him scattered not to fight too many um when his is done bass of course need to run in and do the same thing uh we're getting divine intervention on our mages uh so they can uh ae uh so we can of course get them down fast because we're bringing all these guys i mean we'll be in trouble if we don't take them down quick i think it's a pretty good plan we should be able to pull it off this time uh what do you think abdul can you give me a number crunch real quick uh yeah give me a sec i'm coming up with 32.33 uh repeating of course percentage of survival well it's a lot better than we usually do uh all right comes up ready guys oh my god oh my god god damn it leroy oh we do have a soul stone up don't we oh god for great job leroy you were just stupid as hell at least i have chicken"

The transcript object is now a single character object with the entire video transcript.
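The loop is easy to read, but the same result can be had without one. As an alternative sketch, we can extract every text snippet with vapply() and join them with paste(collapse = ' '). This version also avoids the leading space that the loop leaves at the start of the transcript. Again, d_demo is a hand-built stand-in for d.

```r
# Stand-in for the first two elements of d
d_demo <- list(
  list(text = "okay guys", start = 3.52, duration = 2.56),
  list(text = "uh these eggs have given us a lot of", start = 4.56, duration = 3.52)
)

# Extract the text field from every element as a character vector
snippets <- vapply(d_demo, function(x) x$text, character(1))

# Join the snippets into one string, separated by single spaces
transcript_demo <- paste(snippets, collapse = ' ')
transcript_demo
```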

Putting Steps 1 through 4 into a function

As in the last tutorial, it’s best practice to write a function any time you’ve created a multi-step workflow that you’d like to use more than once.

get_youtube_transcript <- function(video_id, lang = 'en'){

  # 1. create an object from the python package
  youtubecaption <- reticulate::import('youtube_transcript_api')

  # 2. get the transcript from the video you want
  d <- youtubecaption$YouTubeTranscriptApi$get_transcript(video_id,
                                                          languages = c(lang, 'en'))


  # 3. paste together the transcript snippets
  transcript <- ''
  for(i in seq_along(d)){
    transcript <- paste(transcript, d[[i]]$text)
  }

  return(transcript)
}

Notice that I’ve written this function with two inputs: the video ID and an option to create transcripts for videos in different languages. For a list of available language codes, see the ISO 639-1 column here.

To verify that the function works, let’s pull the transcript of the 2023 State of the Union Address.

sotu_transcript <- get_youtube_transcript(video_id = 'gzcBTUvVp7M')

sotu_transcript |> 
  substr(1, 1000) |> 
  cat()
 foreign Mr Speaker the president of the United States [Applause] thank you [Applause] foreign foreign [Applause] foreign foreign [Applause] [Applause] foreign [Applause] members of Congress I have the high privilege and the distinct honor to present to you the president of the United States Mr Speaker thank you you can smile it's okay thank you thank you thank you thank you please Mr Speaker Madam vice president our first lady and second gentleman good to see you guys up there members of Congress by the way chief justice I may need a court order she gets to go to the the game tomorrow next week I have to stay home got to work something out here members of the cabinet leaders of our military chief justice associate Justice and retired Justice Supreme Court and to you my fellow Americans you know I start tonight by congratulating 118th Congress and the new Speaker of the House Kevin McCarthy speaker I don't want to ruin your reputation but I look forward to working with you and I want t

Practice Problems

  1. Get the full text of the Patient Protection and Affordable Care Act (2010).

  2. Create a dataframe with information on several Congressional bills, including columns for congress, billType, billNumber, and full_text. Using the get_bill_text() function we created, collect the full text for at least 10 bills in this dataframe.

  3. Create a function to get the Congressional Research Service’s summary of a bill from the congress.gov API. You’ll need to check the documentation to figure out how to format the URL for the correct API endpoint.

  4. The sotu package is missing some of the most recent State of the Union addresses. See if you can collect their transcripts through YouTube. How is the quality of those transcripts?

Further Reading