Week 10: Data Wrangling

Author

Joe Ornstein

Up to now, we’ve been working with pretty tidy datasets. Every column is a variable, every row is an observation, and every value is where it should be. But things are not always this way. More often than you’re going to like, data comes to you an unruly mess, and you’ll need to tidy it up before you can even start to explore it.

This week, we’ll learn some of the most important functions in the tidyverse for data wrangling. By the end of the week, you will be able to:

Filter the rows of a dataset using conditional logic with the filter() function
Select and rename columns using the select() function
Create new variables and modify existing ones using the mutate() function
Properly format categorical variables, with the help of functions like case_when() and factor()
Perform a basic data merge with left_join()

Reading

R4DS Chapters 3-4
R4DS Chapter 12

Problem Set

This week, we’re going to be summarizing different variables from the CES 2020 dataset. You’ll need to download the dataset (see instructions in the class notes below) and use the codebook to identify which variables contain the information you want, tidy them up, and then summarize. Write an R script that performs these tasks, compile a PDF report, and submit the report.

Here are the questions I’d like you to answer:

Are labor union members more likely to be Democrats or Republicans?
Compare the average age of people with landline phones, compared to those who only have cell phones? Estimate the difference-in-means and report the p-value from a null hypothesis test.
Are people who read the newspaper more likely to correctly answer the questions about who controls the US House of Representatives and US Senate? (Remember, this survey was conducted at the end of 2020.)
Create a bar chart reporting the percent of respondents in each religious group that support making abortion illegal in all circumstances.
Bonus. The CES oversamples some respondents relative to their frequency among registered voters in order to get a better statistical picture of some smaller subgroups. The dataset contains a weight variable for each respondent, reflecting how much that type of respondent is overrepresented in the sample compared to the population. Compare the mean age of the sample with the weighted mean age of the sample. What do you notice? Do the same thing for the unweighted and weighted shares of each racial group. Redo the analyses in questions 1-4 using weighted means. What conclusions, if any, change?

Class Notes

Previously, we have worked with a tidied up version of the Cooperative Election Study (CES) from 2020. To illustrate the process of data wrangling, let’s look at the steps I took to import and clean up that dataset.

Importing The Data

The CES is a survey of a representative sample of 50,000+ American adults, conducted every election year by a large consortium of universities. The data going back to 2006 is available here. Go to that site and download the 2020 dataset, a 180 megabyte monster called CES20_Common_OUTPUT_vv.csv.

The .csv extension at the end of the file means that it is a “comma separated value” file. This is a file where all of the values are separated by commas. Looks something like this:

name, age, party
'Joe', 35, 'Bull Moose'
'Fiona', 6, 'Birthday'
'Avery', 3, 'Know Nothing'

It’s a nice standardized way to represent datasets, and fortunately for us, the read_csv() function allows us to read such files into R.¹ Let’s create an object called ces_raw from that raw dataset.

library(tidyverse)

ces_raw <- read_csv('data/raw/CES20_Common_OUTPUT_vv.csv')

This dataset has a lot of information.

dim(ces_raw)

[1] 61000   717

61,000 rows and 717 columns, to be precise. The first few rows and columns look like this:

head(ces_raw)

# A tibble: 6 × 717
   ...1     caseid commonweight commonpostweight vvweight vvweight_post tookpost
  <dbl>      <dbl>        <dbl>            <dbl>    <dbl>         <dbl>    <dbl>
1     1 1232318697        0.783            0.666    0.851         0.607        2
2     2 1231394915        1.34             1.44    NA            NA            2
3     3 1232451503        0.406            0.342   NA            NA            2
4     4 1232494791        0.958            0.822    1.04          1.00         2
5     5 1232494829        0.195            0.162   NA            NA            2
6     6 1232473675        1.06             0.880    1.15          0.988        2
# ℹ 710 more variables: CCEStake <dbl>, birthyr <dbl>, gender <dbl>,
#   educ <dbl>, race <dbl>, race_other <chr>, hispanic <dbl>,
#   CC20_hisp_1 <dbl>, CC20_hisp_2 <dbl>, CC20_hisp_3 <dbl>, CC20_hisp_4 <dbl>,
#   CC20_hisp_5 <dbl>, CC20_hisp_6 <dbl>, CC20_hisp_7 <dbl>, CC20_hisp_8 <dbl>,
#   CC20_hisp_9 <dbl>, CC20_hisp_10 <dbl>, CC20_hisp_11 <dbl>,
#   CC20_hisp_12 <dbl>, CC20_hisp_t <chr>, CC20_asian_1 <dbl>,
#   CC20_asian_2 <dbl>, CC20_asian_3 <dbl>, CC20_asian_4 <dbl>, …

Recoding Variables

Now that we have the dataset loaded into memory, let’s find the variables we’re interested in keeping and clean them up a bit. For example, gender is coded as a 1 or 2.

ces_raw |> 
  count(gender)

# A tibble: 2 × 2
  gender     n
   <dbl> <int>
1      1 25791
2      2 35209

Let’s make this more readable. The codebook (available at the same site where you downloaded the data) says that 1 is male and 2 is female. We can create a new version of the gender column using the mutate() and if_else() functions.

ces_clean <- ces_raw |> 
  mutate(gender = if_else(gender == 1, 'Male', 'Female'))

ces_clean |> 
  count(gender)

# A tibble: 2 × 2
  gender     n
  <chr>  <int>
1 Female 35209
2 Male   25791

What about the education variable (educ)?

ces_clean |> 
  count(educ)

# A tibble: 6 × 2
   educ     n
  <dbl> <int>
1     1  1983
2     2 16618
3     3 13330
4     4  6539
5     5 14152
6     6  8378

That’s a lot of categories to clean up. We can combine the mutate() function with case_when() to recode a bunch of different values at the same time.

ces_clean <- ces_clean |> 
  mutate(educ = case_when(educ == 1 ~ 'No HS',
                          educ == 2 ~ 'High school graduate',
                          educ == 3 ~ 'Some college',
                          educ == 4 ~ '2-year',
                          educ == 5 ~ '4-year',
                          educ == 6 ~ 'Post-grad'))
ces_clean |> 
  count(educ)

# A tibble: 6 × 2
  educ                     n
  <chr>                <int>
1 2-year                6539
2 4-year               14152
3 High school graduate 16618
4 No HS                 1983
5 Post-grad             8378
6 Some college         13330

Now the variable has the correct labels, but notice that the order in that table is kind of weird. We tend to think of education as an ordinal variable, because there’s a natural order from least to most educated. But R doesn’t know that yet. It just thinks of educ as a character variable, and lists them in alphabetical order. We can tell R about the variable order by recoding educ as a factor.

ces_clean <- ces_clean |> 
  mutate(educ = factor(educ,
                       levels = c('No HS', 
                                  'High school graduate',
                                  'Some college',
                                  '2-year',
                                  '4-year',
                                  'Post-grad')))
ces_clean |> 
  count(educ)

# A tibble: 6 × 2
  educ                     n
  <fct>                <int>
1 No HS                 1983
2 High school graduate 16618
3 Some college         13330
4 2-year                6539
5 4-year               14152
6 Post-grad             8378

Much better!

There’s a variable in CES called birthyr. What if we wanted to know the respondents age in 2020 instead? The mutate() function can help us here too.

ces_clean |> 
  count(birthyr)

# A tibble: 78 × 2
   birthyr     n
     <dbl> <int>
 1    1925     2
 2    1926     6
 3    1927    10
 4    1928    12
 5    1929    23
 6    1930    23
 7    1931    42
 8    1932    55
 9    1933    76
10    1934   109
# ℹ 68 more rows

ces_clean <- ces_clean |> 
  mutate(age = 2020 - birthyr)

ces_clean |> 
  count(birthyr, age)

# A tibble: 78 × 3
   birthyr   age     n
     <dbl> <dbl> <int>
 1    1925    95     2
 2    1926    94     6
 3    1927    93    10
 4    1928    92    12
 5    1929    91    23
 6    1930    90    23
 7    1931    89    42
 8    1932    88    55
 9    1933    87    76
10    1934    86   109
# ℹ 68 more rows

Joining Datasets

Here’s another messy feature of this dataset. Every respondent has a variable called inputstate, representing the state where they live.

ces_clean |> 
  count(inputstate)

# A tibble: 51 × 2
   inputstate     n
        <dbl> <int>
 1          1   947
 2          2   115
 3          4  1463
 4          5   536
 5          6  5035
 6          8  1061
 7          9   642
 8         10   240
 9         11   197
10         12  4615
# ℹ 41 more rows

But it’s a number! Not, like, the actual name of the state, which would be more useful to me. I don’t know all 51 state FIPS codes by heart, but fortunately I have a dataset that includes every state’s name and FIPS code.

state_fips <- tidycensus::fips_codes |> 
  mutate(inputstate = as.numeric(state_code)) |> 
  select(inputstate, state_name) |> 
  unique()

head(state_fips)

    inputstate state_name
1            1    Alabama
68           2     Alaska
97           4    Arizona
112          5   Arkansas
187          6 California
245          8   Colorado

And I can join that information with the CES data like this.

ces_clean <- ces_clean |> 
  left_join(state_fips, by = 'inputstate')

ces_clean |> 
  count(inputstate, state_name)

# A tibble: 51 × 3
   inputstate state_name               n
        <dbl> <chr>                <int>
 1          1 Alabama                947
 2          2 Alaska                 115
 3          4 Arizona               1463
 4          5 Arkansas               536
 5          6 California            5035
 6          8 Colorado              1061
 7          9 Connecticut            642
 8         10 Delaware               240
 9         11 District of Columbia   197
10         12 Florida               4615
# ℹ 41 more rows

We will explore joining datasets more deeply in two weeks.

Selecting Variables

Finally, we can use the select() function to keep only the columns we want, and drop the columns we don’t want.

ces_clean <- ces_clean |> 
  select(caseid, gender, educ, age, state_name)

head(ces_clean)

# A tibble: 6 × 5
      caseid gender educ           age state_name   
       <dbl> <chr>  <fct>        <dbl> <chr>        
1 1232318697 Male   2-year          54 Connecticut  
2 1231394915 Female Post-grad       65 Florida      
3 1232451503 Female 4-year          74 Iowa         
4 1232494791 Female 4-year          58 Massachusetts
5 1232494829 Male   4-year          53 Illinois     
6 1232473675 Male   Some college    59 Ohio

And you can save your cleaned up version of the data to an .RData file with the save() function.

save(ces_clean, file = 'data/ces-clean.RData')

Additional Resources

Wrangling Penguins, by Allison Horst

Footnotes

I will generally use read_csv() instead of read.csv() in my own work, since it is faster and will give you useful error messages if the input is formatted in a way it doesn’t expect.↩︎