Week 3

Wrangle your data

Reading

Optional Practice

Overview

Up to now, we’ve been working with pretty tidy datasets. Every column is a variable, every row is an observation, and every value is where it should be. But things are not always this way. More often than you’re going to like, data comes to you an unruly mess, and you’ll need to tidy it up before you can even start to explore it.

This week, we’ll learn some of the most important functions in the tidyverse for data importing, wrangling, and summarizing.

In your last problem set, you worked with a tidied up version of the Cooperative Election Study (CES) from 2020. Let’s look at the steps I took to import and clean up that dataset.

Importing The Data

The CES is a survey of a representative sample of 50,000+ American adults, conducted every election year by a large consortium of universities. The data going back to 2006 is available here. Go to that site and download the 2020 dataset, a 180 megabyte monster called CES20_Common_OUTPUT_vv.csv.

The .csv extension at the end of the file means that it is a “comma separated value” file. This is a file where all of the values are separated by commas. Looks something like this:

name, age, party
'Joe', 35, 'Bull Moose'
'Fiona', 6, 'Birthday'
'Avery', 3, 'Know Nothing'

It’s a nice standardized way to represent datasets, and fortunately for us, the read_csv() function allows us to read such files into R. Let’s create an object called ces_raw from that raw dataset.

library(tidyverse)

ces_raw <- read_csv('data/raw/CES20_Common_OUTPUT_vv.csv')

This dataset has a lot of information.

dim(ces_raw)
[1] 61000   717

61,000 rows and 717 columns, to be precise. The first few rows and columns look like this:

head(ces_raw)
# A tibble: 6 × 717
   ...1    caseid commonweight commonpostweight vvweight vvweight_post
  <dbl>     <dbl>        <dbl>            <dbl>    <dbl>         <dbl>
1     1    1.23e9        0.783            0.666    0.851         0.607
2     2    1.23e9        1.34             1.44    NA            NA    
3     3    1.23e9        0.406            0.342   NA            NA    
4     4    1.23e9        0.958            0.822    1.04          1.00 
5     5    1.23e9        0.195            0.162   NA            NA    
6     6    1.23e9        1.06             0.880    1.15          0.988
# ℹ 711 more variables: tookpost <dbl>, CCEStake <dbl>,
#   birthyr <dbl>, gender <dbl>, educ <dbl>, race <dbl>,
#   race_other <chr>, hispanic <dbl>, CC20_hisp_1 <dbl>,
#   CC20_hisp_2 <dbl>, CC20_hisp_3 <dbl>, CC20_hisp_4 <dbl>,
#   CC20_hisp_5 <dbl>, CC20_hisp_6 <dbl>, CC20_hisp_7 <dbl>,
#   CC20_hisp_8 <dbl>, CC20_hisp_9 <dbl>, CC20_hisp_10 <dbl>,
#   CC20_hisp_11 <dbl>, CC20_hisp_12 <dbl>, CC20_hisp_t <chr>, …

Recoding Variables

Now that we have the dataset loaded into memory, let’s find the variables we’re interested in keeping and clean them up a bit. For example, gender is coded as a 1 or 2.

ces_raw |> 
  count(gender)
# A tibble: 2 × 2
  gender     n
   <dbl> <int>
1      1 25791
2      2 35209

Let’s make this more readable. The codebook (available at the same site where you downloaded the data) says that 1 is male and 2 is female. We can create a new version of the gender column using the mutate() and if_else() functions.

ces_clean <- ces_raw |> 
  mutate(gender = if_else(gender == 1, 'Male', 'Female'))

ces_clean |> 
  count(gender)
# A tibble: 2 × 2
  gender     n
  <chr>  <int>
1 Female 35209
2 Male   25791

What about the education variable (educ)?

ces_clean |> 
  count(educ)
# A tibble: 6 × 2
   educ     n
  <dbl> <int>
1     1  1983
2     2 16618
3     3 13330
4     4  6539
5     5 14152
6     6  8378

That’s a lot of categories to clean up. We can combine the mutate() function with case_when() to recode a bunch of different values at the same time.

ces_clean <- ces_clean |> 
  mutate(educ = case_when(educ == 1 ~ 'No HS',
                          educ == 2 ~ 'High school graduate',
                          educ == 3 ~ 'Some college',
                          educ == 4 ~ '2-year',
                          educ == 5 ~ '4-year',
                          educ == 6 ~ 'Post-grad'))
ces_clean |> 
  count(educ)
# A tibble: 6 × 2
  educ                     n
  <chr>                <int>
1 2-year                6539
2 4-year               14152
3 High school graduate 16618
4 No HS                 1983
5 Post-grad             8378
6 Some college         13330

Now the variable has the correct labels, but notice that the order in that table is kind of weird. We tend to think of education as an ordinal variable, because there’s a natural order from least to most educated. But R doesn’t know that yet. It just thinks of educ as a character variable, and lists them in alphabetical order. We can tell R about the variable order by recoding educ as a factor.

ces_clean <- ces_clean |> 
  mutate(educ = factor(educ,
                       levels = c('No HS', 
                                  'High school graduate',
                                  'Some college',
                                  '2-year',
                                  '4-year',
                                  'Post-grad')))
ces_clean |> 
  count(educ)
# A tibble: 6 × 2
  educ                     n
  <fct>                <int>
1 No HS                 1983
2 High school graduate 16618
3 Some college         13330
4 2-year                6539
5 4-year               14152
6 Post-grad             8378

Much better!

There’s a variable in CES called birthyr. What if we wanted to know the respondents age in 2020 instead? The mutate() function can help us here too.

ces_clean |> 
  count(birthyr)
# A tibble: 78 × 2
   birthyr     n
     <dbl> <int>
 1    1925     2
 2    1926     6
 3    1927    10
 4    1928    12
 5    1929    23
 6    1930    23
 7    1931    42
 8    1932    55
 9    1933    76
10    1934   109
# ℹ 68 more rows
ces_clean <- ces_clean |> 
  mutate(age = 2020 - birthyr)

ces_clean |> 
  count(birthyr, age)
# A tibble: 78 × 3
   birthyr   age     n
     <dbl> <dbl> <int>
 1    1925    95     2
 2    1926    94     6
 3    1927    93    10
 4    1928    92    12
 5    1929    91    23
 6    1930    90    23
 7    1931    89    42
 8    1932    88    55
 9    1933    87    76
10    1934    86   109
# ℹ 68 more rows

Joining Datasets

Here’s another messy feature of this dataset. Every respondent has a variable called inputstate, representing the state where they live.

ces_clean |> 
  count(inputstate)
# A tibble: 51 × 2
   inputstate     n
        <dbl> <int>
 1          1   947
 2          2   115
 3          4  1463
 4          5   536
 5          6  5035
 6          8  1061
 7          9   642
 8         10   240
 9         11   197
10         12  4615
# ℹ 41 more rows

But it’s a number! Not, like, the actual name of the state, which would be more useful to me. I don’t know all 51 state FIPS codes by heart, but fortunately I have a dataset that includes every state’s name and FIPS code.

state_fips <- tidycensus::fips_codes |> 
  mutate(inputstate = as.numeric(state_code)) |> 
  select(inputstate, state_name) |> 
  unique()

head(state_fips)
    inputstate state_name
1            1    Alabama
68           2     Alaska
97           4    Arizona
112          5   Arkansas
187          6 California
245          8   Colorado

And I can join that information with the CES data like this.

ces_clean <- ces_clean |> 
  left_join(state_fips, by = 'inputstate')

ces_clean |> 
  count(inputstate, state_name)
# A tibble: 51 × 3
   inputstate state_name               n
        <dbl> <chr>                <int>
 1          1 Alabama                947
 2          2 Alaska                 115
 3          4 Arizona               1463
 4          5 Arkansas               536
 5          6 California            5035
 6          8 Colorado              1061
 7          9 Connecticut            642
 8         10 Delaware               240
 9         11 District of Columbia   197
10         12 Florida               4615
# ℹ 41 more rows

Selecting Variables

Finally, we can use the select() function to keep only the columns we want, and drop the columns we don’t want.

ces_clean <- ces_clean |> 
  select(caseid, gender, educ, age, state_name)

head(ces_clean)
# A tibble: 6 × 5
      caseid gender educ           age state_name   
       <dbl> <chr>  <fct>        <dbl> <chr>        
1 1232318697 Male   2-year          54 Connecticut  
2 1231394915 Female Post-grad       65 Florida      
3 1232451503 Female 4-year          74 Iowa         
4 1232494791 Female 4-year          58 Massachusetts
5 1232494829 Male   4-year          53 Illinois     
6 1232473675 Male   Some college    59 Ohio         

And you can save your cleaned up version of the data to an .RData file with the save() function.

save(ces_clean, file = 'data/ces-clean.RData')

Problem Set

Just like last week, we’re going to be summarizing different variables from the CES dataset. But this time, you’ll need to use the codebook to identify which variables contain the information you want, tidy them up, and then summarize. Write an R script that performs these tasks, compile a PDF report, and submit the report.

Here are the questions I’d like you to answer:

  1. Are labor union members more likely to be Democrats or Republicans?

  2. What is the median age of people with landline phones, compared to those who only have cell phones?

  3. Are people who read the newspaper more likely to correctly answer the questions about who controls the US House of Representatives and US Senate?

  4. Which religious groups are the most and least likely to support making abortion illegal in all circumstances?

  5. Pick one other variable you find interesting and describe it.