library(tidyverse)
<- read_csv('data/raw/CES20_Common_OUTPUT_vv.csv') ces_raw
Week 10: Data Wrangling
Up to now, we’ve been working with pretty tidy datasets. Every column is a variable, every row is an observation, and every value is where it should be. But things are not always this way. More often than you’re going to like, data comes to you an unruly mess, and you’ll need to tidy it up before you can even start to explore it.
This week, we’ll learn some of the most important functions in the tidyverse
for data wrangling. By the end of the week, you will be able to:
Filter the rows of a dataset using conditional logic with the
filter()
functionSelect and rename columns using the
select()
functionCreate new variables and modify existing ones using the
mutate()
functionProperly format categorical variables, with the help of functions like
case_when()
andfactor()
Perform a basic data merge with
left_join()
Reading
R4DS Chapters 3-4
R4DS Chapter 12
Problem Set
This week, we’re going to be summarizing different variables from the CES 2020 dataset. You’ll need to download the dataset (see instructions in the class notes below) and use the codebook to identify which variables contain the information you want, tidy them up, and then summarize. Write an R
script that performs these tasks, compile a PDF report, and submit the report.
Here are the questions I’d like you to answer:
Are labor union members more likely to be Democrats or Republicans?
Compare the average age of people with landline phones, compared to those who only have cell phones? Estimate the difference-in-means and report the p-value from a null hypothesis test.
Are people who read the newspaper more likely to correctly answer the questions about who controls the US House of Representatives and US Senate? (Remember, this survey was conducted at the end of 2020.)
Create a bar chart reporting the percent of respondents in each religious group that support making abortion illegal in all circumstances.
Bonus. The CES oversamples some respondents relative to their frequency among registered voters in order to get a better statistical picture of some smaller subgroups. The dataset contains a weight variable for each respondent, reflecting how much that type of respondent is overrepresented in the sample compared to the population. Compare the mean age of the sample with the weighted mean age of the sample. What do you notice? Do the same thing for the unweighted and weighted shares of each racial group. Redo the analyses in questions 1-4 using weighted means. What conclusions, if any, change?
Class Notes
Previously, we have worked with a tidied up version of the Cooperative Election Study (CES) from 2020. To illustrate the process of data wrangling, let’s look at the steps I took to import and clean up that dataset.
Importing The Data
The CES is a survey of a representative sample of 50,000+ American adults, conducted every election year by a large consortium of universities. The data going back to 2006 is available here. Go to that site and download the 2020 dataset, a 180 megabyte monster called CES20_Common_OUTPUT_vv.csv
.
The .csv
extension at the end of the file means that it is a “comma separated value” file. This is a file where all of the values are separated by commas. Looks something like this:
name, age, party
'Joe', 35, 'Bull Moose'
'Fiona', 6, 'Birthday'
'Avery', 3, 'Know Nothing'
It’s a nice standardized way to represent datasets, and fortunately for us, the read_csv()
function allows us to read such files into R
.1 Let’s create an object called ces_raw
from that raw dataset.
This dataset has a lot of information.
dim(ces_raw)
[1] 61000 717
61,000 rows and 717 columns, to be precise. The first few rows and columns look like this:
head(ces_raw)
# A tibble: 6 × 717
...1 caseid commonweight commonpostweight vvweight vvweight_post tookpost
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1232318697 0.783 0.666 0.851 0.607 2
2 2 1231394915 1.34 1.44 NA NA 2
3 3 1232451503 0.406 0.342 NA NA 2
4 4 1232494791 0.958 0.822 1.04 1.00 2
5 5 1232494829 0.195 0.162 NA NA 2
6 6 1232473675 1.06 0.880 1.15 0.988 2
# ℹ 710 more variables: CCEStake <dbl>, birthyr <dbl>, gender <dbl>,
# educ <dbl>, race <dbl>, race_other <chr>, hispanic <dbl>,
# CC20_hisp_1 <dbl>, CC20_hisp_2 <dbl>, CC20_hisp_3 <dbl>, CC20_hisp_4 <dbl>,
# CC20_hisp_5 <dbl>, CC20_hisp_6 <dbl>, CC20_hisp_7 <dbl>, CC20_hisp_8 <dbl>,
# CC20_hisp_9 <dbl>, CC20_hisp_10 <dbl>, CC20_hisp_11 <dbl>,
# CC20_hisp_12 <dbl>, CC20_hisp_t <chr>, CC20_asian_1 <dbl>,
# CC20_asian_2 <dbl>, CC20_asian_3 <dbl>, CC20_asian_4 <dbl>, …
Recoding Variables
Now that we have the dataset loaded into memory, let’s find the variables we’re interested in keeping and clean them up a bit. For example, gender is coded as a 1 or 2.
|>
ces_raw count(gender)
# A tibble: 2 × 2
gender n
<dbl> <int>
1 1 25791
2 2 35209
Let’s make this more readable. The codebook (available at the same site where you downloaded the data) says that 1 is male and 2 is female. We can create a new version of the gender column using the mutate()
and if_else()
functions.
<- ces_raw |>
ces_clean mutate(gender = if_else(gender == 1, 'Male', 'Female'))
|>
ces_clean count(gender)
# A tibble: 2 × 2
gender n
<chr> <int>
1 Female 35209
2 Male 25791
What about the education variable (educ
)?
|>
ces_clean count(educ)
# A tibble: 6 × 2
educ n
<dbl> <int>
1 1 1983
2 2 16618
3 3 13330
4 4 6539
5 5 14152
6 6 8378
That’s a lot of categories to clean up. We can combine the mutate()
function with case_when()
to recode a bunch of different values at the same time.
<- ces_clean |>
ces_clean mutate(educ = case_when(educ == 1 ~ 'No HS',
== 2 ~ 'High school graduate',
educ == 3 ~ 'Some college',
educ == 4 ~ '2-year',
educ == 5 ~ '4-year',
educ == 6 ~ 'Post-grad'))
educ |>
ces_clean count(educ)
# A tibble: 6 × 2
educ n
<chr> <int>
1 2-year 6539
2 4-year 14152
3 High school graduate 16618
4 No HS 1983
5 Post-grad 8378
6 Some college 13330
Now the variable has the correct labels, but notice that the order in that table is kind of weird. We tend to think of education as an ordinal variable, because there’s a natural order from least to most educated. But R
doesn’t know that yet. It just thinks of educ
as a character variable, and lists them in alphabetical order. We can tell R
about the variable order by recoding educ
as a factor.
<- ces_clean |>
ces_clean mutate(educ = factor(educ,
levels = c('No HS',
'High school graduate',
'Some college',
'2-year',
'4-year',
'Post-grad')))
|>
ces_clean count(educ)
# A tibble: 6 × 2
educ n
<fct> <int>
1 No HS 1983
2 High school graduate 16618
3 Some college 13330
4 2-year 6539
5 4-year 14152
6 Post-grad 8378
Much better!
There’s a variable in CES called birthyr
. What if we wanted to know the respondents age in 2020 instead? The mutate()
function can help us here too.
|>
ces_clean count(birthyr)
# A tibble: 78 × 2
birthyr n
<dbl> <int>
1 1925 2
2 1926 6
3 1927 10
4 1928 12
5 1929 23
6 1930 23
7 1931 42
8 1932 55
9 1933 76
10 1934 109
# ℹ 68 more rows
<- ces_clean |>
ces_clean mutate(age = 2020 - birthyr)
|>
ces_clean count(birthyr, age)
# A tibble: 78 × 3
birthyr age n
<dbl> <dbl> <int>
1 1925 95 2
2 1926 94 6
3 1927 93 10
4 1928 92 12
5 1929 91 23
6 1930 90 23
7 1931 89 42
8 1932 88 55
9 1933 87 76
10 1934 86 109
# ℹ 68 more rows
Joining Datasets
Here’s another messy feature of this dataset. Every respondent has a variable called inputstate
, representing the state where they live.
|>
ces_clean count(inputstate)
# A tibble: 51 × 2
inputstate n
<dbl> <int>
1 1 947
2 2 115
3 4 1463
4 5 536
5 6 5035
6 8 1061
7 9 642
8 10 240
9 11 197
10 12 4615
# ℹ 41 more rows
But it’s a number! Not, like, the actual name of the state, which would be more useful to me. I don’t know all 51 state FIPS codes by heart, but fortunately I have a dataset that includes every state’s name and FIPS code.
<- tidycensus::fips_codes |>
state_fips mutate(inputstate = as.numeric(state_code)) |>
select(inputstate, state_name) |>
unique()
head(state_fips)
inputstate state_name
1 1 Alabama
68 2 Alaska
97 4 Arizona
112 5 Arkansas
187 6 California
245 8 Colorado
And I can join that information with the CES data like this.
<- ces_clean |>
ces_clean left_join(state_fips, by = 'inputstate')
|>
ces_clean count(inputstate, state_name)
# A tibble: 51 × 3
inputstate state_name n
<dbl> <chr> <int>
1 1 Alabama 947
2 2 Alaska 115
3 4 Arizona 1463
4 5 Arkansas 536
5 6 California 5035
6 8 Colorado 1061
7 9 Connecticut 642
8 10 Delaware 240
9 11 District of Columbia 197
10 12 Florida 4615
# ℹ 41 more rows
We will explore joining datasets more deeply in two weeks.
Selecting Variables
Finally, we can use the select()
function to keep only the columns we want, and drop the columns we don’t want.
<- ces_clean |>
ces_clean select(caseid, gender, educ, age, state_name)
head(ces_clean)
# A tibble: 6 × 5
caseid gender educ age state_name
<dbl> <chr> <fct> <dbl> <chr>
1 1232318697 Male 2-year 54 Connecticut
2 1231394915 Female Post-grad 65 Florida
3 1232451503 Female 4-year 74 Iowa
4 1232494791 Female 4-year 58 Massachusetts
5 1232494829 Male 4-year 53 Illinois
6 1232473675 Male Some college 59 Ohio
And you can save your cleaned up version of the data to an .RData file with the save()
function.
save(ces_clean, file = 'data/ces-clean.RData')
Additional Resources
- Wrangling Penguins, by Allison Horst
Footnotes
I will generally use
read_csv()
instead ofread.csv()
in my own work, since it is faster and will give you useful error messages if the input is formatted in a way it doesn’t expect.↩︎