Wrangle your data
R
programming, feel free to look over the chapters we skipped. In particular, the chapter on “Strings” if you want to work with text data, and “Dates & Times” if you want to work with time series.Up to now, we’ve been working with pretty tidy datasets. Every column is a variable, every row is an observation, and every value is where it should be. But things are not always this way. More often than you’re going to like, data comes to you an unruly mess, and you’ll need to tidy it up before you can even start to explore it.
This week, we’ll learn some of the most important functions in the tidyverse
for data importing, wrangling, and summarizing.
In your last problem set, you worked with a tidied up version of the Cooperative Election Study (CES) from 2020. Let’s look at the steps I took to import and clean up that dataset.
The CES is a survey of a representative sample of 50,000+ American adults, conducted every election year by a large consortium of universities. The data going back to 2006 is available here. Go to that site and download the 2020 dataset, a 180 megabyte monster called CES20_Common_OUTPUT_vv.csv
.
The .csv
extension at the end of the file means that it is a “comma separated value” file. This is a file where all of the values are separated by commas. Looks something like this:
name, age, party
'Joe', 35, 'Bull Moose'
'Fiona', 6, 'Birthday'
'Avery', 3, 'Know Nothing'
It’s a nice standardized way to represent datasets, and fortunately for us, the read_csv()
function allows us to read such files into R
. Let’s create an object called ces_raw
from that raw dataset.
This dataset has a lot of information.
dim(ces_raw)
[1] 61000 717
61,000 rows and 717 columns, to be precise. The first few rows and columns look like this:
head(ces_raw)
# A tibble: 6 × 717
...1 caseid commonweight commonpostweight vvweight vvweight_post
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.23e9 0.783 0.666 0.851 0.607
2 2 1.23e9 1.34 1.44 NA NA
3 3 1.23e9 0.406 0.342 NA NA
4 4 1.23e9 0.958 0.822 1.04 1.00
5 5 1.23e9 0.195 0.162 NA NA
6 6 1.23e9 1.06 0.880 1.15 0.988
# ℹ 711 more variables: tookpost <dbl>, CCEStake <dbl>,
# birthyr <dbl>, gender <dbl>, educ <dbl>, race <dbl>,
# race_other <chr>, hispanic <dbl>, CC20_hisp_1 <dbl>,
# CC20_hisp_2 <dbl>, CC20_hisp_3 <dbl>, CC20_hisp_4 <dbl>,
# CC20_hisp_5 <dbl>, CC20_hisp_6 <dbl>, CC20_hisp_7 <dbl>,
# CC20_hisp_8 <dbl>, CC20_hisp_9 <dbl>, CC20_hisp_10 <dbl>,
# CC20_hisp_11 <dbl>, CC20_hisp_12 <dbl>, CC20_hisp_t <chr>, …
Now that we have the dataset loaded into memory, let’s find the variables we’re interested in keeping and clean them up a bit. For example, gender is coded as a 1 or 2.
ces_raw |>
count(gender)
# A tibble: 2 × 2
gender n
<dbl> <int>
1 1 25791
2 2 35209
Let’s make this more readable. The codebook (available at the same site where you downloaded the data) says that 1 is male and 2 is female. We can create a new version of the gender column using the mutate()
and if_else()
functions.
ces_clean <- ces_raw |>
mutate(gender = if_else(gender == 1, 'Male', 'Female'))
ces_clean |>
count(gender)
# A tibble: 2 × 2
gender n
<chr> <int>
1 Female 35209
2 Male 25791
What about the education variable (educ
)?
ces_clean |>
count(educ)
# A tibble: 6 × 2
educ n
<dbl> <int>
1 1 1983
2 2 16618
3 3 13330
4 4 6539
5 5 14152
6 6 8378
That’s a lot of categories to clean up. We can combine the mutate()
function with case_when()
to recode a bunch of different values at the same time.
ces_clean <- ces_clean |>
mutate(educ = case_when(educ == 1 ~ 'No HS',
educ == 2 ~ 'High school graduate',
educ == 3 ~ 'Some college',
educ == 4 ~ '2-year',
educ == 5 ~ '4-year',
educ == 6 ~ 'Post-grad'))
ces_clean |>
count(educ)
# A tibble: 6 × 2
educ n
<chr> <int>
1 2-year 6539
2 4-year 14152
3 High school graduate 16618
4 No HS 1983
5 Post-grad 8378
6 Some college 13330
Now the variable has the correct labels, but notice that the order in that table is kind of weird. We tend to think of education as an ordinal variable, because there’s a natural order from least to most educated. But R
doesn’t know that yet. It just thinks of educ
as a character variable, and lists them in alphabetical order. We can tell R
about the variable order by recoding educ
as a factor.
ces_clean <- ces_clean |>
mutate(educ = factor(educ,
levels = c('No HS',
'High school graduate',
'Some college',
'2-year',
'4-year',
'Post-grad')))
ces_clean |>
count(educ)
# A tibble: 6 × 2
educ n
<fct> <int>
1 No HS 1983
2 High school graduate 16618
3 Some college 13330
4 2-year 6539
5 4-year 14152
6 Post-grad 8378
Much better!
There’s a variable in CES called birthyr
. What if we wanted to know the respondents age in 2020 instead? The mutate()
function can help us here too.
ces_clean |>
count(birthyr)
# A tibble: 78 × 2
birthyr n
<dbl> <int>
1 1925 2
2 1926 6
3 1927 10
4 1928 12
5 1929 23
6 1930 23
7 1931 42
8 1932 55
9 1933 76
10 1934 109
# ℹ 68 more rows
# A tibble: 78 × 3
birthyr age n
<dbl> <dbl> <int>
1 1925 95 2
2 1926 94 6
3 1927 93 10
4 1928 92 12
5 1929 91 23
6 1930 90 23
7 1931 89 42
8 1932 88 55
9 1933 87 76
10 1934 86 109
# ℹ 68 more rows
Here’s another messy feature of this dataset. Every respondent has a variable called inputstate
, representing the state where they live.
ces_clean |>
count(inputstate)
# A tibble: 51 × 2
inputstate n
<dbl> <int>
1 1 947
2 2 115
3 4 1463
4 5 536
5 6 5035
6 8 1061
7 9 642
8 10 240
9 11 197
10 12 4615
# ℹ 41 more rows
But it’s a number! Not, like, the actual name of the state, which would be more useful to me. I don’t know all 51 state FIPS codes by heart, but fortunately I have a dataset that includes every state’s name and FIPS code.
state_fips <- tidycensus::fips_codes |>
mutate(inputstate = as.numeric(state_code)) |>
select(inputstate, state_name) |>
unique()
head(state_fips)
inputstate state_name
1 1 Alabama
68 2 Alaska
97 4 Arizona
112 5 Arkansas
187 6 California
245 8 Colorado
And I can join that information with the CES data like this.
ces_clean <- ces_clean |>
left_join(state_fips, by = 'inputstate')
ces_clean |>
count(inputstate, state_name)
# A tibble: 51 × 3
inputstate state_name n
<dbl> <chr> <int>
1 1 Alabama 947
2 2 Alaska 115
3 4 Arizona 1463
4 5 Arkansas 536
5 6 California 5035
6 8 Colorado 1061
7 9 Connecticut 642
8 10 Delaware 240
9 11 District of Columbia 197
10 12 Florida 4615
# ℹ 41 more rows
Finally, we can use the select()
function to keep only the columns we want, and drop the columns we don’t want.
# A tibble: 6 × 5
caseid gender educ age state_name
<dbl> <chr> <fct> <dbl> <chr>
1 1232318697 Male 2-year 54 Connecticut
2 1231394915 Female Post-grad 65 Florida
3 1232451503 Female 4-year 74 Iowa
4 1232494791 Female 4-year 58 Massachusetts
5 1232494829 Male 4-year 53 Illinois
6 1232473675 Male Some college 59 Ohio
And you can save your cleaned up version of the data to an .RData file with the save()
function.
save(ces_clean, file = 'data/ces-clean.RData')
Just like last week, we’re going to be summarizing different variables from the CES dataset. But this time, you’ll need to use the codebook to identify which variables contain the information you want, tidy them up, and then summarize. Write an R script that performs these tasks, compile a PDF report, and submit the report.
Here are the questions I’d like you to answer:
Are labor union members more likely to be Democrats or Republicans?
What is the median age of people with landline phones, compared to those who only have cell phones?
Are people who read the newspaper more likely to correctly answer the questions about who controls the US House of Representatives and US Senate?
Which religious groups are the most and least likely to support making abortion illegal in all circumstances?
Pick one other variable you find interesting and describe it.