The SRP Package

SRP is an R package that contains useful functions for implementing multilevel regression and poststratification (MRP) and stacked regression and poststratification (SRP).

Motivation

Suppose we want to know how public opinion varies across subnational units, but we have no surveys conducted at the unit level, only a national-level survey. How can we use the information from the national survey to make inferences at the subnational level? This vignette walks through three techniques for doing so: disaggregation, multilevel regression and poststratification (MRP), and stacked regression and poststratification (SRP). It concludes with an introduction to synthetic poststratification.

Installation

The SRP package is currently available on GitHub. You can install it using the devtools package. You'll also want to load the tidyverse library for this vignette.

devtools::install_github('joeornstein/SRP')

library(SRP)
library(tidyverse)
The Data
The dataset (trainset) contains individual-level data on public opinion and demographic characteristics. It is simulated data, generated by the Monte Carlo study in Ornstein (2020), and it contains the following variables:
trainset <- SRP::vignetteData

trainset
#> # A tibble: 3,000 × 8
#> ID y x1 x2 unit latitude longitude unit_covariate
#> <int> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 638233 -2.68 2 1 43 0.663 0.140 2.22
#> 2 1909280 2.01 3 3 128 0.987 0.504 2.64
#> 3 2513825 -4.36 1 2 168 0.752 0.746 2.84
#> 4 2968824 -0.989 2 3 198 0.764 0.190 3.27
#> 5 258017 4.18 1 1 18 0.545 0.897 2.00
#> 6 1252503 -4.54 3 3 84 0.770 0.195 2.43
#> 7 1414783 -3.63 2 2 95 0.518 0.0937 2.47
#> 8 602889 3.56 3 3 41 0.162 0.559 2.20
#> 9 157609 0.271 2 2 11 0.431 0.554 1.93
#> 10 2610832 5.99 4 4 175 0.898 0.932 2.91
#> # ℹ 2,990 more rows
The variable \(y\) is our outcome of interest, \(x_1\) and \(x_2\) are individual-level covariates, unit is the subnational unit ID, and latitude, longitude, and unit_covariate are characteristics of the subnational unit.
Disaggregation
Disaggregation is the most straightforward method: simply take the unit-level means from the national survey. Note, however, that the number of observations within each unit is fairly small, so disaggregation is unlikely to yield good estimates. This is why we adopt a model-based approach instead.
disag_estimates <- trainset %>%
  group_by(unit) %>%
  summarise(disag_estimate = mean(y),
            num = n())

disag_estimates
#> # A tibble: 200 × 3
#> unit disag_estimate num
#> <int> <dbl> <int>
#> 1 1 -6.15 14
#> 2 2 2.61 14
#> 3 3 -4.10 16
#> 4 4 -3.45 15
#> 5 5 -3.63 10
#> 6 6 -5.40 21
#> 7 7 -2.74 15
#> 8 8 -2.95 18
#> 9 9 -1.61 15
#> 10 10 -1.48 16
#> # ℹ 190 more rows
MRP
Multilevel regression and poststratification (MRP) was introduced by Gelman and Little (1997) and refined by Park, Gelman, and Bafumi (2004). It proceeds in two steps:
- Estimate a multilevel regression, predicting opinion using observed individual-level and unit-level covariates.
- Poststratify by taking the mean of each group's prediction, weighted by the group's frequency in the subnational unit (see the formula below).
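To make the second step concrete, the poststratified estimate for unit \(s\) is the frequency-weighted average of the cell-level predictions:
\[
\hat{\theta}_s = \frac{\sum_{c} N_{cs} \, \hat{y}_{cs}}{\sum_{c} N_{cs}},
\]
where \(c\) indexes the cells (unique combinations of individual-level covariates), \(N_{cs}\) is the frequency of cell \(c\) in unit \(s\), and \(\hat{y}_{cs}\) is the first-stage model's prediction for that cell.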
We can estimate the first-stage regression using the lme4 package.
library(lme4)
model1 <- lmer(y ~ (1|x1) + (1|x2) + unit_covariate + (1|unit), data = trainset)
For the second stage, we need a poststratification frame. For the SRP package, it should come in the following format.
PSFrame <- SRP::vignettePSFrame

PSFrame
#> # A tibble: 3,200 × 7
#> unit x1 x2 freq unit_covariate latitude longitude
#> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 1 1 5482 1.54 0.623 0.0647
#> 2 1 1 2 2497 1.54 0.623 0.0647
#> 3 1 1 3 499 1.54 0.623 0.0647
#> 4 1 1 4 43 1.54 0.623 0.0647
#> 5 1 2 1 2418 1.54 0.623 0.0647
#> 6 1 2 2 1817 1.54 0.623 0.0647
#> 7 1 2 3 603 1.54 0.623 0.0647
#> 8 1 2 4 58 1.54 0.623 0.0647
#> 9 1 3 1 557 1.54 0.623 0.0647
#> 10 1 3 2 590 1.54 0.623 0.0647
#> # ℹ 3,190 more rows
Each row reports the empirical frequency for each unique combination of individual-level characteristics, repeated for each subnational unit. For example, the first row reports that there are 5482 individuals with \(x_1 = 1\) and \(x_2 = 1\) in Unit 1.
Once we have both pieces of information (the predictions and the frequencies), poststratification simply requires taking a weighted average, using the poststratify function. Note that this function requires your poststratification frame to contain two particular variables:
- unit: the identity of the subnational unit.
- freq: the empirical frequency of each cell.
pred <- predict(model1, PSFrame, allow.new.levels = TRUE)

mrp_estimates <- poststratify(pred, PSFrame)

mrp_estimates
#> # A tibble: 200 × 2
#> unit poststratifiedEstimate
#> <int> <dbl>
#> 1 1 -3.54
#> 2 2 -1.38
#> 3 3 -2.72
#> 4 4 -2.49
#> 5 5 -2.39
#> 6 6 -2.96
#> 7 7 -2.06
#> 8 8 -2.27
#> 9 9 -1.76
#> 10 10 -1.66
#> # ℹ 190 more rows
SRP
Stacked regression and poststratification (SRP) proceeds in the same fashion as MRP, but the first-stage predictions come from an ensemble model average generated through stacking. See Ornstein (2020) for technical details.
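Formally, the stacked first-stage prediction is a weighted average of the component models' predictions,
\[
\hat{y} = \sum_{m=1}^{M} w_m \hat{y}_m,
\]
where \(\hat{y}_m\) is the prediction from component model \(m\) and the weights \(w_m\) are chosen by cross-validation. The weights estimated below are non-negative and sum to one, so they can be read as an ensemble model average.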
To start, we must tune and estimate each of the component models separately. The following code estimates a hierarchical linear model (HLM), LASSO, K-nearest neighbors (KNN), random forest, and gradient boosting (GBM).
library(glmnet)
library(ranger)
library(kknn)
library(xgboost)
library(caret)
# Estimate HLM
hlmFormula <- y ~ (1|x1) + (1|x2) + unit_covariate + (1|unit)
hlmModel <- lmer(hlmFormula, data = trainset)

# Tune LASSO
lasso_vars <- c("x1", "x2", "unit", "unit_covariate")
lasso_factors <- c("x1", "x2", "unit") # which variables to convert to factors
trainset_lasso <- cleanDataLASSO(trainset, lasso_vars, lasso_factors)$trainset

lassoModel <- cv.glmnet(trainset_lasso, trainset$y,
                        type.measure = "mse")

# Tune KNN
knnFormula <- y ~ x1 + x2 + latitude + longitude + unit_covariate
knn_train <- train.kknn(knnFormula, data = trainset, kmax = 201) # find best k (LOOCV)
k_best <- knn_train$best.parameters$k

# Tune Random Forest
forestFormula <- y ~ x1 + x2 + latitude + longitude + unit_covariate
forestModel <- ranger(forestFormula, data = trainset)

# Tune GBM
gbm_vars <- c("x1", "x2", "latitude", "longitude", "unit_covariate")
trainset_gbm <- cleanDataGBM(trainset = trainset, gbm_vars = gbm_vars)$trainset

# Create an 'xgb.DMatrix' for faster computation
dtrain <- xgb.DMatrix(trainset_gbm, label = trainset$y)

# 5-fold cross-validation; pick the nrounds that minimizes RMSE
xgb.tune <- xgb.cv(data = dtrain,
                   booster = "gbtree",
                   objective = "reg:squarederror",
                   eval_metric = "rmse",
                   eta = 0.02,
                   nrounds = 50 / 0.02, # lower eta -> more trees
                   nfold = 5,
                   verbose = FALSE,
                   early_stopping_rounds = 20)

gbmModel <- xgboost(data = dtrain,
                    booster = "gbtree",
                    objective = "reg:squarederror",
                    eval_metric = "rmse",
                    eta = 0.02,
                    verbose = FALSE,
                    nrounds = xgb.tune$best_iteration)
Next, we will use the getStackWeights() function to estimate the optimal ensemble model average weights using 5-fold cross-validation.
stackWeights <- getStackWeights(trainset = trainset,
                                hlmFormula = hlmFormula,
                                lasso_vars = lasso_vars,
                                lasso_factors = lasso_factors,
                                forestFormula = forestFormula,
                                knnFormula = knnFormula, k_best = k_best,
                                gbm_vars = gbm_vars, gbm_factors = NULL,
                                gbm_params = list(eta = 0.02), gbm_tune = xgb.tune,
                                nfolds = 5)

stackWeights %>% round(3)
#> [1] 0.312 0.000 0.163 0.000 0.525
Then we can poststratify as before.
PSFrame_lasso <- cleanDataLASSO(PSFrame, lasso_vars = lasso_vars, lasso_factors = lasso_factors,
                                new_vars_lasso = colnames(trainset_lasso))$trainset
PSFrame_gbm <- cleanDataGBM(PSFrame, gbm_vars = gbm_vars)$trainset

# Component model predictions
M1 <- predict(hlmModel, PSFrame, allow.new.levels = TRUE)
M2 <- predict(lassoModel, newx = PSFrame_lasso, s = lassoModel$lambda.min)
M3 <- kknn(knnFormula, train = trainset, test = PSFrame, k = k_best)$fitted.values
M4 <- predict(forestModel, PSFrame)$predictions
M5 <- predict(gbmModel, PSFrame_gbm)

# Ensemble prediction: weighted average of the component models
M <- cbind(M1, M2, M3, M4, M5) %>% as.matrix()
pred <- M %*% stackWeights

# Poststratify
srp_estimates <- poststratify(pred, PSFrame)

head(srp_estimates)
#> # A tibble: 6 × 2
#> unit poststratifiedEstimate
#> <int> <dbl>
#> 1 1 -0.239
#> 2 2 0.828
#> 3 3 -3.18
#> 4 4 -2.96
#> 5 5 0.302
#> 6 6 -2.12
Results
Because the data came from a simulation, we also know the true unit-level means, so we can see how well each method's estimates recover them.
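If the true unit-level means are available as a tibble (here a hypothetical true_means with columns unit and true_mean, not an object exported by SRP), the comparison might look like this sketch:

# A minimal sketch; 'true_means' is a hypothetical tibble with
# columns 'unit' and 'true_mean' holding the simulation's true values
results <- disag_estimates %>%
  left_join(rename(mrp_estimates, mrp_estimate = poststratifiedEstimate), by = 'unit') %>%
  left_join(rename(srp_estimates, srp_estimate = poststratifiedEstimate), by = 'unit') %>%
  left_join(true_means, by = 'unit')

# Root mean squared error of each method against the truth
results %>%
  summarise(rmse_disag = sqrt(mean((disag_estimate - true_mean)^2)),
            rmse_mrp = sqrt(mean((mrp_estimate - true_mean)^2)),
            rmse_srp = sqrt(mean((srp_estimate - true_mean)^2)))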
Synthetic Poststratification
What if you do not have the joint frequency distribution for all your predictor variables at the subnational level? Leemann and Wasserfallen (2017) propose a method, called synthetic poststratification, that instead uses marginal frequency distributions. The approach multiplies the marginal probabilities to create a synthetic joint distribution, under the assumption that the predictor variables are statistically independent.
Note that this is a strong assumption. However, if the first-stage model is additively separable, then classical and synthetic poststratification produce identical results (see Appendix A of Ornstein (2020) for the proof). This implies that Multilevel Regression and Synthetic Poststratification (MrsP) can produce strictly superior estimates when the first-stage model is linear-additive, because one can include more predictor variables.
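Concretely, if unit \(s\) contains \(N_s\) individuals, of whom a share \(p_{js}\) fall in category \(j\) of the first covariate and a share \(q_{ks}\) in category \(k\) of the second, the synthetic joint frequency for cell \((j, k)\) is
\[
\tilde{N}_{jks} = N_s \, p_{js} \, q_{ks}.
\]
In the race/education example below, Unit 1 contains 1,000 individuals, 10% with race == 1 and 25% with education == 1, so the synthetic frequency of that cell is \(1000 \times 0.10 \times 0.25 = 25\).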
The SRP package provides a function, getSyntheticPSFrame(), that generates synthetic poststratification frames.
Using the Function
Suppose you have two (non-synthetic) frequency distributions describing the same population.
PSFrame1 <- SRP::race
PSFrame2 <- SRP::education

PSFrame1
#> # A tibble: 8 × 3
#> unit race freq
#> <int> <int> <int>
#> 1 1 1 100
#> 2 1 2 200
#> 3 1 3 300
#> 4 1 4 400
#> 5 2 1 200
#> 6 2 2 100
#> 7 2 3 300
#> 8 2 4 400
PSFrame2
#> # A tibble: 6 × 3
#> unit education freq
#> <int> <int> <int>
#> 1 1 1 250
#> 2 1 2 300
#> 3 1 3 450
#> 4 2 1 450
#> 5 2 2 350
#> 6 2 3 200
To get the synthetic joint distribution, just call getSyntheticPSFrame(). Note that the resulting output is consistent with the marginal frequency distributions: add up all the observations with race == 1 in a given unit and you should recover the corresponding frequency from PSFrame1 (see the check below).
PSFrame <- getSyntheticPSFrame(PSFrame1, PSFrame2)

PSFrame
#> # A tibble: 24 × 4
#> unit race education freq
#> <int> <int> <int> <dbl>
#> 1 1 1 1 25
#> 2 1 1 2 30
#> 3 1 1 3 45
#> 4 1 2 1 50
#> 5 1 2 2 60
#> 6 1 2 3 90
#> 7 1 3 1 75
#> 8 1 3 2 90
#> 9 1 3 3 135
#> 10 1 4 1 100
#> # ℹ 14 more rows
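As a quick check, we can use dplyr (loaded with the tidyverse) to verify this for race == 1 in Unit 1:

# Sum the synthetic frequencies for race == 1 in unit 1;
# 25 + 30 + 45 = 100, matching the corresponding entry in PSFrame1
PSFrame %>%
  filter(unit == 1, race == 1) %>%
  summarise(total = sum(freq))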
If you want to generate a synthetic poststratification frame from more than two marginal distributions, simply repeat the process.
PSFrame3 <- SRP::sex

PSFrame3
#> # A tibble: 4 × 3
#> unit sex freq
#> <int> <int> <int>
#> 1 1 1 600
#> 2 1 2 400
#> 3 2 1 500
#> 4 2 2 500
PSFrame <- getSyntheticPSFrame(PSFrame, PSFrame3)

PSFrame
#> # A tibble: 48 × 5
#> unit race education sex freq
#> <int> <int> <int> <int> <dbl>
#> 1 1 1 1 1 15
#> 2 1 1 1 2 10
#> 3 1 1 2 1 18
#> 4 1 1 2 2 12
#> 5 1 1 3 1 27
#> 6 1 1 3 2 18
#> 7 1 2 1 1 30
#> 8 1 2 1 2 20
#> 9 1 2 2 1 36
#> 10 1 2 2 2 24
#> # ℹ 38 more rows