library(tidyverse)
library(ggplot2)
library(survey)
library(nonprobsvy)
library(PracTools)
Inference with nonprobability samples
Introduction
This material is part of the Utrecht Summer School S16 - Survey Research: Statistical Analysis and Estimation.
In this practical session we learn how to do inference with nonprobability samples; in the first part we discuss frequentist approaches (Inverse probability weighting, mass imputation and doubly robust methods) and in the second part the Bayesian MRP.
Frequentist approaches
In these exercises we use the nonprobsvy
package developed by Maciej Beręsewicz and Łukasz Chrostowski. You can find additional information about the package here:
Modern inference methods for non-probability samples with R;
GitHub repository;
A video of a workshop on nonprobsvy.
The data
In this exercise, we will analyze data from the 2003 U.S. Michigan Behavioral Risk Factor Surveillance Survey, available in the PracTools
package. The dataset, named mibrfss
, includes various demographic and health-related variables. We use this dataset as the basis in order to simulate a population of size 100000
and different sampling scenarios. In particular, the data we provide you with are:
Population Totals:
pop_tot.rda
– Contains population totals.Probability Sample (PS):
p_sample.rda
– A high-quality sample of size 10,000, assumed to be a simple random sample.Non-Probability Sample (NPS):
np_sample.rda
– A sample of size 1,000 with self-selection and coverage issues.
We consider a subset of the variables available in the mibrfss
dataset, namely:
SMOKE100
: Smoked 100 or more cigarettes in lifetime (1 = Yes; 2 = No)AGECAT
: Age group (1 = 18-24 years; 2 = 25-34 years; 3 = 35-44 years; 4 = 45-54 years; 5 = 55-64 years; 6 = 65+)INETHOME
: Has access to the Internet at home (1 = Yes; 2 = No)EDCAT
: Education level (1 = Did not graduate high school; 2 = Graduated high school; 3 = Attended college or technical school; 4 = Graduated from college or technical school)
For this exercise, we assume that the non-probability sample (NPS) was collected through social media advertisements. This approach introduces potential biases because only individuals with internet access and who are active on the specific social media platform could participate. Consequently, the NPS may suffer from both self-selection bias and coverage bias (since the sample is restricted to those with internet access and a presence on social media).
In contrast, we assume that the probability sample (PS) was obtained through a simple random sampling method. This sample is considered a high-quality reference sample, and for simplicity, we do not need to account for any design or calibration weights.
Estimate the proportion of U.S. individuals aged 18+ who smoked 100 or more cigarettes in their lifetime
Compare different estimation techniques against the true population value of 52.85%
Preliminary analyses
- Before starting, load the necessary R packages:
- Import and prepare the data: Load the datasets and inspect the data. Pay attention to the data format.
load("p_sample.rda")
load("np_sample.rda")
load("pop_tot.rda")
Task: Familiarize yourself with the datasets.
pop_tot
(Intercept) AGECAT2 AGECAT3 AGECAT4 AGECAT5 AGECAT6
100000 13408 19554 22526 16875 21879
RACECAT2 RACECAT3 EDCAT2 EDCAT3 EDCAT4 INETHOME1
8255 5113 30925 28464 31668 65370
Compare the distribution of the covariates in the probability and nonprobability samples:
You can use
prop.table
.
# Age
prop.table(table(p_sample$AGECAT))
1 2 3 4 5 6
0.0564 0.1351 0.1992 0.2211 0.1718 0.2164
prop.table(table(np_sample$AGECAT))
1 2 3 4 5 6
0.126 0.143 0.219 0.241 0.152 0.119
# Race
prop.table(table(p_sample$RACECAT))
1 2 3
0.8699 0.0792 0.0509
prop.table(table(np_sample$RACECAT))
1 2 3
0.878 0.061 0.061
# Education
prop.table(table(p_sample$EDCAT))
1 2 3 4
0.0875 0.3109 0.2801 0.3215
prop.table(table(np_sample$EDCAT))
1 2 3 4
0.111 0.256 0.267 0.366
# Internet at home
prop.table(table(p_sample$INETHOME))
0 1
0.35 0.65
prop.table(table(np_sample$INETHOME))
0 1
0.04 0.96
Individuals in the nonprobability samples are more young, with more graduated from collage or technical school higher education and mainly with internet at home.
- Define the sampling desing: Specify the sampling designs for both the PS and NPS samples using the
svydesign
function.
<- svydesign(ids = ~ 1,data = p_sample) ps_design
Warning in svydesign.default(ids = ~1, data = p_sample): No weights or
probabilities supplied, assuming equal probability
<- svydesign(ids = ~ 1,data = np_sample) nps_design
Warning in svydesign.default(ids = ~1, data = np_sample): No weights or
probabilities supplied, assuming equal probability
- Estimate Proportion from Non-Probability Sample
Use the svymean
and/or svyciprop
functions to calculate the proportion and its confidence interval. Save it in a compare
tibble where you will add the estimates obtained with different methods.
svymean(~ SMOKE100, nps_design)
mean SE
SMOKE100 0.489 0.0158
=svyciprop(~ SMOKE100, nps_design)
np_est np_est
2.5% 97.5%
SMOKE100 0.489 0.458 0.52
=confint(svyciprop(~ SMOKE100, nps_design))
ci= tibble(estimate=np_est[1],
compare lower_bound=ci[1],
upper_bound=ci[2],
method="nps")
The estimate of the proportion of individuals who smoked 100 or more cigarettes in their lifetime from the non-probability sample (NPS) is 48.9%, with a 95% confidence interval ranging from 45.8% to 52.0%. Note that the confidence interval does not include the true population proportion.
To improve our estimate, we will apply and compare various estimation techniques.
Inverse-probability weighting
When a probability sample is available
First, we assume that we have access to high quality data from a reference probability sample. Implement IPW using the nonprob
function with the following specifications:
Set up a selection model that includes the variables
AGECAT
,RACECAT
,EDCAT
, andINETHOME
Use as
target
variableSMOKE100
Use the probability sample design you created before as input for
svydesign
Use logistic regression in
method_selection
Save the estimate and the confidence interval
<-
ipw_logit_ps nonprob(
selection = ~ as.factor(AGECAT) + as.factor(RACECAT) + as.factor(EDCAT) + as.factor(INETHOME),
target = ~ SMOKE100,
data = np_sample,
svydesign = ps_design,
method_selection = "logit"
)
#display
cbind(ipw_logit_ps$output, ipw_logit_ps$confidence_interval)
mean SE lower_bound upper_bound
SMOKE100 0.5302948 0.03644835 0.4588573 0.6017322
#save
<- compare %>%
compare bind_rows(tibble(estimate=ipw_logit_ps$output$mean,
lower_bound=ipw_logit_ps$confidence_interval$lower_bound,
upper_bound=ipw_logit_ps$confidence_interval$upper_bound,
method="ipw_ps"))
The estimate is 53.02% (CI: 45.88% - 60.17%). This estimate is much closer to the true population proportion of 52.85%, and the confidence interval now includes the true value.
When only population information is available
Now we assume that only population totals are available.
- Implement IPW using the
nonprob
function with the previous specifications but instead ofsvydesign
, specify thepop_totals
.
<-
ipw_logit_poptotals nonprob(
selection = ~ AGECAT + RACECAT + EDCAT + INETHOME,
target = ~ SMOKE100,
data = np_sample,
pop_totals = pop_tot,
method_selection = "logit"
)
#display
cbind(ipw_logit_poptotals$output, ipw_logit_poptotals$confidence_interval)
mean SE lower_bound upper_bound
SMOKE100 0.5449075 0.07954355 0.389005 0.70081
#save
<- compare %>%
compare bind_rows(tibble(
estimate = ipw_logit_poptotals$output$mean,
lower_bound = ipw_logit_poptotals$confidence_interval$lower_bound,
upper_bound = ipw_logit_poptotals$confidence_interval$upper_bound,
method = "ipw_pop_tot"
))
Now, the estimate is 54.49% but the confidence interval is quite large (CI: 38.9% - 70%). This wide interval suggests considerable uncertainty in the estimate.
Mass Imputation
When a probability sample is available
First, we assume that we have access to high quality data from a reference probability sample. Implement Mass Imputation using the nonprob
function with the following specifications:
Set up an outcome model for
SMOKE100
that includes the variablesAGECAT
,RACECAT
,EDCAT
, andINETHOME
Use the probability sample design you created before as input for
svydesign
Given the outcome type, select the correct option for the
method_outcome
andfamily_outcome
Save the estimate and the confidence interval
<- nonprob(outcome = SMOKE100 ~ AGECAT + RACECAT + EDCAT + INETHOME,
mi_sample data = np_sample,
svydesign = ps_design,
method_outcome = "glm",
family_outcome = "binomial")
#display
cbind(mi_sample$output, mi_sample$confidence_interval)
mean SE lower_bound upper_bound
SMOKE100 0.5284921 0.02853384 0.4725668 0.5844174
#save
<- compare %>%
compare bind_rows(tibble(
estimate = mi_sample$output$mean,
lower_bound = mi_sample$confidence_interval$lower_bound,
upper_bound = mi_sample$confidence_interval$upper_bound,
method = "mi_sample"
))
The estimate is 52.84% (CI: 47.1% - 58.4%).
When only population information is available
Now we assume that only population totals are available.
- Implement Mass Imputation using the
nonprob
function with the previous specifications but instead ofsvydesign
, specify thepop_totals
.
<- nonprob(outcome = SMOKE100 ~ AGECAT + RACECAT + EDCAT + INETHOME,
mi_poptotals data = np_sample,
pop_totals = pop_tot,
method_outcome = "glm",
family_outcome = "binomial")
#display
cbind(mi_poptotals$output,mi_poptotals$confidence_interval)
mean SE lower_bound upper_bound
SMOKE100 0.5303917 0.03275615 0.4661909 0.5945926
#save
<- compare %>%
compare bind_rows(tibble(
estimate = mi_poptotals$output$mean,
lower_bound = mi_poptotals$confidence_interval$lower_bound,
upper_bound = mi_poptotals$confidence_interval$upper_bound,
method = "mi_pop_tot"
))
The estimate is 53.04% (CI: 46.6% - 59.5%).
Doubly-robust inference
When a probability sample is available
First, we assume that we have access to high quality data from a reference probability sample. Implement the Doubly-robust method using the nonprob
function with the following specifications:
Set up an outcome model for
SMOKE100
that includes the variablesAGECAT
,RACECAT
,EDCAT
, andINETHOME
Set up a selection model that includes the variables
AGECAT
,RACECAT
,EDCAT
, andINETHOME
Use the probability sample design you created before as input for
svydesign
Given the outcome type, select the correct option for the
method_outcome
andfamily_outcome
Use a logistic regression for the selection model
Save the estimate and the confidence interval
<-
dr_logit_sample nonprob(
selection = ~ AGECAT + RACECAT + EDCAT + INETHOME,
outcome = SMOKE100 ~ AGECAT + RACECAT + EDCAT + INETHOME,
data = np_sample,
svydesign = ps_design,
method_selection = "logit",
method_outcome = "glm",
family_outcome = "binomial"
)
#display
cbind(dr_logit_sample$output, dr_logit_sample$confidence_interval)
mean SE lower_bound upper_bound
SMOKE100 0.5263664 0.03303545 0.4616181 0.5911147
#save
<- compare %>%
compare bind_rows(tibble(
estimate = dr_logit_sample$output$mean,
lower_bound = dr_logit_sample$confidence_interval$lower_bound,
upper_bound = dr_logit_sample$confidence_interval$upper_bound,
method = "dr_logit_sample"
))
The estimate is 52.53% (CI: 46.2% - 59.1%).
When only population information is available
Now we assume that only population totals are available.
- Implement the Doubly-Robust approach again using the
nonprob
function with the previous specifications but instead ofsvydesign
, specify thepop_totals
.
<-
dr_logit_poptotals nonprob(
selection = ~ AGECAT + RACECAT + EDCAT + INETHOME,
outcome = SMOKE100 ~ AGECAT + RACECAT + EDCAT + INETHOME,
data = np_sample,
pop_totals = pop_tot,
method_selection = "logit",
method_outcome = "glm",
family_outcome = "binomial"
)
#display
cbind(dr_logit_poptotals $output, dr_logit_poptotals $confidence_interval)
mean SE lower_bound upper_bound
SMOKE100 0.5467979 0.02724855 0.4933917 0.6002041
#save
<- compare %>%
compare bind_rows(tibble(
estimate = dr_logit_poptotals $output$mean,
lower_bound = dr_logit_poptotals $confidence_interval$lower_bound,
upper_bound = dr_logit_poptotals $confidence_interval$upper_bound,
method = "dr_logit_poptotals"
))
The estimate is 54.7% (CI: 49.3% - 60.0%).
Comparing the results
Finally compare the estimates. Use can use ggplot
to plot the data in the compare
table.
%>%
compare ggplot(aes(
y = method,
x = estimate,
xmin = lower_bound,
xmax = upper_bound
+
)) geom_point() +
geom_vline(
xintercept = 0.5285,
linetype = "dotted",
color = "red"
+
) geom_errorbar() +
labs(x = "Point estimator and confidence interval", y = "estimators")+theme_minimal()
Bayesian approaches
Multilevel regression and post-stratification
We use the data from from the 2018 Cooperative Cooperative Congressional Election Study, a US nationwide survey designed by a consortium of 60 research teams and administered by YouGov. The data and the guide with information about the survey and the questionnaire area available here.
This tutorial is adapted from the Book MRP Case Studies that you can find here.
- Estimate the proportion of individuals who support the cut in the Corporate Income Tax rate. The questionnaire item is the following:
Congress considered many changes in tax law over the past two years. Do you support or oppose each of the following?
CC18_325a Taxes – Cut the Corporate Income Tax rate from 39 percent to 21 percent. (Support/Oppose)
- Compare the estimates obtained with the unadjasted nonprobability sample, with the estimates we would obtain if we had a large probability sample and the estimates obtained with MRP
The data
The variables are the following:
state
: 50 US statesage
: Age in classes, 18-29, 30-39, 40-49, 50-59, 60-69, 70+male
: Gender classified as male (1) or Female (2)eth
: Ethnicity classified as (Non-hispanic) White, Black, Hispanic, Other (which also includes Mixed)educ
: Education classified as No HS, HS, Some college, 4-year college, Post-gradregion
: Geographical region classified in Northeast, North Central, South, and Westrepvote
: Republican vote share in the 2016 presidential election
If you follow the book example, some data pre-processing are needed. Here, for simplicity, we use a ready-to-use version of the data.
Before starting, you need to install and load the following packages:
library(tidyverse)
library(rstan)
library(rstanarm)
library(dplyr)
library(readr)
library(ggplot2)
library(bayesplot)
library(ggalt)
library(reshape2)
You have the following dataset available:
statelevel_predictors_df
: for each one of the 50 US States contains therepvote
and theregion
cces_biased_df
: a nonprobability sample of size 5.000 from which we obtain un-adjusted estimates and then MRP estimatescces_all_df
: is the full probability sample of size 59.693 observations that we use only for comparisonpoststrat_df_full
: is the poststratification table with population counts constructed considering the state, the ethniticy, the gender, the age and the education of individuals. As we defined the levels for these variables, the poststratification table must have \(50 \cdot 6 \cdot 2 \cdot 4 \cdot 5 = 12.000\) rows. This means we actually have more rows in the poststratification table than observed units, which necessarily implies that there are some combinations in the poststratification table that we don’t observe in our nonprobability sample. In this example, we ensured that all cells in the poststratification table were included, even if their weight was zero, but this was done solely for illustrative purposes.
<- read_csv('statelevel_predictors.csv')
statelevel_predictors_df load("cces_all_df.rda")
load("cces_biased_df.rda")
load("poststrat_df_full.rda")
Preliminary analyses
Compare the composition of the nonprobability sample with respect to the population characteristics (in the poststratification table)
Produce plots and tables.
One of the possible solutions follows.
<-
age_sample %>%
cces_biased_df mutate(age = factor(age, ordered = FALSE)) %>%
group_by(age) %>%
summarise(n = n()) %>%
mutate(Sample = n /sum(n))
<- poststrat_df %>%
age_post mutate(age = factor(age, ordered = FALSE)) %>%
group_by(age) %>%
summarise(n_post = sum(n)) %>%
mutate(Population = n_post/sum(n_post))
<-
age inner_join(age_sample, age_post, by = "age") %>%
select(age, Sample, Population)
<- ggplot() + ylab("") + xlab("Proportion") +
age_plot theme_bw() + coord_flip() +
geom_dumbbell(data = age, aes(y = age, x = Sample, xend = Population)) +
geom_point(
data = melt(age, id = "age"),
aes(y = age, x = value, color = variable),
size = 2
+
) scale_x_continuous(limits = c(0, 0.35),
breaks = c(0, .1, .2, .3)) +
theme(legend.position = "none") +
ggtitle("Age")
# Gender
<-
male_sample %>%
cces_biased_df group_by(male) %>%
summarise(n = n()) %>%
mutate(Sample = n / sum(n))
<-
male_post %>%
poststrat_df group_by(male) %>%
summarise(n_post = sum(n)) %>%
mutate(Population = n_post / sum(n_post))
<-
male inner_join(male_sample, male_post, by = "male") %>%
select(male, Sample, Population) %>%
mutate(male = factor(
male,levels = c(-0.5,+0.5),
labels = c("Female", "Male")
))
<- ggplot() + ylab("") + xlab("") +
male_plot theme_bw() + coord_flip() +
geom_dumbbell(data = male, aes(y = male, x = Sample, xend = Population)) +
geom_point(
data = melt(male, id = "male"),
aes(y = male, x = value, color = variable),
size = 2
+
) scale_x_continuous(limits = c(0, 0.6),
breaks = c(0, .2, .4, .6)) +
theme(legend.position = "none") + ggtitle("Gender")
# Ethnicity
<-
ethnicity_sample %>%
cces_biased_df group_by(eth) %>%
summarise(n = n()) %>%
mutate(Sample = n / sum(n))
<-
ethnicity_post %>%
poststrat_df group_by(eth) %>%
summarise(n_post = sum(n)) %>%
mutate(Population = n_post / sum(n_post))
<-
ethnicity inner_join(ethnicity_sample, ethnicity_post, by = "eth") %>%
select(eth, Sample, Population)
$eth <- factor(
ethnicity$eth,
ethnicitylevels = c("Black", "Hispanic", "Other", "White"),
labels = c("Black", "Hispanic", "Other", "White")
)
<- ggplot() + ylab("") + xlab("") +
ethnicity_plot theme_bw() + coord_flip() +
geom_dumbbell(data = ethnicity, aes(y = eth, x = Sample, xend = Population)) +
geom_point(
data = melt(ethnicity, id = "eth"),
aes(y = eth, x = value, color = variable),
size = 2
+
) scale_x_continuous(limits = c(0, 0.9),
breaks = c(0, .2, .4, .6, 0.8)) +
theme(legend.position = "none") + ggtitle("Ethnicity")
# Education
<-
educ_sample %>%
cces_biased_df mutate(educ = factor(educ, ordered = FALSE)) %>%
group_by(educ) %>%
summarise(n = n()) %>%
mutate(Sample = n / sum(n))
<-
educ_post %>%
poststrat_df mutate(educ = factor(educ, ordered = FALSE)) %>%
group_by(educ) %>%
summarise(n_post = sum(n)) %>%
mutate(Population = n_post /sum(n_post))
<- inner_join(educ_sample, educ_post, by = "educ") %>%
educ select(educ, Sample, Population)
$educ <- factor(
educ$educ,
educlevels = c("No HS", "HS", "Some college", "4-Year College", "Post-grad"),
labels = c("No HS", "HS", "Some\nCollege", "4-year\nCollege", "Post-grad")
)
<- ggplot() + ylab("") + xlab("") +
educ_plot theme_bw() + coord_flip() +
geom_dumbbell(data = educ, aes(y = educ, x = Sample, xend = Population)) +
geom_point(
data = melt(educ, id = "educ"),
aes(y = educ, x = value, color = variable),
size = 2
+
) scale_x_continuous(limits = c(0, 0.33),
breaks = c(0, .1, .2, .3)) +
theme(legend.position = "none") + ggtitle("Education")
# State
<- cces_biased_df %>%
state_samplegroup_by(state) %>%
summarise(n = n()) %>%
mutate(Sample = n/sum(n))
<- poststrat_df %>%
state_post group_by(state) %>%
summarise(n_post = sum(n)) %>%
mutate(Population = n_post/sum(n_post))
<- full_join(state_sample, state_post, by = "state")
state
=state %>% inner_join(statelevel_predictors_df %>% select(state,repvote)) state
Joining with `by = join_by(state)`
<- state$repvote
states_order
$state <- fct_reorder(state$state, states_order) state
::ggarrange(age_plot, male_plot, ethnicity_plot, educ_plot, nrow=1,common.legend = TRUE, legend="bottom") ggpubr
%>%
state ggplot() +
ylab("") + xlab("Proportion") + theme_bw() + coord_flip() +
geom_dumbbell(aes(y = state, x = Sample, xend = Population)) +
geom_point(aes(y = state, x = Sample, color = "Sample"), size = 2) +
geom_point(aes(y = state, x = Population, color = "Population"), size = 2) +
scale_x_continuous(limits = c(0, 0.13), breaks = c(0, .025, .05, .075, .1, .125)) + ggtitle("State") +
theme(legend.position = "bottom", legend.title=element_blank())
We can see that in the nonprobability samples individuals tends to be older, male, white, less educated and from Republican states.
The Mister: MR
In this section you need to implement a multilevel regression in order to predict the outcome (
taxred
) based on a set of factors.Use the
stan_gmler
functionInclude in the model varying intercepts
(1 | x)
for:age
,eth
,educ
, andstate
Include interactions
(1 | x1:x2)
betweenmale
andeth
,educ
andage
andeduc
andeth
Include as state-level predictors also
repvote
andregion
Use a logistic model
library(rstanarm)
<- stan_glmer(taxred ~ (1 | state) + (1 | eth) + (1 | educ) + (1 | age) + male +
fit 1 | male:eth) + (1 | educ:age) + (1 | educ:eth) +
(+ factor(region),
repvote family = binomial(link = "logit"),
data = cces_biased_df,
prior = normal(0, 1, autoscale = TRUE),
prior_covariance = decov(scale = 0.50),
adapt_delta = 0.99,
refresh = 0,
seed = 1010)
Running this code might be computational expensive, thus you can directly load the data with the results:
<- readRDS("my_fit.rds") fit
The P: Postststratification
The second step is to poststratify.
Draw 4000 values from the posterior predictive distribution using
posterior_epred
using thepoststrat_df
Compute the vector of MRP estimates for the 4000 draws, remember from the slides
pred * n/N
. Note that the predictions comes from 4000 draws from the predictive distribution, thus you have a matrixCompute the final general MRP estimates taking the mean and the sd
<- posterior_epred(fit, newdata = poststrat_df, draws = 4000)
epred_mat <- epred_mat %*% poststrat_df$n / sum(poststrat_df$n)
mrp_estimates_vector <- c(mean = mean(mrp_estimates_vector), sd = sd(mrp_estimates_vector)) mrp_estimate
- Compute the mean and the standard error from the row nonprobability data and from the larger probability sample data (for a bernoulli distribution it is
sqrt(p*(1-p)/n)
<- function(p, n){
get_se_bernoulli return(sqrt(p*(1-p)/n))
}<- c(mean = mean(cces_biased_df$taxred), se = get_se_bernoulli(mean(cces_biased_df$taxred), nrow(cces_biased_df)))
cces_estimate <- c(mean = mean(cces_all_df$taxred), se = get_se_bernoulli(mean(cces_all_df$taxred), nrow(cces_all_df))) full_cces_estimate
Compute the MRP and the unadjasted estimates for each of the 50 US states
Compute the estimates for each of the 50 US States using the probability sample data
Use the following code. Take your time to understand it
<- datasets::state.abb
state_abb
<- data.frame(
states_df state = state_abb,
mrp_estimate = NA,
mrp_estimate_se = NA,
cces_estimate = NA,
cces_estimate_se = NA,
full_cces_estimate = NA,
full_cces_estimate_se = NA,
n_sample = NA,
n_full = NA
)
for (i in 1:nrow(states_df)) {
<-
filtering_condition which(poststrat_df$state == states_df$state[i])
<- epred_mat[, filtering_condition]
state_epred_mat <- poststrat_df[filtering_condition,]$n
k_filtered <-
mrp_estimates_vector %*% k_filtered / sum(k_filtered)
state_epred_mat %>% count(state)
states_df
# MRP estimate
$mrp_estimate[i] <- mean(mrp_estimates_vector)
states_df$mrp_estimate_se[i] <- sd(mrp_estimates_vector)
states_df
# Biased sample survey unadjusted estimate
$cces_estimate[i] <-
states_dfmean(filter(cces_biased_df, state == states_df$state[i])$taxred)
$n_sample[i] <-
states_dfnrow(filter(cces_biased_df, state == states_df$state[i]))
$cces_estimate_se[i] <-
states_dfget_se_bernoulli(states_df$cces_estimate[i], states_df$n_sample[i])
# Full PS sample survey unadjusted estimate
$full_cces_estimate[i] <-
states_dfmean(filter(cces_all_df, state == states_df$state[i])$taxred)
$n_full[i] <-
states_dfnrow(filter(cces_all_df, state == states_df$state[i]))
$full_cces_estimate_se[i] <-
states_dfget_se_bernoulli(states_df$full_cces_estimate[i], states_df$n_full[i])
}
Compare the results
- Compare the results using
ggplot
<- ggplot(data = states_df) +
compare1 geom_point(aes(x = state, y = cces_estimate), color = "#E37B1C") +
geom_errorbar(
aes(
ymin = cces_estimate - 2 * cces_estimate_se,
ymax = cces_estimate + 2 * cces_estimate_se,
x = state
),alpha = .5,
width = 0,
color = "#E37B1C"
+
) geom_point(data = states_df,
aes(x = state, y = mrp_estimate),
color = "#7B1CE3") +
geom_errorbar(
data = states_df,
aes(
ymin = mrp_estimate - 2 * mrp_estimate_se,
ymax = mrp_estimate + 2 * mrp_estimate_se,
x = state
),alpha = .5,
width = 0,
color = "#7B1CE3"
+
) geom_point(aes(x = state, y = full_cces_estimate), color = "#1CE37B") +
geom_errorbar(
data = states_df,
aes(
ymin = full_cces_estimate - 2 * full_cces_estimate_se,
ymax = full_cces_estimate + 2 * full_cces_estimate_se,
x = state
),alpha = .5,
width = 0,
color = "#1CE37B"
+
) scale_y_continuous(
breaks = c(0, .25, .5, .75, 1),
labels = c("0%", "25%", "50%", "75%", "100%"),
expand = c(0, 0)
+
) coord_cartesian(ylim = c(0, 1)) +
theme_bw() +
labs(x = "States", y = "Support") +
theme(
legend.position = "none",
axis.title = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.text.x = element_text(
angle = 90,
size = 8,
vjust = 0.3
),legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
<- ggplot(data = data.frame()) +
compare2 geom_point(aes(y = cces_estimate[1], x = .25), color = "#E37B1C") +
geom_errorbar(
data = data.frame(),
aes(
y = cces_estimate[1],
x = .25,
ymin = cces_estimate[1] - 2 * cces_estimate[2],
ymax = cces_estimate[1] + 2 * cces_estimate[2]
),width = 0,
color = "#E37B1C"
+
) geom_text(
aes(x = Inf, y = cces_estimate[1] + 0.06, label = "Unadjusted Sample"),
hjust = -.05,
size = 4,
color = "#E37B1C"
+
) geom_point(aes(y = mrp_estimate[1], x = .75), color = "#7B1CE3") +
geom_errorbar(
data = data.frame(),
aes(
y = mrp_estimate[1],
x = .75,
ymin = mrp_estimate[1] - 2 * mrp_estimate[2],
ymax = mrp_estimate[1] + 2 * mrp_estimate[2]
),width = 0,
color = "#7B1CE3"
+
) geom_text(
data = data.frame(),
aes(x = Inf, y = mrp_estimate[1] + 0.02, label = "Sample with MRP"),
hjust = -.05,
size = 4,
color = "#7B1CE3"
+
) scale_y_continuous(
breaks = c(0, .25, .5, .75, 1),
labels = c("0%", "25%", "50%", "75%", "100%"),
limits = c(0, 1),
expand = c(0, 0)
+
) geom_point(data = data.frame(),
aes(y = full_cces_estimate[1], x = .5),
color = "#1CE37B") +
geom_errorbar(
data = data.frame(),
aes(
y = full_cces_estimate[1],
x = .5,
ymin = full_cces_estimate[1] - 2 *
2],
full_cces_estimate[ymax = full_cces_estimate[1] + 2 *
2]
full_cces_estimate[
),width = 0,
color = "#1CE37B"
+
) geom_text(
data = data.frame(),
aes(x = Inf, y = full_cces_estimate - 0.04, label = "Complete Survey"),
hjust = -.06,
size = 4,
color = "#1CE37B"
+
) scale_y_continuous(
breaks = c(0, .25, .5, .75, 1),
labels = c("0%", "25%", "50%", "75%", "100%"),
limits = c(0, 1),
expand = c(0, 0)
+
) scale_x_continuous(
limits = c(0, 1),
expand = c(0, 0),
breaks = c(.25, .75)
+
) coord_cartesian(clip = 'off') +
theme_bw() +
labs(x = "Population", y = "") +
theme(
legend.position = "none",
axis.title.y = element_blank(),
axis.title.x = element_text(size = 10, margin = margin(
t = 19,
r = 0,
b = ,
l = 0
)),axis.text = element_blank(),
axis.ticks = element_blank(),
legend.title = element_text(size = 10),
legend.text = element_text(size = 10),
plot.margin = margin(5.5, 105, 5.5, 5.5, "pt")
)
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
bayesplot_grid(compare1,compare2,
grid_args = list(nrow=1, widths = c(5,1.4)))
Acknowledgments
Some of the exercises are adapted from:
Valliant, R., Dever, J. A., & Kreuter, F. (2018). Practical tools for designing and weighting survey samples (Vol. 1). New York: Springer.
The MRP Case Studies online book that you can find here.