The Election Challenge

Authors

Camilla Salvatore, Peter Lugtig

Introduction

This material is part of the Utrecht Summer School S16 - Survey Research: Statistical Analysis and Estimation.

The Problem

The dataset(s) you will use today are slightly adapted from data that are publicly available through PEW, a non-profit survey data collection organisation in the USA.

Between June and July 2016, PEW designed a short questionnaire and asked three organizations that rely on volunteer opt-in survey panels to administer it in their panels. This resulted in a dataset of about 30,000 respondents. Today you receive about two-thirds of these respondents to develop an adjustment method. The remaining 10,000 respondents will serve as the hold-out sample against which I will test your adjustment method. I also excluded many variables: your dataset consists of about 30 variables that you can use for adjustment, and one dependent variable called VOTESUM.

The variable VOTESUM records intended voting behavior in the November 2016 Presidential Election, with a choice between Clinton, Trump and being undecided. On 1 July, the aggregated polls indicated that Clinton would receive 46% of the vote and Trump 42%, with the rest being undecided. If I correct for the bias that was present in the polls throughout the 2016 election cycle, my best guess for the true vote shares would have been 45% for Clinton and 43% for Trump.

If you want to know more about the study and data, have a look here.

The Data

You are provided with a dataset derived from a survey conducted by PEW between June and July 2016. The survey includes 30 variables related to demographic, socioeconomic, and behavioral characteristics, along with a dependent variable VOTESUM, which captures respondents’ intended voting behavior.

Dependent Variable:

VOTESUM: Respondent’s intended vote in the November 2016 Presidential Election (Clinton, Trump, Undecided)

Covariates: The dataset contains 30 covariates, including demographic variables such as gender, age and education, and additional variables related to lifestyle, political engagement, and more. These variables may be correlated with both selection into the sample (R) and the dependent variable (Y).

List of Variables:

GENDER: Male, Female
AGE: Age in years
EDUCCAT5: Educational level (5 categories), low to high
DIVISION: Region (4 categories)
MARITAL_ACS: Marital status (5 categories)
HHSIZECAT: Household size (1, 2, 3+)
CHILDRENCAT: Children at home (1, 2, 3+)
CITIZEN_REC: U.S. citizen (Yes, No)
BORN_ACS: Born in or outside the USA
FAMINC5: Income (5 bands: <20k, 20-40k, 40-75k, 75-150k, >150k)
EMPLOYED: Employment status (Yes, No)
MIL_ACS_REC: Military service (Never, Has served)
HOME_ACS_REC: Homeownership status (Own, Rent, Rent without pay)
FDSTMP_CPS: Receipt of food stamps (Yes, No)
TENURE_ACS: Residence at current address one year ago (Yes, No)
PUB_OFF_CPS: Visited a public official to express opinion in the past 12 months (Yes, No)
COMGRP_CPS: Participation in community groups in the past year (Yes, No)
TALK_CPS: Frequency of talking to family (5 categories)
TRUST_CPS: Trust in neighbors (5 categories)
TABLET_CPS: Use of a tablet (Yes, No)
TEXTIM_CPS: Sending text messages (Yes, No)
SOCIAL_CPS: Active on social media (Yes, No)
VOLSUM: Volunteering status (Yes, No)
REGISTERED: Registered to vote (Yes, No)
VOTE14: Voted in 2014 midterm election (Yes, No)
PARTYSCALE5: Party attachment (5 categories)
RELIGCAT: Religious affiliation (5 categories: Roman Catholic, Evangelical, Main Protestant, Other, Unaffiliated)
IDEO3: Ideological orientation (Liberal, Moderate, Conservative)
OWNGUN_GSS: Owns a gun (Yes, No)
FOLGOV: Follows the government (Yes, No)

You have two datasets available.

The non-probability dataset, with samples from three different vendors

Follow this code to load and inspect the data.

nonprob <- readRDS("PEW nonprob samples HOLDOUT.RDS")

# below you see how many cases there are from each vendor
table(nonprob$vendor)


Vendor 1 Vendor 2 Vendor 3 
    4927     4706     5113
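
Before adjusting anything, it can help to see whether the unadjusted vote intention differs between vendors. The sketch below uses a tiny made-up data frame (`nonprob_toy`) in place of `nonprob`, and assumes that `VOTESUM` is stored with the labels Clinton, Trump and Undecided; check the actual coding in your data first.

```r
# Toy stand-in for `nonprob` (made-up values, for illustration only)
nonprob_toy <- data.frame(
  vendor  = c("Vendor 1", "Vendor 1", "Vendor 2", "Vendor 2", "Vendor 3", "Vendor 3"),
  VOTESUM = c("Clinton", "Trump", "Clinton", "Clinton", "Trump", "Undecided")
)

# Row proportions: share of each answer within each vendor
votes_by_vendor <- prop.table(
  table(nonprob_toy$vendor, nonprob_toy$VOTESUM),
  margin = 1
)
round(votes_by_vendor, 2)
```

With the real data, replacing `nonprob_toy` by `nonprob` should give the same kind of table, one row per vendor.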

The reference dataset, which you can use either as:
- a reference probability sample
- the population from which to extract population totals

Follow this code to load the data.

popdata <- readRDS("PEW population data.RDS")
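
One way to use the reference data as a source of population totals is post-stratification on a single variable. The following is a minimal sketch with made-up toy data (`popdata_toy`, `nonprob_toy`); it assumes that each row of the reference data counts equally (no weights), which may not hold for the real `popdata`.

```r
# Toy reference data and toy non-probability sample (made-up values)
popdata_toy <- data.frame(
  GENDER = rep(c("Female", "Male"), times = c(52, 48)),
  stringsAsFactors = FALSE
)
nonprob_toy <- data.frame(
  GENDER  = rep(c("Female", "Male"), times = c(30, 70)),
  VOTESUM = c(rep("Clinton", 20), rep("Trump", 10),
              rep("Clinton", 28), rep("Trump", 42)),
  stringsAsFactors = FALSE
)

# Shares of each category in the population and in the sample
pop_share  <- prop.table(table(popdata_toy$GENDER))
samp_share <- prop.table(table(nonprob_toy$GENDER))

# Post-stratification weight: population share / sample share
g <- as.character(nonprob_toy$GENDER)
nonprob_toy$w <- as.numeric(pop_share[g] / samp_share[g])

# Unweighted versus weighted vote shares
unweighted <- prop.table(table(nonprob_toy$VOTESUM))
weighted   <- tapply(nonprob_toy$w, nonprob_toy$VOTESUM, sum) / sum(nonprob_toy$w)
round(rbind(unweighted, weighted), 3)
```

In this toy example women are under-represented in the sample and lean towards Clinton, so the Clinton share moves from 0.48 unweighted to about 0.54 after weighting.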

Your tasks:

You have 30 variables from which to build a model and predict voting behaviour, starting from a nonprobability sample. You can use just one of the three vendor samples or compare all three.
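
As one possible starting point, a model fitted on the nonprobability sample can be used to predict the outcome for every unit in the reference data, and the predictions can then be averaged. The sketch below uses simulated toy data and simplifies `VOTESUM` to a hypothetical binary indicator `clinton`; the data frames, the indicator and the simulated relationship are all assumptions for illustration, not part of the exercise data.

```r
set.seed(1)

# Simulated toy sample: ideology drives the (made-up) vote intention
n <- 500
nonprob_toy <- data.frame(
  GENDER = sample(c("Female", "Male"), n, replace = TRUE),
  IDEO3  = sample(c("Liberal", "Moderate", "Conservative"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
p_clinton <- c(Liberal = 0.8, Moderate = 0.5, Conservative = 0.2)[nonprob_toy$IDEO3]
nonprob_toy$clinton <- rbinom(n, 1, p_clinton)

# Simulated toy reference data with a different ideology distribution
popdata_toy <- data.frame(
  GENDER = sample(c("Female", "Male"), 1000, replace = TRUE),
  IDEO3  = sample(c("Liberal", "Moderate", "Conservative"), 1000,
                  replace = TRUE, prob = c(0.25, 0.35, 0.40)),
  stringsAsFactors = FALSE
)

# Fit the outcome model on the sample, predict for the reference data
fit  <- glm(clinton ~ GENDER + IDEO3, family = binomial, data = nonprob_toy)
pred <- predict(fit, newdata = popdata_toy, type = "response")

# Compare the naive sample mean with the model-based estimate
c(unadjusted = mean(nonprob_toy$clinton), model_based = mean(pred))
```

For the three-category `VOTESUM` you would need either one such model per category or a multinomial model; this sketch only shows the general fit-then-predict idea.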

Please note the following:

- It helps if the covariates you use also predict Y (voting behavior). There is no need to read the literature on voting behavior, but do think about this when you select your adjustment variables.
- If you include all 30 variables in your model in one go, perhaps with some interactions, you will run into problems. First, R may become quite slow. Second, there may be estimation issues, simply because this is a large number of variables. It may be worthwhile to first try a model with just a few variables.
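
A minimal sketch of this "start small, then extend" approach is shown here. It uses simulated toy data and a hypothetical binary outcome `clinton`; the variable names are taken from the list above, but their values and coding here are made up.

```r
set.seed(42)

# Simulated toy data (values and coding are assumptions)
n <- 400
toy <- data.frame(
  clinton  = rbinom(n, 1, 0.5),
  GENDER   = sample(c("Female", "Male"), n, replace = TRUE),
  EDUCCAT5 = sample(as.character(1:5), n, replace = TRUE),
  IDEO3    = sample(c("Liberal", "Moderate", "Conservative"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Step 1: a small model with two main effects
fit_small <- glm(clinton ~ GENDER + IDEO3, family = binomial, data = toy)

# Step 2: extend it with one more variable and one interaction
fit_larger <- update(fit_small, . ~ . + EDUCCAT5 + GENDER:IDEO3)

# The number of parameters grows quickly as terms are added
c(small = length(coef(fit_small)), larger = length(coef(fit_larger)))

# Compare the two fits before adding even more terms
AIC(fit_small, fit_larger)
```

Extending the formula step by step with `update()` makes it easy to see at which point the model becomes slow to fit or unstable.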