import numpy as np
import pandas as pd
In-class activity
This notebook details the voter survey simulation presented in lecture as part of the Miller case study. Recall that the point of the simulation is to show that, under some assumptions about the sampling design and missing data mechanism, a strongly biased result is expected even when the actual rate of erroneous/fraudulent mail ballot requests is very low.
For this activity you’ll get to tinker with the simulation settings to better understand the example and the factors that might impact bias in the results.
The cell below simulates a hypothetical population of 150,000 voters who were issued mail ballots according to state records.
The quantity true_prop
is the population parameter we will ultimately estimate; this parameter is the proportion of the voters who were issued mail ballots according to state records but who did not request mail ballots. For these voters, either the mail ballots were issued erroneously or they were fraudulently requested. In context, a large estimate for this quantity is suggestive of some irregularities pertaining to the mail-in vote.
Below, this parameter of interest is set at \(0.5\%\).
Based on true_prop
, an indicator is assigned to each voter that is a 1 if the voter requested the mail ballot and a 0 otherwise; for simplicity, all zeroes are added to the top \(N\times\)true_prop
rows and the remaining rows are assigned ones.
# for reproducibility
np.random.seed(41021)

# proportion of fraud/error
true_prop = 0.005

# generate population of voters
N = 150000
population = pd.DataFrame(data = {'requested': np.ones(N)})

# add a label indicating whether the voter requested a mail ballot
num_nrequest = round(N*true_prop) - 1
population.iloc[0:num_nrequest, 0] = 0
Simulating sampling mechanisms
The cell below assigns sampling weights that represent the probability a voter in the population answers the phone and agrees to an interview.
The weights can be thought of as expected conditional response rates – the probabilities that (a) a voter is interviewed given they did request a mail ballot and (b) a voter is interviewed given that they did not request a mail ballot.
The assumption figuring in the weight calculation is that voters who did not request mail ballots are more likely to agree to an interview. This would naturally occur if the interviewer is not careful about the interview request and discloses immediately that they are investigating irregularities in mail ballot requests – those who didn’t experience any irregularities are much more likely to hang up or decline.
Currently, it is assumed that voters who did not request ballots are 10 times more likely to talk than those who did. This factor is stored as talk_factor
. The overall response rate is set at \(5\%\) and stored as p_talk
. The weight calculation proceeds using the law of total probability:
\[ P(T) = P(T|R)\left(P(R) + \underbrace{\frac{P(T|NR)}{P(T|R)}}_{\text{talk factor}}P(NR)\right) \]
Rearrangement yields an expression for \(P(T|R)\) in terms of the request rates, overall response rate, and talking factor.
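Explicitly, solving the identity above for the response rate among requesters gives the expression implemented in the cell below:

\[ P(T|R) = \frac{P(T)}{P(R) + \underbrace{\frac{P(T|NR)}{P(T|R)}}_{\text{talk factor}}P(NR)} \]

Since the ratio \(P(T|NR)/P(T|R)\) is fixed in advance as the talking factor, every quantity in the denominator is determined by the chosen settings.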
# probability that a randomly chosen voter requested a mail ballot
p_request = 1 - true_prop

# probability that a randomly chosen voter did not request a mail ballot
p_nrequest = true_prop

# assume respondents who did not request are more likely to talk by this factor
talk_factor = 10

# overall response rate
p_talk = 0.05

# conditional response rates
p_talk_request = p_talk/(p_request + talk_factor*p_nrequest)
p_talk_nrequest = talk_factor*p_talk_request

# print
print('rate for requesters: ', p_talk_request)
print('rate for non-requesters: ', p_talk_nrequest)
rate for requesters: 0.04784688995215312
rate for non-requesters: 0.47846889952153115
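When adjusting these settings later, it helps to confirm that the implied conditional rates are valid probabilities before sampling. A minimal sketch (the helper name is illustrative, not part of the case study):

```python
# illustrative helper: conditional response rates implied by a chosen
# overall response rate and talking factor, flagging invalid settings
def conditional_rates(p_talk, talk_factor, true_prop = 0.005):
    p_request = 1 - true_prop
    p_nrequest = true_prop
    p_talk_request = p_talk/(p_request + talk_factor*p_nrequest)
    p_talk_nrequest = talk_factor*p_talk_request
    if not (0 <= p_talk_nrequest <= 1):
        raise ValueError('these settings do not yield valid probabilities')
    return p_talk_request, p_talk_nrequest

# the lecture settings are valid ...
print(conditional_rates(0.05, 10))

# ... but a 5x higher response rate with the same talking factor is not
try:
    conditional_rates(0.25, 10)
except ValueError as e:
    print(e)
```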
If you like, feel free to adjust the overall response rate and talking factor to values that interest you. The question to ask when choosing these values is:
If I assume the response rate is \(x\) and that those who did not request mail ballots are \(y\) times more likely to talk, how much bias will that induce for the estimated proportion of erroneous/fraudulent requests?
Choose values for \(x\) and \(y\) for which you’d like to know the answer. Make sure the conditional rates are valid probabilities. The cell below will draw a sample for your specifications.
# sample size
n = 2500

# draw sample weighted by conditional probabilities
np.random.seed(41923)
population.loc[population.requested == 1, 'sample_weight'] = p_talk_request
population.loc[population.requested == 0, 'sample_weight'] = p_talk_nrequest
samp = population.sample(n = n, replace = False, weights = 'sample_weight')
The cell below returns the estimated proportion of erroneous/fraudulent requests and the error associated with this estimate.
print('estimated fraudulent/erroneous requests: ', 1 - samp.requested.mean())
print('true value: ', true_prop)
print('estimation error: ', 1 - samp.requested.mean() - true_prop)
estimated fraudulent/erroneous requests: 0.04400000000000004
true value: 0.005
estimation error: 0.03900000000000004
Extrapolating this estimate to a raw vote count among the population yields the following:
print('estimated fraudulent/erroneous requests: ',
      np.round(N*(1 - samp.requested.mean())))
print('true value: ', N*true_prop)
print('estimation error: ',
      np.round(N*(1 - samp.requested.mean() - true_prop)))
estimated fraudulent/erroneous requests: 6600.0
true value: 750.0
estimation error: 5850.0
The bias – average error across samples – can be estimated by repeating this sampling scheme many times. The cell below computes estimates for nsim
simulated samples.
# for reproducibility
np.random.seed(41923)

# number of simulated samples
nsim = 1000

# storage for the estimates from each sample
estimates = np.zeros(nsim)

# for each simulation ...
for i in range(0, nsim):
    # draw a sample and compute the estimated proportion
    estimates[i] = population.sample(n = n,
                                     replace = False,
                                     weights = 'sample_weight'
                                     ).requested.mean()
The average error for this sampling design is given below.
print('average estimate: ', np.mean((1 - estimates)))
print('standard deviation of estimates: ', np.std(estimates))
print('truth: ', true_prop)
print('bias (proportion): ', np.mean((1 - estimates) - true_prop))
print('bias (count): ', np.mean(N*((1 - estimates) - true_prop)))
average estimate: 0.04465079999999999
standard deviation of estimates: 0.0037278169697558933
truth: 0.005
bias (proportion): 0.0396508
bias (count): 5947.62
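As a rough check on these numbers, the expected biased estimate can be approximated with Bayes' rule: the probability that a respondent is a non-requester is \(P(NR|T) = P(T|NR)P(NR)/P(T)\). The approximation lands in the same ballpark as the simulated average but slightly above it, plausibly because weighted sampling without replacement partially depletes the small, heavily weighted non-requester group.

```python
# Bayes' rule approximation to the expected (biased) estimate
true_prop = 0.005
talk_factor = 10
p_talk = 0.05

p_talk_request = p_talk/((1 - true_prop) + talk_factor*true_prop)
p_talk_nrequest = talk_factor*p_talk_request

# P(NR | T) = P(T | NR) P(NR) / P(T)
print('approximate expected estimate: ', p_talk_nrequest*true_prop/p_talk)
```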
Activity 1: tinker with the sampling mechanism
Take note of these results or make a duplicate of the cell and re-run it so that you have a copy for later reference. Now go back and adjust the settings. Repeat the simulation and compare changes.
Some questions you could explore are:
- how does increasing the overall response rate impact the bias?
- does sample size matter?
- what response rate(s) and talking factor(s) would produce estimates of 10% or more?
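For the last question, a quick sweep over the talking factor can be run against the same population. The sketch below uses a reduced number of simulations per setting to keep it fast; since pandas re-normalizes sampling weights, only the relative weight of the two groups matters in this sketch.

```python
import numpy as np
import pandas as pd

np.random.seed(41021)

# rebuild the population as above
N, true_prop, n = 150000, 0.005, 2500
population = pd.DataFrame({'requested': np.ones(N)})
population.iloc[0:round(N*true_prop) - 1, 0] = 0

# average estimate under a few talking factors (50 samples each for speed)
avg_est = {}
for talk_factor in [1, 5, 10, 20]:
    weights = np.where(population.requested == 1, 1.0, talk_factor)
    ests = [1 - population.sample(n = n, weights = weights).requested.mean()
            for _ in range(50)]
    avg_est[talk_factor] = float(np.mean(ests))
    print(talk_factor, round(avg_est[talk_factor], 4))
```

A talking factor of 1 (no differential response) should recover an average estimate near the true proportion, with the bias growing as the factor increases.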
Simulating missingness
Some interviews were terminated early because the respondent hung up or declined to proceed. We can think of these instances as missing values.
The cell below creates missingness probabilities under the assumption that those who did request mail ballots are more likely to terminate interviews than those who did not. The calculation is exactly the same as that used to figure sampling weights.
# assume requesters are more likely to terminate early by this factor
missing_factor = 12

# overall observed missing rate
p_missing = 0.5

# proportions of requesters/nonrequesters in sample
p_request_samp = samp.requested.mean()
p_nrequest_samp = 1 - p_request_samp

# conditional probabilities of missing given request status
p_missing_nrequest = p_missing/(p_nrequest + missing_factor*p_request)
p_missing_request = missing_factor*p_missing_nrequest

print('missing rate for requesters: ', p_missing_request)
print('missing rate for nonrequesters: ', p_missing_nrequest)
missing rate for requesters: 0.502302218501465
missing rate for nonrequesters: 0.04185851820845542
The following cell introduces missing values at random according to the missingness mechanism specified above.
# append missingness probabilities
samp.loc[samp.requested == 1, 'missing_weight'] = p_missing_request
samp.loc[samp.requested == 0, 'missing_weight'] = p_missing_nrequest

# make a copy of the sample
samp_incomplete = samp.copy()

# introduce missing values at random
np.random.seed(41923)
samp_incomplete['missing'] = np.random.binomial(n = 1, p = samp_incomplete.missing_weight.values)
samp_incomplete.loc[samp_incomplete.missing == 1, 'requested'] = float('nan')
Finally, ignoring these missing responses yields the estimate below of the proportion of erroneous/fraudulent ballot requests.
print('estimated fraudulent/erroneous requests: ',
      1 - samp_incomplete.requested.mean())
print('true value: ', true_prop)
print('estimation error: ',
      1 - samp_incomplete.requested.mean() - true_prop)
estimated fraudulent/erroneous requests: 0.08307453416149069
true value: 0.005
estimation error: 0.07807453416149068
Extrapolating this estimate to raw vote counts among the population yields the following:
print('estimated fraudulent/erroneous requests: ',
      np.round(N*(1 - samp_incomplete.requested.mean())))
print('true value: ', N*true_prop)
print('estimation error: ',
      np.round(N*(1 - samp_incomplete.requested.mean() - true_prop)))
estimated fraudulent/erroneous requests: 12461.0
true value: 750.0
estimation error: 11711.0
Repeating the entire experiment – sampling from the population and then introducing missing values – many times will allow for an assessment of the additional bias due to missingness.
# for reproducibility
np.random.seed(41923)

# number of simulated samples
nsim = 1000

# storage for estimates
estimates = np.zeros([nsim, 2])

# for each simulation
for i in range(0, nsim):
    # draw sample from population
    samp_complete = population.sample(n = n,
                                      replace = False,
                                      weights = 'sample_weight')

    # compute mean from complete data
    estimates[i, 0] = samp_complete.requested.mean()

    # append missingness probabilities
    samp_complete.loc[samp_complete.requested == 1, 'missing_weight'] = p_missing_request
    samp_complete.loc[samp_complete.requested == 0, 'missing_weight'] = p_missing_nrequest

    # make a copy of the sample
    samp_incomplete = samp_complete.copy()

    # introduce missing values at random
    samp_incomplete['missing'] = np.random.binomial(n = 1, p = samp_incomplete.missing_weight.values)
    samp_incomplete.loc[samp_incomplete.missing == 1, 'requested'] = float('nan')

    # compute mean from incomplete data
    estimates[i, 1] = samp_incomplete.requested.mean()
Note that both an estimate with complete data (no missing values) and with incomplete data (with missing values that are dropped) are computed. This allows us to compute average errors with and without missingness, and thus, average excess error due to missingness.
avg_estimates = 1 - np.mean(estimates, axis = 0)
print('average estimate without missingness: ', avg_estimates[0])
print('average estimate with missingness: ', avg_estimates[1])
print('total bias: ', avg_estimates[1] - true_prop)
print('bias due to sampling: ', avg_estimates[0] - true_prop)
print('excess bias due to missingness: ', avg_estimates[1] - avg_estimates[0])
average estimate without missingness: 0.04469400000000001
average estimate with missingness: 0.08155802934699397
total bias: 0.07655802934699396
bias due to sampling: 0.039694000000000014
excess bias due to missingness: 0.036864029346993954
In terms of raw vote counts, these same quantities are:
print('average estimate without missingness: ', N*avg_estimates[0])
print('average estimate with missingness: ', N*avg_estimates[1])
print('total bias: ', N*(avg_estimates[1] - true_prop))
print('bias due to sampling: ', N*(avg_estimates[0] - true_prop))
print('excess bias due to missingness: ', N*(avg_estimates[1] - avg_estimates[0]))
average estimate without missingness: 6704.100000000001
average estimate with missingness: 12233.704402049094
total bias: 11483.704402049094
bias due to sampling: 5954.100000000002
excess bias due to missingness: 5529.604402049093
Activity 2: tinker with missingness mechanism
Go back and adjust the settings for inputting missing values. Choose a missingness factor and overall nonresponse rate that interest you. Some questions you could explore are:
- what is the effect of a very high nonresponse rate with little differentiation between requesters and nonrequesters?
- are there any missing data mechanisms that would actually reduce bias?
- if the missing mechanism is similar to the sampling mechanism in how it favors nonrequesters, which has the larger effect?
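For the first question, a standalone sketch (using a large synthetic stand-in for the sample rather than samp itself, so the noise from a single small draw doesn't dominate) suggests that even severe nonresponse adds essentially no bias when it does not differentiate between requesters and nonrequesters:

```python
import numpy as np

np.random.seed(41923)

# synthetic responses: 0.5% non-requesters, as in the population
responses = np.random.binomial(1, 0.995, size = 100000).astype(float)

# 90% nonresponse, applied completely at random (missing factor = 1)
dropped = np.random.binomial(1, 0.9, size = responses.size).astype(bool)
responses[dropped] = np.nan

est = 1 - np.nanmean(responses)
print('estimate ignoring missing values: ', est)
```

The estimate stays near the true \(0.5\%\) because dropping observations completely at random shrinks the sample without shifting its composition.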
Extra credit assignment
Design and carry out a simulation to further explore how bias due to sampling changes as a function of the factor by which respondents who did not request ballots are more likely to be interviewed. Ignore the potential impact of missing values and focus just on the sampling design.
Fix an evenly-spaced grid of values for the talking factor between 1 and 25. For each value, simulate 1000 samples and calculate the estimate of the proportion of fraudulent/erroneous ballot requests for each sample. For each set of 1000 samples, store: (1) the average estimate; (2) the standard deviation of estimates. Plot the estimated bias (average estimate - true proportion) as a function of talking factor, and add uncertainty bands at \(\pm 2SD\). Repeat the entire procedure for overall response rates of \(10\%\), \(20\%\), and \(30\%\).
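One possible scaffold for this study is sketched below, with a coarse grid and a small number of simulations per setting so it runs quickly; the variable names are illustrative, and you should widen the grid and raise nsim to the values specified above for the actual assignment.

```python
import numpy as np
import pandas as pd

np.random.seed(41021)

# population as in the case study
N, true_prop, n = 150000, 0.005, 2500
population = pd.DataFrame({'requested': np.ones(N)})
population.iloc[0:round(N*true_prop) - 1, 0] = 0

factors = np.linspace(1, 25, 5)   # widen to an evenly spaced grid of your choice
nsim = 20                         # increase to 1000 for the assignment
results = np.zeros([len(factors), 2])

for j, factor in enumerate(factors):
    weights = np.where(population.requested == 1, 1.0, factor)
    ests = np.array([1 - population.sample(n = n, weights = weights).requested.mean()
                     for _ in range(nsim)])
    results[j] = ests.mean(), ests.std()

# estimated bias and spread at each talking factor
print(pd.DataFrame({'factor': factors,
                    'bias': results[:, 0] - true_prop,
                    'sd': results[:, 1]}))
```

From here, plot the bias column against the factor grid and add \(\pm 2SD\) bands (matplotlib's fill_between is one option), then repeat the procedure at each required response rate.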
Prepare and submit a notebook detailing the simulation study and briefly explaining the results.