```python
# Initialize Otter
import otter
grader = otter.Notebook("lab2-sampling.ipynb")
```
Lab 2: Sampling designs and statistical bias
```python
import numpy as np
import pandas as pd
import altair as alt

# disable row limit for plotting
alt.data_transformers.disable_max_rows()

# uncomment to ensure graphics display with pdf export
# alt.renderers.enable('mimetype')
```
In this lab you’ll explore through simulation how nonrandom sampling can produce datasets with statistical properties that are distorted relative to the population that the sample was drawn from. This kind of distortion is known as bias.
In common usage, the word ‘bias’ means disproportion or unfairness. In statistics, the concept has the same connotation – biased sampling favors certain observational units over others, and biased estimates are estimates that favor larger or smaller values than the truth. The goal of this lab is to refine your understanding of what statistical bias is and is not, and to develop your intuition about the mechanisms by which bias can be introduced and the effect this can have on sample statistics.
Objectives:
- Simulate biased and unbiased sampling designs
- Examine the impact of sampling bias on the sample mean
- Apply a simple bias correction by inverse probability weighting
Background
Sampling designs
The sampling design of a study refers to the way observational units are selected from the collection of all observational units. Any design can be expressed by the probability that each unit is included in the sample. In a random sample, all units are equally likely to be included.
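As a quick aside, the contrast between an equal-probability design and an unequal-probability design can be sketched with NumPy’s `Generator.choice`; the tiny population and the weights below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
units = np.arange(10)

# equal-probability design: every unit is equally likely to be included
srs = rng.choice(units, size=4, replace=False)

# unequal-probability design: unit i is included proportionally to i + 1
# (hypothetical weights, chosen only to make the contrast visible)
p = (units + 1) / (units + 1).sum()
weighted = rng.choice(units, size=4, replace=False, p=p)

print(srs, weighted)
```

Later in the lab, `DataFrame.sample(..., weights=...)` plays the same role for weighted draws from a pandas dataframe.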
For example, you might want to learn about U.S. residents (population), but only be able for ethical or practical reasons to study adults (sampling frame), and decide to do a mail survey of 2000 randomly selected addresses in each state (sampling design). Each collection of 2000 addresses may constitute a random sample of households, but even with a 100% response rate the survey results will not be a random sample of adult U.S. residents because individuals share addresses and the population sizes are different from state to state.
Bias
Formally, bias describes the ‘typical’ deviation of a sample statistic from the corresponding population value.
For example, if a particular sampling design tends to produce an average measurement around 1.5 units, but the true average in the population is 2 units, then the estimate has a bias of -0.5 units. The language ‘typical’ and ‘tends to’ is important here. Estimates are never perfect, so just because an estimate is off by -0.5 units for one sample doesn’t make it biased – it is only biased if it is consistently off.
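To make ‘consistently off’ concrete, here is a small simulation sketch (all numbers hypothetical): a noisy estimator that is systematically low by 0.5 units produces errors that vary from sample to sample but average out to its bias:

```python
import numpy as np

rng = np.random.default_rng(1)
truth = 2.0

# a hypothetical estimator that is noisy and systematically low by 0.5
estimates = truth - 0.5 + rng.normal(scale=0.3, size=10_000)

# any single error can be large or small...
print(estimates[0] - truth)

# ...but the average error approximates the bias, about -0.5
print((estimates - truth).mean())
```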
Although bias is technically a property of a sample statistic (like the sample average), it’s common to talk about a biased sample – this term refers to a dataset collected using a sampling design that produces biased statistics.
This is exactly what you’ll explore in this lab – the relationship between sampling design and bias.
Simulated data
You will be simulating data in this lab. Simulation is a great means of exploration because you can control the population properties, which are generally unknown in practice.
When working with real data, you just have one dataset, and you don’t know any of the properties of the population or what might have happened if a different sample were collected. That makes it difficult to understand sampling variation and impossible to directly compare the sample properties to the population properties.
With simulated data, by contrast, you control how data are generated with exact precision – so by extension, you know everything there is to know about the population. In addition, repeated simulation of data makes it possible to explore the typical behavior of a particular sampling design, so you can learn ‘what usually happens’ for a particular sampling design by direct observation.
Scenario 1: eucalyptus seed diameters
In this scenario you’ll compare the sample mean and the distribution of sample values for a single variable with the population mean and distribution of population values for an unbiased sampling design.
Hypothetical population
To provide a little context to this scenario, imagine that you’re measuring eucalyptus seeds to determine their typical diameter. The cell below simulates diameter measurements for a hypothetical population of 5000 seeds; imagine that this is the total number of seeds in a small grove at some point in time.
```python
# simulate seed diameters
np.random.seed(40221)  # for reproducibility
population = pd.DataFrame(
    data = {'diameter': np.random.gamma(shape = 2, scale = 1/2, size = 5000),
            'seed': np.arange(5000)}
).set_index('seed')

# check first few rows
population.head(3)
```
Question 1

Calculate the mean diameter for the hypothetical population and store the value as `mean_pop_diameter`.

```python
mean_pop_diameter = ...
mean_pop_diameter
```

```python
grader.check("q1")
```
Question 2

Calculate the standard deviation of diameters for the hypothetical population and store the value as `std_dev_pop_diameter`.

```python
std_dev_pop_diameter = ...
std_dev_pop_diameter
```

```python
grader.check("q2")
```
The cell below produces a histogram of the population values – the distribution of diameter measurements among the hypothetical population – with a vertical line indicating the population mean.
```python
# base layer
base_pop = alt.Chart(population).properties(width = 400, height = 300)

# histogram of diameter measurements
hist_pop = base_pop.mark_bar(opacity = 0.8).encode(
    x = alt.X('diameter',
              bin = alt.Bin(maxbins = 20),
              title = 'Diameter (mm)',
              scale = alt.Scale(domain = (0, 6))),
    y = alt.Y('count()', title = 'Number of seeds in population')
)

# vertical line for population mean
mean_pop = base_pop.mark_rule(color = 'blue').encode(
    x = 'mean(diameter)'
)

# display
hist_pop + mean_pop
```
Random sampling
Imagine that your sampling design involves collecting bunches of plant material from several locations in the grove and sifting out the seeds with a fine sieve until you obtain 250 seeds. We’ll suppose that using your collection method, any of the 5000 seeds is equally likely to be obtained, so that your 250 seeds comprise a random sample of the population.
We can simulate samples obtained using your hypothetical design by drawing values without replacement from the population.
```python
# draw a random sample of seeds
np.random.seed(40221)  # for reproducibility
sample = population.sample(n = 250, replace = False)
```
Question 3

Calculate the mean diameter of seeds in the simulated sample and store the value as `mean_sample_diameter`.

```python
mean_sample_diameter = ...
mean_sample_diameter
```

```python
grader.check("q3")
```
You should see above that the sample mean is close to the population mean. In fact, the sample resembles the population closely overall; this can be seen by comparing the distribution of sample values with the distribution of population values.
```python
# base layer
base_samp = alt.Chart(sample).properties(width = 400, height = 300)

# histogram of diameter measurements
hist_samp = base_samp.mark_bar(opacity = 0.8).encode(
    x = alt.X('diameter',
              bin = alt.Bin(maxbins = 20),
              scale = alt.Scale(domain = (0, 6)),
              title = 'Diameter (mm)'),
    y = alt.Y('count()', title = 'Number of seeds in sample')
)

# vertical line for sample mean
mean_samp = base_samp.mark_rule(color = 'blue').encode(
    x = 'mean(diameter)'
)

# display
hist_samp + mean_samp | hist_pop + mean_pop
```
While there are some small differences, the overall shape is similar and the sample mean is almost exactly the same as the population mean. So with this sampling design, you obtained a dataset with few distortions of the population properties, and the sample mean is a good estimate of the population mean.
Assessing bias through simulation
You may wonder: does that happen all the time, or was this just a lucky draw? This question can be answered by simulating a large number of samples and checking the average behavior to see whether the undistorted representation of the population is typical for this sampling design.
The cell below estimates the bias of the sample mean by:

- drawing 1000 samples of size 250;
- storing the sample mean from each sample;
- computing the average difference between the sample means and the population mean.

```python
np.random.seed(40221)  # for reproducibility

# number of samples to simulate
nsim = 1000

# storage for the sample means
samp_means = np.zeros(nsim)

# repeatedly sample and store the sample mean
for i in range(nsim):
    samp_means[i] = population.sample(n = 250, replace = False).diameter.mean()
```
The bias of the sample mean is its average distance from the population mean. We can estimate this using our simulation results as follows:
```python
# bias
samp_means.mean() - population.diameter.mean()
```
So the average error observed in 1000 simulations was about 0.001 mm! This suggests that the sample mean is unbiased: on average, there is no error. Therefore, at least with respect to estimating the population mean, random samples appear to be unbiased samples.
However, unbiasedness does not mean that you won’t observe estimation error. There is a natural amount of variability from sample to sample, because in each sample a different collection of seeds is measured. We can estimate this as well using the simulation results by checking the standard deviation of the sample means across all 1000 samples:
```python
samp_means.std()
```
So on average, the sample mean varies by about 0.04 mm from sample to sample.
We could also check how far the sample mean deviates from the population mean on average by computing the root mean squared error:

```python
np.sqrt(np.sum((samp_means - population.diameter.mean())**2)/1000)
```

Note that this is very close to the standard deviation of the sample mean across simulations, but not exactly the same; this latter calculation measures the spread around the population mean rather than around the average of the sample means, and is a conventional measure of estimation accuracy.
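The relationship between these two quantities can be sketched directly (the numbers below are hypothetical, not the lab’s simulation): the squared root mean squared error decomposes exactly into the variance of the estimates plus the squared bias, so RMSE and the standard deviation nearly coincide whenever the bias is near zero.

```python
import numpy as np

rng = np.random.default_rng(2)
truth = 2.0

# hypothetical estimates with bias -0.5 and sampling noise of sd 0.3
estimates = truth - 0.5 + rng.normal(scale=0.3, size=100_000)

bias = estimates.mean() - truth
sd = estimates.std()                              # spread around the average estimate
rmse = np.sqrt(((estimates - truth) ** 2).mean()) # spread around the truth

# exact identity (up to floating point): rmse**2 == sd**2 + bias**2
print(rmse**2 - (sd**2 + bias**2))
```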
The cell below plots a histogram representing the distribution of values of the sample mean across the 1000 samples you simulated (this is known as the sampling distribution of the sample mean). It shows a peak right at the population mean (blue vertical line) but some symmetric variation to either side – most values are between about 0.93 and 1.12.
```python
# plot the simulated sampling distribution
sampling_dist = alt.Chart(pd.DataFrame({'sample mean': samp_means})).mark_bar().encode(
    x = alt.X('sample mean', bin = alt.Bin(maxbins = 30), title = 'Value of sample mean'),
    y = alt.Y('count()', title = 'Number of simulations')
)

sampling_dist + mean_pop
```
Biased sampling
In this scenario, you’ll use the same hypothetical population of eucalyptus seed diameter measurements and explore the impact of a biased sampling design.
In the first design, you were asked to imagine that you collected and sifted plant material to obtain seeds. Suppose you didn’t know that the typical seed is about 1mm in diameter and decided to use a sieve that is a little too coarse, tending to retain only larger seeds and let smaller seeds pass through. As a result, small seeds have a lower probability of being included in the sample and large seeds have a higher probability of being included in the sample.
This kind of sampling design can be described by assigning differential sampling weights \(w_1, \dots, w_N\) to each observation. The cell below defines some hypothetical weights such that larger diameters are more likely to be sampled.
```python
population_mod1 = population.copy()

# inclusion weight as a function of seed diameter
def weight_fn(x, r = 10, c = 1.5):
    out = 1/(1 + np.e**(-r*(x - c)))
    return out

# create a grid of values to use in plotting the function
grid = np.linspace(0, 6, 100)
weight_df = pd.DataFrame(
    {'seed diameter': grid,
     'weight': weight_fn(grid)}
)

# plot of inclusion weight against diameter
weight_plot = alt.Chart(weight_df).mark_area(opacity = 0.3, line = True).encode(
    x = 'seed diameter',
    y = 'weight'
).properties(height = 100)

# show plot
weight_plot
```
The actual probability that a seed is included in the sample – its inclusion probability – is proportional to the sampling weight. These inclusion probabilities \(\pi_i\) can be calculated by normalizing the weights \(w_i\) over all seeds in the population \(i = 1, \dots, 5000\):
\[\pi_i = \frac{w_i}{\sum_{j = 1}^{5000} w_j}\]
It may help you to line this plot up with the population distribution to picture how the weights will be used in sampling. In effect, we will sample more from the right tail of the population distribution, where the weight is nearest to 1.

```python
hist_pop & weight_plot
```
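As a toy illustration of the normalization above (the weights here are made up), consider a miniature ‘population’ of just four seeds:

```python
import numpy as np

# hypothetical sampling weights for four seeds
w = np.array([0.2, 0.4, 0.4, 1.0])

# inclusion probabilities: normalize the weights so they sum to one
pi = w / w.sum()
print(pi)  # [0.1 0.2 0.2 0.5]
```

Note that the last seed is five times as likely to be included as the first, matching the 5:1 ratio of their weights.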
The following cell draws a sample without replacement from the hypothetical seed population, with each seed’s inclusion probability proportional to the weight function above.
```python
# assign weight to each seed
population_mod1['weight'] = weight_fn(population_mod1.diameter)

# draw weighted sample
np.random.seed(40721)
sample2 = population_mod1.sample(n = 250, replace = False, weights = 'weight').loc[:, ['diameter']]
```
Question 4

Calculate the mean diameter of seeds in the simulated sample and store the value as `mean_sample2_diameter`.

```python
mean_sample2_diameter = ...
mean_sample2_diameter
```

```python
grader.check("q4")
```
Question 5

Show side-by-side plots of the distribution of sample values and the distribution of population values, with vertical lines indicating the corresponding mean on each plot.

Hint: copy the cell that produced this plot in scenario 1 and replace `sample` with `sample2`. You’re also welcome to use a different approach.

```python
# base layer
base_samp = ...

# histogram of diameter measurements
hist_samp = ...

# vertical line for sample mean
mean_samp = ...

# combine layers and display
hist_samp + mean_samp | hist_pop + mean_pop
```
Assessing bias through simulation
Here you’ll mimic the simulation done in scenario 1 to assess the bias of the sample mean under this new sampling design.
```python
population_mod1.head()
```
Question 6

Investigate the bias of the sample mean by:

- drawing 1000 samples with observations weighted by inclusion probability;
- storing the sample mean from each sample in an array named `samp_means`;
- computing the average difference between the sample means and the population mean (in that order!) and storing the result as `avg_diff`.

(Hint: copy the cell that performs this simulation in scenario 1, and be sure to replace `population` with `population_mod1` and adjust the sampling step to include `weights = ...` with the appropriate argument.)

```python
np.random.seed(40221)  # for reproducibility

# number of samples to simulate

# storage for the sample means

# repeatedly sample and store the sample mean in the samp_means array

# bias
avg_diff = ...
avg_diff
```

```python
grader.check("q6")
```
Question 7
Does this sampling design seem to introduce bias? If so, does the sample mean tend to over-estimate or under-estimate the population mean?
Type your answer here, replacing this text.
Scenario 2: hawks
In this scenario, you’ll explore sampling from a population with group structure – frequently bias can arise from inadvertent uneven sampling of groups within a population.
Hypothetical population
Suppose you’re interested in determining the average beak-to-tail length of red-tailed hawks to help differentiate them from other hawks by sight at a distance. Females and males differ slightly in length – females are generally larger than males. The cell below generates length measurements for a hypothetical population of 3000 females and 2000 males.
```python
# for reproducibility
np.random.seed(40721)

# simulate hypothetical population
female_hawks = pd.DataFrame(
    data = {'length': np.random.normal(loc = 57.5, scale = 3, size = 3000),
            'sex': np.repeat('female', 3000)}
)

male_hawks = pd.DataFrame(
    data = {'length': np.random.normal(loc = 50.5, scale = 3, size = 2000),
            'sex': np.repeat('male', 2000)}
)

population_hawks = pd.concat([female_hawks, male_hawks], axis = 0)

# preview
population_hawks.groupby('sex').head(2)
```
The cell below produces a histogram of the lengths in the population overall (bottom panel) and when distinguished by sex (top panel).
```python
base = alt.Chart(population_hawks).properties(height = 200)

hist = base.mark_bar(opacity = 0.5, color = 'red').encode(
    x = alt.X('length',
              bin = alt.Bin(maxbins = 40),
              scale = alt.Scale(domain = (40, 70)),
              title = 'length (cm)'),
    y = alt.Y('count()',
              stack = None,
              title = 'number of birds')
)

hist_bysex = hist.encode(color = 'sex').properties(height = 100)

hist_bysex & hist
```
The population mean – average length of both female and male red-tailed hawks – is shown below.
```python
# population mean
population_hawks.mean(numeric_only = True)
```
First try drawing a random sample from the population:
```python
# for reproducibility
np.random.seed(40821)

# randomly sample
sample_hawks = population_hawks.sample(n = 300, replace = False)
```
Question 8

Do you expect that the sample will contain equal numbers of male and female hawks? Think about this for a moment (you don’t have to provide a written answer), and then compute the proportion of individuals of each sex in the sample and store the result as a dataframe named `proportion_hawks_sample`. The dataframe should have one column named `proportion` and two rows indexed by `sex`.

Hint: group by sex, use `.count()`, and divide by the sample size. Be sure to rename the output column appropriately, as the default behavior produces a column called `length`.

```python
proportion_hawks_sample = ...
proportion_hawks_sample
```

```python
grader.check("q8")
```
The sample mean is shown below, and is fairly close to the population mean. This should be expected, since you already saw in scenario 1 that random sampling is an unbiased sampling design with respect to the mean.
```python
sample_hawks.mean(numeric_only = True)
```
Biased sampling
Let’s now consider a biased sampling design. Length measurements are usually taken from dead specimens that are found opportunistically. Imagine that male mortality is higher, so there are better chances of finding dead males than dead females. Suppose in particular that specimens are five times as likely to be male; to represent this situation, we’ll assign sampling weights of 5/6 to all male hawks and weights of 1/6 to all female hawks.
```python
def weight_fn(sex, p = 5/6):
    if sex == 'male':
        out = p
    else:
        out = 1 - p
    return out

weight_df = pd.DataFrame(
    {'length': [50.5, 57.5],
     'weight': [5/6, 1/6],
     'sex': ['male', 'female']})

wt = alt.Chart(weight_df).mark_bar(opacity = 0.5).encode(
    x = alt.X('length', scale = alt.Scale(domain = (40, 70))),
    y = alt.Y('weight', scale = alt.Scale(domain = (0, 1))),
    color = 'sex'
).properties(height = 70)

hist_bysex & wt
```
Question 9

Draw a weighted sample `sample_hawks_biased` from the population `population_hawks` using the weights defined by `weight_fn`, and compute and store the value of the sample mean as `sample_hawks_biased_mean`.

```python
# for reproducibility
np.random.seed(40821)

# assign weights

# randomly sample

# compute mean
sample_hawks_biased_mean = ...
sample_hawks_biased_mean
```

```python
grader.check("q9")
```
Question 10

Investigate the bias of the sample mean by:

- drawing 1000 samples with observations weighted by `weight_fn`;
- storing the sample mean from each sample in an array named `samp_means_hawks`;
- computing the average difference between the sample means and the population mean and storing the resulting value as `avg_diff_hawks`.

```python
np.random.seed(40221)  # for reproducibility

# number of samples to simulate

# storage for the sample means
samp_means_hawks = ...

# repeatedly sample and store the sample mean in the samp_means_hawks array

# bias
avg_diff_hawks = ...
avg_diff_hawks
```

```python
grader.check("q10")
```
Question 11

Reflect a moment on your simulation result in question 10. If instead female mortality is higher and specimens for measurement are collected opportunistically, as described in the previous sampling design, do you expect that the average length in the sample will be an underestimate or an overestimate of the population mean? Explain why in 1-2 sentences, and carry out a simulation to check your intuition.

Type your answer here, replacing this text.

```python
np.random.seed(40221)  # for reproducibility

# invert weights

# storage for the sample means

# repeatedly sample and store the sample mean

# compute bias
estimated_bias = ...
estimated_bias
```
Bias correction
What can be done if a sampling design is biased? Is there any remedy?
You’ve seen some examples above of how bias can arise from a sampling mechanism in which units have unequal chances of being selected in the sample. Ideally, we’d work with random samples all the time, but that’s not very realistic in practice. Fortunately, biased sampling is not a hopeless case – it is possible to apply bias corrections if you have good information about which individuals were more likely to be sampled.
To illustrate how this would work, let’s revisit the biased sampling in scenario 1 – sampling larger eucalyptus seeds more often than smaller ones. Imagine you realize the mistake and conduct a quick experiment with your sieve to determine the proportion of seeds of each size that pass through, and use this to estimate the inclusion probabilities. (To simplify this exercise, we’ll just use the sampling weights we defined to calculate the actual inclusion probabilities.)
The cell below regenerates the population and draws a weighted sample as in scenario 1:
```python
# simulate seed diameters
np.random.seed(40221)  # for reproducibility
population3 = pd.DataFrame(
    data = {'diameter': np.random.gamma(shape = 2, scale = 1/2, size = 5000),
            'seed': np.arange(5000)}
).set_index('seed')

# sampling weight as a function of seed diameter
def weight_fn(x, r = 2, c = 2):
    out = 1/(1 + np.e**(-r*(x - c)))
    return out

# assign sampling weight to each seed
population3['samp_weight'] = weight_fn(population3.diameter)

# draw weighted sample
np.random.seed(40721)
sample3 = population3.sample(n = 250, replace = False, weights = 'samp_weight')
```
The sample mean and population mean you calculated earlier are shown below:
```python
# print sample and population means
pd.Series({'sample mean': sample3.diameter.mean(), 'population mean': population3.diameter.mean()})
```
We can obtain an unbiased estimate of the population mean by computing a weighted average of the diameter measurements instead of the plain sample average, weighting each measurement in inverse proportion to its sampling weight:
\[\text{weighted average} = \sum_{i = 1}^{250} \underbrace{\left(\frac{w_i^{-1}}{\sum_{j = 1}^{250} w_j^{-1}}\right)}_{\text{bias adjustment}} \times \text{diameter}_i\]
This might look a little complicated, but the idea is simple – the weighted average corrects for bias by simply up-weighting observations with a lower sampling weight and down-weighting observations with a higher sampling weight.
The cell below performs this calculation.
```python
# compute bias adjustment
sample3['bias_adjustment'] = (sample3.samp_weight**(-1))/(sample3.samp_weight**(-1)).sum()

# weight diameter measurements
sample3['weighted_diameter'] = sample3.diameter*sample3.bias_adjustment

# sum to compute weighted average
sample3.weighted_diameter.sum()
```
Notice that the weighted average successfully corrected for the bias:
```python
# print sample and population means
pd.Series({'sample mean': sample3.diameter.mean(),
           'weighted average': sample3.weighted_diameter.sum(),
           'population mean': population3.diameter.mean()})
```
Question 12

Perform a bias correction for the hawk scenario, using the actual inclusion probabilities, in the following steps:

- redraw `sample_hawks_biased` from question 9 exactly (use the same RNG seed) and retain the `weight` column
- compute the ‘raw’ sample mean and store the resulting value as `sample_mean`
- compute a bias adjustment factor as illustrated above by normalizing the inverse weights, and append the adjustment factor as a new column in `sample_hawks_biased` named `bias_adjustment`
- compute a weighted average length and store the resulting value as `sample_mean_corrected`

Print the raw and corrected sample means.

```python
np.random.seed(40821)  # for reproducibility

# randomly sample
sample_hawks_biased = ...

# define bias correction factor

# weight lengths and compute weighted sum
sample_mean_corrected = ...

# compare with unweighted mean
sample_mean = ...

# print
print('sample mean', sample_mean)
print('weighted mean', sample_mean_corrected)
```

```python
grader.check("q12")
```
Takeaways
These simulations illustrate through a few simple examples that random sampling – a sampling design where each unit is equally likely to be selected – produces unbiased sample means. That means that ‘typical samples’ will yield sample averages that are close to the population value. By contrast, deviations from random sampling tend to yield biased sample averages – in other words, nonrandom sampling tends to distort the statistical properties of the population in ways that can produce misleading conclusions (if uncorrected).
Here are a few key points to reflect on:
- bias is not a property of an individual sample, but of a sampling design
- unbiased sampling designs tend to produce faithful representations of populations
- but there are no guarantees for individual samples
- if you hadn’t known the population distributions, there would have been no computational method to detect bias
- in practice, it’s necessary to reason about whether the sampling design is sound
- the sample statistic (sample mean) was only reliable when the sampling design was sound
- the quality of data collection is arguably more important for reaching reliable conclusions than the choice of statistic or method of analysis
Submission
- Save the notebook.
- Restart the kernel and run all cells. (CAUTION: if your notebook is not saved, you will lose your work.)
- Carefully look through your notebook and verify that all computations execute correctly and all graphics are displayed clearly. You should see no errors; if there are any errors, make sure to correct them before you submit the notebook.
- Download the notebook as an `.ipynb` file. This is your backup copy.
- Export the notebook as a PDF and upload it to Gradescope.
To double-check your work, the cell below will rerun all of the autograder tests.
```python
grader.check_all()
```