No class Monday 5/29 due to Memorial Day.
The standard measure of predictive accuracy in regression is mean square error (MSE):
$$\text{MSE}(y, \hat{y}) = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$$
To avoid bias, it is common practice to partition the data into nonoverlapping subsets:

- partition the data into training and test sets;
- fit the model to the training partition only;
- evaluate predictive accuracy on the test/validation partition.

The first two steps are sketched below; the evaluation step follows.
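A minimal sketch of the partition and fit steps, assuming the full data set is in a data frame called data (a hypothetical name) with the gap and log_income columns used in the evaluation code below:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# partition the data: randomly assign about 70% of rows to training, the rest to testing
# ('data' is a hypothetical name for the full data frame)
train = data.sample(frac = 0.7, random_state = 42518)
test = data.drop(index = train.index)

# fit gap ~ log_income using the training partition only
x_train = sm.tools.add_constant(train.log_income)
slr = sm.OLS(endog = train.gap, exog = x_train)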
# compute predictions
x_test = sm.tools.add_constant(test.log_income)
preds = slr.fit().get_prediction(x_test)
y_hat = preds.predicted_mean
# mean square error
pred_errors = test.gap - y_hat
mse = (pred_errors**2).mean()
print('root mean square error: ', np.sqrt(mse))
root mean square error: 0.14377667235816582
Interpretation:
The model's predictions vary about the observed values with a standard deviation of roughly 0.144, in the units of the response (standard deviations of the national average).
Compare with the RMSE computed from the fitted values on the training partition:
rmse on training partition: 0.10843963485562962
rmse on test partition: 0.14377667235816582
Note also that the training MSE is simply the estimate of the error variance rescaled by $\frac{n-2}{n}$.
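To make the connection concrete, the training RMSE and the rescaled error variance estimate can be computed side by side; a sketch, using the (hypothetical) train partition from the sketch above:

# fitted values on the training partition
fitted = slr.fit().fittedvalues
rmse_train = np.sqrt(((train.gap - fitted)**2).mean())

# same quantity obtained by rescaling the error variance estimate:
# MSE on the training partition = (n - 2)/n * sigma^2-hat
n = len(train)
rmse_from_scale = np.sqrt((n - 2) / n * slr.fit().scale)

print('rmse on training partition: ', rmse_train)
print('rmse from rescaled error variance: ', rmse_from_scale)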
Since the model is fit by minimizing the training error, out-of-sample predictions are necessary to get an honest sense of predictive reliability.
The simple linear regression model has just one explanatory variable:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n \tag{SLR}$$
The multiple linear regression model is a direct extension of the simple linear model to $p-1$ variables $x_{i1}, \dots, x_{i,p-1}$:
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n \tag{MLR}$$
The model in matrix form is $\mathbf{y} = \mathbf{X}\beta + \epsilon$, where:
$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix} 1 & x_{11} & \cdots & x_{1,p-1} \\ 1 & x_{21} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{n,p-1} \end{bmatrix}}_{\mathbf{X}} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}}_{\beta} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}}_{\epsilon}$$
Carrying out the arithmetic on the right-hand side:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1} = \begin{bmatrix} \beta_0 + \beta_1 x_{11} + \cdots + \beta_{p-1} x_{1,p-1} + \epsilon_1 \\ \beta_0 + \beta_1 x_{21} + \cdots + \beta_{p-1} x_{2,p-1} + \epsilon_2 \\ \vdots \\ \beta_0 + \beta_1 x_{n1} + \cdots + \beta_{p-1} x_{n,p-1} + \epsilon_n \end{bmatrix}_{n \times 1}$$
This is exactly the model relationship as written before, enumerated for each i.
Let’s consider the model you fit in lab:
$$(\text{fertility rate})_i = \beta_0 + \beta_1(\text{education})_i + \beta_2(\text{HDI})_i + \epsilon_i$$
Estimation and uncertainty quantification are exactly the same as in the simple linear model.
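In code, the fit looks just like the simple case; a minimal sketch of how it might be set up, using the fertility data frame and column names that appear later in these notes:

# explanatory variable matrix (education and HDI) and response
x = sm.tools.add_constant(fertility[['educ_expected_yrs_f', 'hdi']])
y = fertility.fertility_total

# fit fertility ~ education + HDI
mlr = sm.OLS(endog = y, exog = x)
mlr.fit().params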
The least squares estimates are:
$$\hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \vdots \\ \hat{\beta}_{p-1} \end{bmatrix} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
This is the unique minimizer of the residual sum of squares:
$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ (\mathbf{y} - \mathbf{X}\beta)'(\mathbf{y} - \mathbf{X}\beta) \right\}$$
An estimate of the error variance is:
$$\hat{\sigma}^2 = \frac{1}{n-p} \sum_{i=1}^{n} \underbrace{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_{p-1} x_{i,p-1}\right)^2}_{i\text{th squared model residual}}$$
There are several alternative ways of writing the estimator ˆσ2.
The variances and covariances of the coefficient estimates are found in exactly the same way as before, only now they form a $p \times p$ (instead of $2 \times 2$) matrix:
$$\mathbf{V} = \begin{bmatrix}
\text{var}\,\hat{\beta}_0 & \text{cov}(\hat{\beta}_0, \hat{\beta}_1) & \cdots & \text{cov}(\hat{\beta}_0, \hat{\beta}_{p-1}) \\
\text{cov}(\hat{\beta}_0, \hat{\beta}_1) & \text{var}\,\hat{\beta}_1 & \cdots & \text{cov}(\hat{\beta}_1, \hat{\beta}_{p-1}) \\
\vdots & \vdots & \ddots & \vdots \\
\text{cov}(\hat{\beta}_0, \hat{\beta}_{p-1}) & \text{cov}(\hat{\beta}_1, \hat{\beta}_{p-1}) & \cdots & \text{var}\,\hat{\beta}_{p-1}
\end{bmatrix} = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}$$
This matrix is again estimated by plugging in $\hat{\sigma}^2$ (the estimate) for $\sigma^2$:
$$\hat{\mathbf{V}} = \begin{bmatrix}
\hat{v}_{11} & \hat{v}_{12} & \cdots & \hat{v}_{1p} \\
\hat{v}_{21} & \hat{v}_{22} & \cdots & \hat{v}_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{v}_{p1} & \hat{v}_{p2} & \cdots & \hat{v}_{pp}
\end{bmatrix} = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$$
The square roots of the diagonal elements give standard errors:
$$SE(\hat{\beta}_0) = \sqrt{\hat{v}_{11}}, \quad SE(\hat{\beta}_1) = \sqrt{\hat{v}_{22}}, \quad \dots, \quad SE(\hat{\beta}_{p-1}) = \sqrt{\hat{v}_{pp}}$$
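These formulas can be checked directly with numpy; a sketch using the x and y from the fertility fit above (not how statsmodels computes things internally, but numerically equivalent):

# design matrix and response as arrays
X_mat = np.asarray(x, dtype = float)
y_vec = np.asarray(y, dtype = float)
n, p = X_mat.shape

# least squares estimates: (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_vec)

# error variance estimate: sum of squared residuals divided by n - p
resid = y_vec - X_mat @ beta_hat
sigma2_hat = (resid**2).sum() / (n - p)

# estimated variance-covariance matrix and standard errors
V_hat = sigma2_hat * np.linalg.inv(X_mat.T @ X_mat)
se = np.sqrt(np.diag(V_hat))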
Just as before, a table is helpful:
|                     | estimate  | standard error |
|---------------------|-----------|----------------|
| const               | 7.960371  | 0.244939       |
| educ_expected_yrs_f | -0.201996 | 0.042194       |
| hdi                 | -4.132623 | 0.680155       |
| error variance      | 0.346909  | NaN            |
The $R^2$ statistic – the proportion of variance explained – is computed the same way as before:
$$R^2 = \frac{\hat{\sigma}^2_{\text{raw}} - \frac{n-p}{n-1}\,\hat{\sigma}^2}{\hat{\sigma}^2_{\text{raw}}}$$
where $\hat{\sigma}^2_{\text{raw}}$ is the sample variance of the response.
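A quick numerical check of this identity, assuming $\hat{\sigma}^2_{\text{raw}}$ is the sample variance of the response:

rslt_fert = mlr.fit()

# sample variance of the response (divisor n - 1)
sigma2_raw = y.var()
n, p = x.shape

# proportion of variance explained, from the two variance estimates
r2 = (sigma2_raw - (n - p) / (n - 1) * rslt_fert.scale) / sigma2_raw

# agrees with the R^2 reported by statsmodels
print(r2, rslt_fert.rsquared)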
While the computations are essentially unchanged, the presence of multiple predictors alters the interpretation of the parameters.
Incrementing, say, HDI by one unit to obtain an estimated change in mean fertility assumes the other variable (education) is held constant:
$$E(\text{fertility}) + \beta_2 = \beta_0 + \beta_1(\text{education}) + \beta_2\left[(\text{HDI}) + 1\right]$$
So now the interpretation is: a 0.1 increase in HDI is associated with an estimated decrease in fertility rate of 0.413, after accounting for education.
Confidence intervals are the same as before. A 95% interval is:
$$\hat{\beta}_j \pm 2\,SE(\hat{\beta}_j)$$
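For example, using the estimates and standard errors in the table above, a 95% interval for the HDI coefficient is approximately:
$$-4.133 \pm 2(0.680) = (-5.49, -2.77)$$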
Ditto predictions.
Visualization gets a bit trickier because of the presence of that second variable. Here I plotted the marginal relationship with HDI for three quantiles of education.
# set coordinates for grid mesh
educ_gridpts = fertility.educ_expected_yrs_f.quantile([0.2, 0.5, 0.8]).values
hdi_gridpts = np.linspace(fertility.hdi.min(), fertility.hdi.max(), num = 100)
# create prediction grid
g1, g2 = np.meshgrid(educ_gridpts, hdi_gridpts)
grid = np.array([g1.ravel(), g2.ravel()]).T
grid_df = pd.DataFrame(grid, columns = ['educ', 'hdi'])
# format for input to get_prediction()
x_pred = sm.tools.add_constant(grid_df)
# compute predictions
pred_df = mlr.fit().get_prediction(x_pred).summary_frame()
# append to grid
pred_df = pd.concat([grid_df, pred_df], axis = 1)
# plot
scatter = alt.Chart(fertility).mark_circle(opacity = 0.5).encode(
x = alt.X('hdi',
scale = alt.Scale(zero = False),
title = 'Human development index'),
y = alt.Y('fertility_total',
title = 'Fertility rate'),
color = alt.Color('educ_expected_yrs_f',
title = 'Education')
)
model = alt.Chart(pred_df).mark_line().encode(
x = 'hdi',
y = 'mean',
color = 'educ'
)
(scatter + model).configure_axis(
labelFontSize = 14,
titleFontSize = 16
).configure_legend(
labelFontSize = 14,
titleFontSize = 16
)
Tree canopy is thought to reduce summer temperatures in urban areas. Consider tree cover and a few other variables measured on a random sample of roughly 2,000 census blocks with some canopy cover in San Diego:
|       | census_block_GEOID | tree_cover | mean_summer_temp | mean_income | pop_density |
|-------|--------------------|------------|------------------|-------------|-------------|
| 20504 | 6.073020e+13       | 25.798277  | 33.859524        | 66738       | very low    |
| 20849 | 6.073000e+13       | 11.439114  | 32.150000        | 63189       | very low    |
| 10621 | 6.073020e+13       | 26.661660  | 32.404167        | 35099       | high        |
McDonald RI, Biswas T, Sachar C, Housman I, Boucher TM, Balk D, et al. (2021) The tree cover and temperature disparity in US urbanized areas: Quantifying the association with income across 5,723 communities. PLoS ONE 16(4). doi:10.1371/journal.pone.0249715.
Is higher tree cover associated with lower summer temperatures?
We can use SLR to quantify the association:
# explanatory and response variables
x = sm.tools.add_constant(trees.tree_cover)
y = trees.mean_summer_temp
# fit temp ~ cover
slr = sm.OLS(endog = y, exog = x)
slr.fit().params
const 34.994336
tree_cover -0.073114
dtype: float64
Each 13.7% increase in tree canopy cover is associated with an estimated 1-degree decrease in mean summer surface temperatures.
(How did I get 13.7%?)
Confidence intervals:
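The interval endpoints shown below can be obtained from the fitted model with the conf_int method; for example:

# 95% confidence intervals for the coefficients (alpha = 0.05 by default)
ci = slr.fit().conf_int()
ci.columns = ['lwr', 'upr']
ci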
|            | lwr       | upr       |
|------------|-----------|-----------|
| const      | 34.876501 | 35.112170 |
| tree_cover | -0.083032 | -0.063195 |
Interpretation: with 95% confidence, a 10% increase in tree cover is associated with an estimated decrease in mean summer surface temperature of between 0.6 and 0.8 degrees Celsius.
Is tree cover associated with affluence and population density?
We will model tree_cover as a function of mean_income and pop_density.
The population density is recorded as a categorical variable:
|       | pop_density | log_cover | log_income |
|-------|-------------|-----------|------------|
| 20504 | very low    | 3.250308  | 11.108530  |
| 20849 | very low    | 2.437039  | 11.053886  |
| 10621 | high        | 3.283227  | 10.465928  |
| 9031  | low         | 2.311344  | 10.309253  |
At face value, it might seem we could simply fit this model:
$$\underbrace{\log(\text{cover}_i)}_{y_i} = \beta_0 + \beta_1\underbrace{\log(\text{income}_i)}_{x_{i1}} + \beta_2\underbrace{\text{density}_i}_{x_{i2}} + \epsilon_i, \qquad i = 1, \dots, 2000$$
But this doesn’t quite make sense, because the values of $\text{density}_i$ would be words!
The solution to this issue is to encode each level of the categorical variable separately using a set of indicator variables.
For instance, to indicate whether a census block is of low population density, we can use the indicator:
$$I(\text{density} = \text{low}) = \begin{cases} 1 & \text{if population density is low} \\ 0 & \text{otherwise} \end{cases}$$
We can encode the levels of pop_density using a collection of indicators:
|       | pop_density | low | medium | high |
|-------|-------------|-----|--------|------|
| 20504 | very low    | 0   | 0      | 0    |
| 20849 | very low    | 0   | 0      | 0    |
| 10621 | high        | 0   | 0      | 1    |
| 9031  | low         | 1   | 0      | 0    |
| 1004  | high        | 0   | 0      | 1    |
| 15939 | low         | 1   | 0      | 0    |
| 6647  | medium      | 0   | 1      | 0    |
| 17963 | medium      | 0   | 1      | 0    |
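One way to construct these indicator columns (the density_encoded object used in the model-fitting code below) is with pandas' get_dummies; a sketch, assuming the modeling data frame regdata contains the pop_density column:

# indicator (one-hot) encoding of pop_density; 'very low' is treated as the baseline,
# so its column is dropped
density_encoded = pd.get_dummies(regdata.pop_density, dtype = int)
density_encoded = density_encoded[['low', 'medium', 'high']]
density_encoded.head()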
The model with the encoded population density variable is:
$$\underbrace{\log(\text{cover}_i)}_{y_i} = \beta_0 + \beta_1\underbrace{\log(\text{income}_i)}_{x_{i1}} + \beta_2\underbrace{\text{low}_i}_{x_{i2}} + \beta_3\underbrace{\text{med}_i}_{x_{i3}} + \beta_4\underbrace{\text{high}_i}_{x_{i4}} + \epsilon_i$$
The effect is that the model has different intercepts for each population density.
$$\begin{aligned}
\text{density = very low} &\implies E\log(\text{cover}_i) = \underbrace{\beta_0}_{\text{intercept}} + \beta_1\log(\text{income}_i) \\
\text{density = low} &\implies E\log(\text{cover}_i) = \underbrace{\beta_0 + \beta_2}_{\text{intercept}} + \beta_1\log(\text{income}_i) \\
\text{density = medium} &\implies E\log(\text{cover}_i) = \underbrace{\beta_0 + \beta_3}_{\text{intercept}} + \beta_1\log(\text{income}_i) \\
\text{density = high} &\implies E\log(\text{cover}_i) = \underbrace{\beta_0 + \beta_4}_{\text{intercept}} + \beta_1\log(\text{income}_i)
\end{aligned}$$
The explanatory variable matrix X for this model will be of the form:
$$\mathbf{X} = \begin{bmatrix}
1 & \log(\text{income}_1) & \text{low}_1 & \text{med}_1 & \text{high}_1 \\
1 & \log(\text{income}_2) & \text{low}_2 & \text{med}_2 & \text{high}_2 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & \log(\text{income}_{2000}) & \text{low}_{2000} & \text{med}_{2000} & \text{high}_{2000}
\end{bmatrix}$$
The remaining calculations are all the same as before. Here is the model fit summary:
# form explanatory variable matrix
x_vars = pd.concat([regdata.log_income, density_encoded], axis = 1)
x = sm.tools.add_constant(x_vars)
y = regdata.log_cover
# fit model
mlr = sm.OLS(endog = y, exog = x)
rslt = mlr.fit()
# summarize fit
coef_tbl = pd.DataFrame({
'estimate': rslt.params,
'standard error': np.sqrt(rslt.cov_params().values.diagonal())
}, index = x.columns.values)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl
|                | estimate  | standard error |
|----------------|-----------|----------------|
| const          | -4.751870 | 0.621673       |
| log_income     | 0.631469  | 0.058290       |
| low            | -0.414486 | 0.070483       |
| medium         | -0.675626 | 0.079552       |
| high           | -0.606980 | 0.101708       |
| error variance | 1.552578  | NaN            |
Each level of the categorical variable has its own intercept:
$$\begin{aligned}
\text{very low density:} \quad & E\log(\text{cover}) = \beta_0 + \beta_1\log(\text{income}) \\
\text{low density:} \quad & E\log(\text{cover}) = (\beta_0 + \beta_2) + \beta_1\log(\text{income}) \\
\text{medium density:} \quad & E\log(\text{cover}) = (\beta_0 + \beta_3) + \beta_1\log(\text{income}) \\
\text{high density:} \quad & E\log(\text{cover}) = (\beta_0 + \beta_4) + \beta_1\log(\text{income})
\end{aligned}$$
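For example, the estimated intercepts can be computed directly from the fitted coefficients; a quick sketch using the rslt object from the fit above:

# estimated intercept for each density level
b = rslt.params
intercepts = pd.Series({
    'very low': b['const'],
    'low': b['const'] + b['low'],
    'medium': b['const'] + b['medium'],
    'high': b['const'] + b['high']
})
intercepts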
Prediction works the same way, but we need to supply the indicator encoding rather than the categorical variable level.
# new observation: income of 115,000 and medium population density (low = 0, medium = 1, high = 0)
x_new = np.array([1, np.log(115000), 0, 1, 0])
# predict on the log scale, then exponentiate to express cover on the original scale
pred = rslt.get_prediction(x_new)
np.exp(pred.summary_frame())
|   | mean     | mean_se  | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|----------|----------|---------------|---------------|--------------|--------------|
| 0 | 6.895105 | 1.102304 | 5.696153      | 8.346419      | 0.594349     | 79.990893    |
Check your understanding and fill in the blanks:
Comments on scope of inference
The data in this case study are from a random sample of census blocks in the San Diego urban area.
They are therefore representative of all census blocks in the San Diego urban area.