No class Monday 5/29 due to Memorial Day.
The standard measure of predictive accuracy in regression is mean square error (MSE):
\[ MSE(y, \hat{y}) = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2 \]
To avoid the optimistic bias that comes from evaluating a model on the same data used to fit it, it is common practice to partition the data into nonoverlapping subsets:
Partition the data:
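The partitioning code isn't reproduced here; a minimal sketch of this step (the DataFrame name data and the 70/30 split are assumptions, not necessarily the exact choices used in class):
# randomly split rows into nonoverlapping training and test partitions
train = data.sample(frac = 0.7, random_state = 52319)
test = data.drop(index = train.index)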
Fit to the training partition:
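A sketch of the fitting step, assuming the model regresses gap on log_income as in the prediction code below:
import statsmodels.api as sm
# explanatory variable and response from the training partition only
x_train = sm.tools.add_constant(train.log_income)
y_train = train.gap
# define the simple linear regression model object used below
slr = sm.OLS(endog = y_train, exog = x_train)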
Evaluate on the test/validation partition:
# compute predictions
x_test = sm.tools.add_constant(test.log_income)
preds = slr.fit().get_prediction(x_test)
y_hat = preds.predicted_mean
# mean square error
pred_errors = test.gap - y_hat
mse = (pred_errors**2).mean()
print('root mean square error: ', np.sqrt(mse))
root mean square error: 0.14377667235816582
Interpretation:
The model predictions vary about the observed values with a standard deviation of about 0.144, in the units of the response (SDs of the national average).
Compare this with the RMSE computed on the training partition using the fitted values:
rmse on training partition: 0.10843963485562962
rmse on test partition: 0.14377667235816582
Note also that this is simply the estimate of the error variance, rescaled by \(\frac{n - 2}{n}\).
Since the model is fit by minimizing this quantity on the training data, the in-sample error is optimistic; out-of-sample predictions are necessary to get an honest sense of predictive reliability.
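For reference, the training-partition RMSE reported above can be computed from the in-sample residuals; a minimal sketch using the slr model object:
# training-partition RMSE, computed from the in-sample residuals
rslt_train = slr.fit()
rmse_train = np.sqrt((rslt_train.resid**2).mean())
print('rmse on training partition: ', rmse_train)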
The simple linear regression model has just one explanatory variable:
\[ (\text{SLR}) \qquad y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad\begin{cases} i = 1, \dots, n \\\epsilon_i \sim N\left(0,\sigma^2\right)\end{cases} \]
The multiple linear regression model is a direct extension of the simple linear model to \(p - 1\) variables \(x_{i1}, \dots, x_{i, p - 1}\):
\[ (\text{MLR})\qquad y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p - 1} x_{i, p - 1} + \epsilon_i \qquad\begin{cases} \epsilon_i \sim N(0, \sigma^2) \\ i = 1, \dots, n\end{cases} \]
The model in matrix form is \(\mathbf{y} = \mathbf{X}\beta + \epsilon\), where:
\[ \mathbf{y}: \left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right] \; = \; \mathbf{X}: \left[\begin{array}{cccc} 1 &x_{11} &\cdots &x_{1, p - 1} \\ 1 &x_{21} &\cdots &x_{2, p - 1} \\ \vdots &\vdots &\ddots &\vdots \\ 1 &x_{n1} &\cdots &x_{n, p - 1} \end{array}\right] \; \times \; \beta: \left[\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p - 1} \end{array} \right] \; + \; \epsilon: \left[\begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{array}\right] \]
Carrying out the arithmetic on the right-hand side:
\[ \left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right]_{\;n \times 1} \quad = \quad \left[\begin{array}{c} \beta_0 + \beta_1 x_{11} + \cdots + \beta_{p - 1} x_{1, p - 1} + \epsilon_1 \\ \beta_0 + \beta_1 x_{21} + \cdots + \beta_{p - 1} x_{2, p - 1} + \epsilon_2 \\ \vdots \\ \beta_0 + \beta_1 x_{n1} + \cdots + \beta_{p - 1} x_{n, p - 1} + \epsilon_n \end{array}\right]_{\;n \times 1} \]
This is exactly the model relationship as written before, enumerated for each \(i\).
Let’s consider the model you fit in lab:
\[ (\text{fertility rate})_i = \beta_0 + \beta_1 (\text{education})_i + \beta_2 (\text{HDI})_i + \epsilon_i \]
Estimation and uncertainty quantification are exactly the same as in the simple linear model.
The least squares estimates are: \[ \hat{\beta} = \left[\begin{array}{c}\hat{\beta}_0 \\ \vdots \\ \hat{\beta}_{p - 1} \end{array}\right] = \left(\mathbf{X'X}\right)^{-1}\mathbf{X'y} \]
This is the unique minimizer (provided \(\mathbf{X}\) has full column rank) of the residual sum of squares:
\[ \hat{\beta} = \text{argmin}_{\beta \in \mathbb{R}^p}\left\{ (\mathbf{y} - \mathbf{X}\beta)'(\mathbf{y} - \mathbf{X}\beta)\right\} \]
An estimate of the error variance is: \[ \hat{\sigma}^2 = \frac{1}{n - p} \sum_{i = 1}^n \underbrace{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_{p - 1}x_{i, p - 1}\right)^2}_{i\text{th squared model residual}} \]
There are several alternative ways of writing the estimator \(\hat{\sigma}^2\).
The variances and covariances of the coefficient estimates are found in exactly the same way as before, only now yield a \(p \times p\) (instead of \(2\times 2\)) matrix:
\[ \mathbf{V} = \left[\begin{array}{cccc} \text{var}\hat{\beta}_0 &\text{cov}\left(\hat{\beta}_0, \hat{\beta}_1\right) &\cdots &\text{cov}\left(\hat{\beta}_0, \hat{\beta}_{p - 1}\right) \\ \text{cov}\left(\hat{\beta}_0, \hat{\beta}_1\right) &\text{var}\hat{\beta}_1 &\cdots &\text{cov}\left(\hat{\beta}_1, \hat{\beta}_{p - 1}\right) \\ \vdots &\vdots &\ddots &\vdots \\ \text{cov}\left(\hat{\beta}_0, \hat{\beta}_{p - 1}\right) &\text{cov}\left(\hat{\beta}_1, \hat{\beta}_{p - 1}\right) &\cdots &\text{var}\hat{\beta}_{p - 1} \end{array}\right] = \sigma^2 \left(\mathbf{X'X}\right)^{-1} \]
This matrix is again estimated by plugging in \(\color{red}{\hat{\sigma}^2}\) (the estimate) for \(\sigma^2\): \[\hat{\mathbf{V}} = \left[\begin{array}{cccc} \color{blue}{\hat{v}_{11}} & \hat{v}_{12} & \cdots & \hat{v}_{1p} \\ \hat{v}_{21} & \color{blue}{\hat{v}_{22}} & \cdots & \hat{v}_{2p} \\ \vdots &\vdots &\ddots &\vdots \\ \hat{v}_{p1} & \hat{v}_{p2} &\cdots &\color{blue}{\hat{v}_{pp}} \\ \end{array}\right] = \color{red}{\hat{\sigma}^2}\left(\mathbf{X'X}\right)^{-1} \]
The square roots of the diagonal elements give standard errors: \[ \text{SE}(\hat{\beta}_0) = \sqrt{\color{blue}{\hat{v}_{11}}} \;,\quad \text{SE}(\hat{\beta}_1) = \sqrt{\color{blue}{\hat{v}_{22}}} \;,\quad \cdots \quad \text{SE}(\hat{\beta}_{p - 1}) = \sqrt{\color{blue}{\hat{v}_{pp}}} \]
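These formulas translate directly to code. A minimal numpy sketch, assuming X is the \(n \times p\) explanatory variable matrix (including the column of ones) and y is the response vector:
import numpy as np
# least squares estimates: solve the normal equations (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# error variance estimate: sum of squared residuals divided by n - p
n, p = X.shape
resid = y - X @ beta_hat
sigma2_hat = (resid**2).sum() / (n - p)
# estimated variance-covariance matrix and standard errors
V_hat = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(V_hat))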
Just as before, a table is helpful:
| | estimate | standard error |
|---|---|---|
| const | 7.960371 | 0.244939 |
| educ_expected_yrs_f | -0.201996 | 0.042194 |
| hdi | -4.132623 | 0.680155 |
| error variance | 0.346909 | NaN |
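The code that produced this table isn't shown above; here is a sketch of how the model could be fit, assuming the data are in a DataFrame named fertility with the columns used in the plotting code later (educ_expected_yrs_f, hdi, fertility_total):
import numpy as np
import pandas as pd
import statsmodels.api as sm
# explanatory matrix (intercept, education, HDI) and response
x = sm.tools.add_constant(fertility.loc[:, ['educ_expected_yrs_f', 'hdi']])
y = fertility.fertility_total
# fit the multiple regression model
mlr = sm.OLS(endog = y, exog = x)
rslt = mlr.fit()
# coefficient estimates, standard errors, and error variance estimate
coef_tbl = pd.DataFrame({
    'estimate': rslt.params,
    'standard error': np.sqrt(np.diag(rslt.cov_params()))
})
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl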
The \(R^2\) statistic – proportion of variance explained – is computed the same as before, where \(\hat{\sigma}^2_\text{raw}\) denotes the sample variance of the response: \[ R^2 = \frac{\hat{\sigma}_\text{raw}^2 - \frac{n - p}{n - 1}\hat{\sigma}^2}{\hat{\sigma}_\text{raw}^2} \]
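As a check, the same quantity can be retrieved from (or recomputed with) the fitted results object; a sketch using rslt, x, and y from the fit sketched above:
# proportion of variance explained, directly from the fitted results
print(rslt.rsquared)
# equivalent computation from the variance estimates
n, p = x.shape
sigma2_raw = y.var()    # sample variance of the response (denominator n - 1)
r2 = (sigma2_raw - (n - p)/(n - 1)*rslt.scale) / sigma2_raw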
While the computations are essentially unchanged, the presence of multiple predictors alters the interpretation of the parameters.
The estimated change from incrementing, say, HDI by one unit assumes the other variable (education) is held constant: \[ \mathbb{E}(\text{fertility}) + \beta_2 = \beta_0 + \beta_1 (\text{education}) + \beta_2 \left[(\text{HDI}) + 1\right] \]
So now the interpretation is: a 0.1 increase in HDI is associated with an estimated decrease in fertility rate of 0.413, after accounting for education.
Confidence intervals are the same as before. A 95% interval is:
\[ \hat{\beta}_j \pm 2 SE(\hat{\beta}_j) \]
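In statsmodels, these intervals are available from the fitted results (they are based on the t distribution, so the multiplier is approximately but not exactly 2); a sketch using rslt from above:
# 95% confidence intervals for each coefficient
ci = rslt.conf_int(alpha = 0.05)
ci.columns = ['lwr', 'upr']
ci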
Ditto predictions.
Visualization gets a bit trickier because of the presence of that second variable. Here I plotted the marginal relationship with HDI for three quantiles of education.
# set coordinates for grid mesh
educ_gridpts = fertility.educ_expected_yrs_f.quantile([0.2, 0.5, 0.8]).values
hdi_gridpts = np.linspace(fertility.hdi.min(), fertility.hdi.max(), num = 100)
# create prediction grid
g1, g2 = np.meshgrid(educ_gridpts, hdi_gridpts)
grid = np.array([g1.ravel(), g2.ravel()]).T
grid_df = pd.DataFrame(grid, columns = ['educ', 'hdi'])
# format for input to get_prediction()
x_pred = sm.tools.add_constant(grid_df)
# compute predictions
pred_df = mlr.fit().get_prediction(x_pred).summary_frame()
# append to grid
pred_df = pd.concat([grid_df, pred_df], axis = 1)
# plot
scatter = alt.Chart(fertility).mark_circle(opacity = 0.5).encode(
x = alt.X('hdi',
scale = alt.Scale(zero = False),
title = 'Human development index'),
y = alt.Y('fertility_total',
title = 'Fertility rate'),
color = alt.Color('educ_expected_yrs_f',
title = 'Education')
)
model = alt.Chart(pred_df).mark_line().encode(
x = 'hdi',
y = 'mean',
color = 'educ'
)
(scatter + model).configure_axis(
labelFontSize = 14,
titleFontSize = 16
).configure_legend(
labelFontSize = 14,
titleFontSize = 16
)
Tree canopy is thought to reduce summer temperatures in urban areas. Consider tree cover and a few other variables measured on a random sample of roughly 2,000 census blocks with some canopy cover in San Diego:
| | census_block_GEOID | tree_cover | mean_summer_temp | mean_income | pop_density |
|---|---|---|---|---|---|
| 20504 | 6.073020e+13 | 25.798277 | 33.859524 | 66738 | very low |
| 20849 | 6.073000e+13 | 11.439114 | 32.150000 | 63189 | very low |
| 10621 | 6.073020e+13 | 26.661660 | 32.404167 | 35099 | high |
McDonald RI, Biswas T, Sachar C, Housman I, Boucher TM, Balk D, et al. (2021) The tree cover and temperature disparity in US urbanized areas: Quantifying the association with income across 5,723 communities. PLoS ONE 16(4). doi:10.1371/journal.pone.0249715.
Is higher tree cover associated with lower summer temperatures?
We can use SLR to quantify the association:
# explanatory and response variables
x = sm.tools.add_constant(trees.tree_cover)
y = trees.mean_summer_temp
# fit temp ~ cover
slr = sm.OLS(endog = y, exog = x)
slr.fit().params
const 34.994336
tree_cover -0.073114
dtype: float64
Each 13.7% increase in tree canopy cover is associated with an estimated 1-degree decrease in mean summer surface temperatures.
(How did I get 13.7%?)
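The figure comes from inverting the slope estimate: the increase in cover associated with a one-degree change in temperature is
\[ \frac{1}{|\hat{\beta}_1|} = \frac{1}{0.073114} \approx 13.7 \]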
Confidence intervals:
| | lwr | upr |
|---|---|---|
| const | 34.876501 | 35.112170 |
| tree_cover | -0.083032 | -0.063195 |
Interpretation: with 95% confidence, a 10% increase in tree cover is associated with an estimated decrease in mean summer surface temperatures of between 0.6 and 0.8 degrees Celsius.
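This follows from scaling the interval for the slope by 10:
\[ 10 \times (0.063195, \; 0.083032) \approx (0.63, \; 0.83) \text{ degrees} \]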
Is tree cover associated with affluence and population density?
In other words, we'll model tree_cover as a function of mean_income and pop_density.
The population density is recorded as a categorical variable:
| | pop_density | log_cover | log_income |
|---|---|---|---|
| 20504 | very low | 3.250308 | 11.108530 |
| 20849 | very low | 2.437039 | 11.053886 |
| 10621 | high | 3.283227 | 10.465928 |
| 9031 | low | 2.311344 | 10.309253 |
On face value, it might seem we could simply fit this model: \[ \underbrace{\log(\text{cover}_i)}_{y_i} = \beta_0 + \beta_1 \underbrace{\log(\text{income}_i)}_{x_{i1}} + \beta_2 \underbrace{\text{density}_i}_{x_{i2}} + \epsilon_i \qquad i = 1, \dots, 2000 \]
But this doesn’t quite make sense, because the values of \(\text{density}_i\) would be words!
The solution to this issue is to encode each level of the categorical variable separately using a set of indicator variables.
For instance, to indicate whether a census block is of low population density, we can use the indicator:
\[ I(\text{density $=$ low}) = \begin{cases} 1 \text{ if population density is low} \\ 0 \text{ otherwise}\end{cases} \]
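In pandas, such an indicator can be computed from a boolean comparison; a quick sketch, assuming the data frame regdata used in the fitting code below:
# indicator for low population density: 1 if 'low', 0 otherwise
low = (regdata.pop_density == 'low').astype(int)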
We can encode the levels of pop_density using a collection of indicators:
| | pop_density | low | medium | high |
|---|---|---|---|---|
| 20504 | very low | 0 | 0 | 0 |
| 20849 | very low | 0 | 0 | 0 |
| 10621 | high | 0 | 0 | 1 |
| 9031 | low | 1 | 0 | 0 |
| 1004 | high | 0 | 0 | 1 |
| 15939 | low | 1 | 0 | 0 |
| 6647 | medium | 0 | 1 | 0 |
| 17963 | medium | 0 | 1 | 0 |
The model with the encoded population density variable is:
\[ \underbrace{\log(\text{cover}_i)}_{y_i} = \beta_0 + \beta_1 \underbrace{\log(\text{income}_i)}_{x_{i1}} + \beta_2 \underbrace{\text{low}_i}_{x_{i2}} + \beta_3 \underbrace{\text{med}_i}_{x_{i3}} + \beta_4 \underbrace{\text{high}_i}_{x_{i4}} + \epsilon_i \]
The effect is that the model has different intercepts for each population density.
\[ \begin{align*} \text{density $=$ very low} &\quad\Longrightarrow\quad \mathbb{E}\log(\text{cover}_i) = \underbrace{\beta_0}_\text{intercept} + \beta_1\log(\text{income}_i) \\ \text{density $=$ low} &\quad\Longrightarrow\quad \mathbb{E}\log(\text{cover}_i) = \underbrace{\beta_0 + \beta_2}_\text{intercept} + \beta_1\log(\text{income}_i) \\ \text{density $=$ medium} &\quad\Longrightarrow\quad \mathbb{E}\log(\text{cover}_i) = \underbrace{\beta_0 + \beta_3}_\text{intercept} + \beta_1\log(\text{income}_i) \\ \text{density $=$ high} &\quad\Longrightarrow\quad \mathbb{E}\log(\text{cover}_i) = \underbrace{\beta_0 + \beta_4}_\text{intercept} + \beta_1\log(\text{income}_i) \end{align*} \]
The explanatory variable matrix \(\mathbf{X}\) for this model will be of the form:
\[ \mathbf{X} = \left[\begin{array}{c:cccc} 1 &\log(\text{income}_{1}) &\text{low}_1 &\text{med}_1 &\text{high}_1 \\ 1 &\log(\text{income}_{2}) &\text{low}_2 &\text{med}_2 &\text{high}_2 \\ \vdots &\vdots &\vdots &\vdots &\vdots \\ 1 &\log(\text{income}_{2000}) &\text{low}_{2000} &\text{med}_{2000} &\text{high}_{2000} \end{array}\right] \]
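The indicator matrix density_encoded used in the fitting code below isn't constructed in these notes; one possible way to build it with pandas (keeping low/medium/high so that 'very low' serves as the reference level) is:
# one indicator column per retained level of pop_density
density_encoded = (
    pd.get_dummies(regdata.pop_density)
    .loc[:, ['low', 'medium', 'high']]
    .astype(int)
)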
The remaining calculations are all the same as before. Here is the model fit summary:
# form explanatory variable matrix
x_vars = pd.concat([regdata.log_income, density_encoded], axis = 1)
x = sm.tools.add_constant(x_vars)
y = regdata.log_cover
# fit model
mlr = sm.OLS(endog = y, exog = x)
rslt = mlr.fit()
# summarize fit
coef_tbl = pd.DataFrame({
'estimate': rslt.params,
'standard error': np.sqrt(rslt.cov_params().values.diagonal())
}, index = x.columns.values)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl
| | estimate | standard error |
|---|---|---|
| const | -4.751870 | 0.621673 |
| log_income | 0.631469 | 0.058290 |
| low | -0.414486 | 0.070483 |
| medium | -0.675626 | 0.079552 |
| high | -0.606980 | 0.101708 |
| error variance | 1.552578 | NaN |
Each level of the categorical variable has its own intercept:
\[ \begin{align*} \text{very low density:}\quad &\mathbb{E}\log(\text{cover}) = \beta_0 + \beta_1 \log(\text{income}) \\ \text{low density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_2) + \beta_1 \log(\text{income}) \\ \text{medium density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_3) + \beta_1 \log(\text{income}) \\ \text{high density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_4) + \beta_1 \log(\text{income}) \end{align*} \]
Prediction works the same way, but we need to supply the indicator encoding rather than the categorical variable level.
# inputs: intercept, log(income) for $115,000, medium density (low = 0, medium = 1, high = 0)
x_new = np.array([1, np.log(115000), 0, 1, 0])
# predict on the log scale, then exponentiate to the percent-cover scale
pred = rslt.get_prediction(x_new)
np.exp(pred.summary_frame())
| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 6.895105 | 1.102304 | 5.696153 | 8.346419 | 0.594349 | 79.990893 |
Check your understanding: what income level and population density does x_new encode, and how would you interpret the interval estimates above?
Comments on scope of inference
The data in this case study are from a random sample of census blocks in the San Diego urban area.
They are therefore representative of census blocks with some canopy cover in the San Diego urban area.