Multiple regression

PSTAT100 Spring 2023


Announcements

No class Monday 5/29 due to Memorial Day.

  • Assignments due Tuesday instead.
  • Late deadline moved to Thursday.
  • Office hour on Wednesday 5/31 is cancelled.

Loose ends

The standard measure of predictive accuracy in regression is mean square error (MSE):

\[ MSE(y, \hat{y}) = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2 \]

  • estimates expected squared error, \(\mathbb{E}(y - \hat{y})^2\)
  • biased (underestimate on average) if fitted values are used
  • unbiased if new observations are used

To avoid bias, it is common practice to partition data into nonoverlapping subsets:

  • one used to fit the model (‘training’ partition)
  • another used to evaluate predictions (‘testing’ or ‘validation’ partition)

Measuring predictive accuracy

Partition the data:

# hold out 100 randomly selected rows
np.random.seed(51823)
idx = np.random.choice(regdata.index.values, size = 100, replace = False).tolist()
test = regdata.loc[idx]
train = regdata.drop(index = idx)

Fit to the training partition:

# fit model to training subset
x_train = sm.tools.add_constant(train.log_income)
y_train = train.gap
slr = sm.OLS(endog = y_train, exog = x_train)

Evaluate on the test/validation partition:

# compute predictions
x_test = sm.tools.add_constant(test.log_income)
preds = slr.fit().get_prediction(x_test)
y_hat = preds.predicted_mean

# mean square error
pred_errors = test.gap - y_hat
mse = (pred_errors**2).mean()
print('root mean square error: ', np.sqrt(mse))
root mean square error:  0.14377667235816582

Interpreting MSE

Interpretation:

The model predictions vary about the observed values with a standard deviation of about 0.144 (the gap is measured in SDs of the national average).

(Don’t use) training MSE

Compare with MSE computed using fitted values:

Code
# note, these are just model residuals
fit_errors = train.gap - slr.fit().fittedvalues

# training rmse
print('rmse on training partition: ',np.sqrt((fit_errors**2).mean()))
print('rmse on test partition: ', np.sqrt(mse))
rmse on training partition:  0.10843963485562962
rmse on test partition:  0.14377667235816582
  • training MSE is overly ‘optimistic’ – smaller than the proper estimate
  • won’t always be the case, but will be an underestimate on average across samples

Note also that this is simply the estimate of the error variance, rescaled by \(\frac{n - 2}{n}\).

n, p = train.shape
np.sqrt(slr.fit().scale*(n - 2)/n)
0.10843963485562962

Since the model is fit by minimizing exactly this quantity on the training data, out-of-sample predictions are needed to get an honest sense of predictive reliability.

Multiple regression

The simple linear regression model has just one explanatory variable:

\[ (\text{SLR}) \qquad y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad\begin{cases} i = 1, \dots, n \\\epsilon_i \sim N\left(0,\sigma^2\right)\end{cases} \]

The multiple linear regression model is a direct extension of the simple linear model to \(p - 1\) variables \(x_{i1}, \dots, x_{i, p - 1}\):

\[ (\text{MLR})\qquad y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p - 1} x_{i, p - 1} + \epsilon_i \qquad\begin{cases} \epsilon_i \sim N(0, \sigma^2) \\ i = 1, \dots, n\end{cases} \]

  • \(p\) is the number of (mean) parameters
  • \(p = 2\) is SLR
  • \(p > 2\) is MLR
  • What’s \(p = 1\)??
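One answer, as a minimal sketch: \(p = 1\) corresponds to the intercept-only model \(y_i = \beta_0 + \epsilon_i\). The snippet below is only an illustration; it assumes numpy as np, statsmodels.api as sm, and the fertility data introduced in the example that follows.

# p = 1: intercept-only model (sketch; assumes the fertility data from the example below)
x0 = np.ones(len(fertility))                              # 'design matrix' is just a column of ones
fit0 = sm.OLS(endog = fertility.fertility_total, exog = x0).fit()
fit0.params                                               # single estimate: the sample mean of the response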

The model in matrix form

The model in matrix form is \(\mathbf{y} = \mathbf{X}\beta + \epsilon\), where:

\[ \mathbf{y}: \left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right] \; = \; \mathbf{X}: \left[\begin{array}{cccc} 1 &x_{11} &\cdots &x_{1, p - 1} \\ 1 &x_{21} &\cdots &x_{2, p - 1} \\ \vdots &\vdots &\ddots &\vdots \\ 1 &x_{n1} &\cdots &x_{n, p - 1} \end{array}\right] \; \times \; \beta: \left[\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p - 1} \end{array} \right] \; + \; \epsilon: \left[\begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{array}\right] \]

Carrying out the arithmetic on the right-hand side:

\[ \left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right]_{\;n \times 1} \quad = \quad \left[\begin{array}{c} \beta_0 + \beta_1 x_{11} + \cdots + \beta_{p - 1} x_{1, p - 1} + \epsilon_1 \\ \beta_0 + \beta_1 x_{21} + \cdots + \beta_{p - 1} x_{2, p - 1} + \epsilon_2 \\ \vdots \\ \beta_0 + \beta_1 x_{n1} + \cdots + \beta_{p - 1} x_{n, p - 1} + \epsilon_n \end{array}\right]_{\;n \times 1} \]

This is exactly the model relationship as written before, enumerated for each \(i\).
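To make the matrix form concrete, here is a tiny numerical sketch with hypothetical values (assuming numpy imported as np) showing that the product \(\mathbf{X}\beta + \epsilon\) reproduces the row-by-row equations:

# tiny numerical illustration of y = X beta + eps (hypothetical values)
rng = np.random.default_rng(100)
n, p = 5, 3
X = np.column_stack([np.ones(n), rng.normal(size = (n, p - 1))])  # constant column + p - 1 variables
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(scale = 0.1, size = n)
y = X @ beta + eps      # row i equals beta_0 + beta_1*x_i1 + beta_2*x_i2 + eps_i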

Example

Let’s consider the model you fit in lab:

\[ (\text{fertility rate})_i = \beta_0 + \beta_1 (\text{HDI})_i + \beta_2 (\text{education})_i + \epsilon_i \]

x_vars = fertility.loc[:, ['educ_expected_yrs_f', 'hdi']]
x = sm.tools.add_constant(x_vars)
y = fertility.fertility_total

mlr = sm.OLS(endog = y, exog = x)

Model estimation

Estimation and uncertainty quantification are exactly the same as in the simple linear model.

The least squares estimates are: \[ \hat{\beta} = \left[\begin{array}{c}\hat{\beta}_0 \\ \vdots \\ \hat{\beta}_{p - 1} \end{array}\right] = \left(\mathbf{X'X}\right)^{-1}\mathbf{X'y} \]

# retrieve coefficient estimates
mlr.fit().params
const                  7.960371
educ_expected_yrs_f   -0.201996
hdi                   -4.132623
dtype: float64

This is the unique minimizer of the residual sum of squares:

\[ \hat{\beta} = \text{argmin}_{\beta \in \mathbb{R}^p}\left\{ (\mathbf{y} - \mathbf{X}\beta)'(\mathbf{y} - \mathbf{X}\beta)\right\} \]
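As a check on the closed-form expression (not part of the original lab), the estimate can be computed directly with numpy and compared to the statsmodels output; this sketch assumes the x and y defined in the example above:

# closed-form least squares: solve (X'X) beta = X'y
X = x.values
beta_hat = np.linalg.solve(X.T @ X, X.T @ y.values)
beta_hat          # should agree with mlr.fit().params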

Model estimation

An estimate of the error variance is: \[ \hat{\sigma}^2 = \frac{1}{n - p} \sum_{i = 1}^n \underbrace{\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_{p - 1}x_{i, p - 1}\right)^2}_{i\text{th squared model residual}} \]

# retrieve error variance estimate
mlr.fit().scale
0.34690893559915764

There are several alternative ways of writing the estimator \(\hat{\sigma}^2\).

  • \(\frac{1}{n - p} \sum_i (y_i - \hat{y}_i)^2\)
  • \(\frac{1}{n - p} \sum_i e_i^2\)
  • \(\frac{1}{n - p}\left(\mathbf{y} - \mathbf{X}\hat{\beta}\right)'\left(\mathbf{y} - \mathbf{X}\hat{\beta}\right)\)
  • \(\frac{RSS}{n - p}\)
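These forms can be verified numerically; a quick sketch using the fitted model above (assuming x, y, and mlr are still in scope):

# check: RSS / (n - p) matches mlr.fit().scale
n, p = x.shape
resid = y - mlr.fit().fittedvalues
(resid ** 2).sum() / (n - p)      # error variance estimate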

Variability of estimates

The variances and covariances of the coefficient estimates are found in exactly the same way as before, only now they form a \(p \times p\) (instead of \(2\times 2\)) matrix:

\[ \mathbf{V} = \left[\begin{array}{cccc} \text{var}\hat{\beta}_0 &\text{cov}\left(\hat{\beta}_0, \hat{\beta}_1\right) &\cdots &\text{cov}\left(\hat{\beta}_0, \hat{\beta}_{p - 1}\right) \\ \text{cov}\left(\hat{\beta}_0, \hat{\beta}_1\right) &\text{var}\hat{\beta}_1 &\cdots &\text{cov}\left(\hat{\beta}_1, \hat{\beta}_{p - 1}\right) \\ \vdots &\vdots &\ddots &\vdots \\ \text{cov}\left(\hat{\beta}_0, \hat{\beta}_{p - 1}\right) &\text{cov}\left(\hat{\beta}_1, \hat{\beta}_{p - 1}\right) &\cdots &\text{var}\hat{\beta}_{p - 1} \end{array}\right] = \sigma^2 \left(\mathbf{X'X}\right)^{-1} \]

Variability of estimates

This matrix is again estimated by plugging in \(\color{red}{\hat{\sigma}^2}\) (the estimate) for \(\sigma^2\): \[\hat{\mathbf{V}} = \left[\begin{array}{cccc} \color{blue}{\hat{v}_{11}} & \hat{v}_{12} & \cdots & \hat{v}_{1p} \\ \hat{v}_{21} & \color{blue}{\hat{v}_{22}} & \cdots & \hat{v}_{2p} \\ \vdots &\vdots &\ddots &\vdots \\ \hat{v}_{p1} & \hat{v}_{p2} &\cdots &\color{blue}{\hat{v}_{pp}} \\ \end{array}\right] = \color{red}{\hat{\sigma}^2}\left(\mathbf{X'X}\right)^{-1} \]

# retrieve parameter variance-covariance estimate
mlr.fit().cov_params()
                        const  educ_expected_yrs_f       hdi
const                0.059995            -0.001839 -0.050256
educ_expected_yrs_f -0.001839             0.001780 -0.025240
hdi                 -0.050256            -0.025240  0.462611
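The plug-in formula can also be checked directly; a sketch, assuming x and mlr from above:

# check: sigma^2_hat * (X'X)^{-1} matches cov_params()
XtX_inv = np.linalg.inv(x.values.T @ x.values)
mlr.fit().scale * XtX_inv         # should equal mlr.fit().cov_params().values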

Standard errors

The square roots of the diagonal elements give standard errors: \[ \text{SE}(\hat{\beta}_0) = \sqrt{\color{blue}{\hat{v}_{11}}} \;,\quad \text{SE}(\hat{\beta}_1) = \sqrt{\color{blue}{\hat{v}_{22}}} \;,\quad \cdots \quad \text{SE}(\hat{\beta}_{p - 1}) = \sqrt{\color{blue}{\hat{v}_{pp}}} \]

np.sqrt(mlr.fit().cov_params().values.diagonal())
array([0.24493942, 0.042194  , 0.68015545])

Model fit summary

Just as before, a table is helpful:

Code
rslt = mlr.fit()
coef_tbl = pd.DataFrame({'estimate': rslt.params.values,
              'standard error': np.sqrt(rslt.cov_params().values.diagonal())},
              index = x.columns)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale

coef_tbl
                     estimate  standard error
const                7.960371        0.244939
educ_expected_yrs_f -0.201996        0.042194
hdi                 -4.132623        0.680155
error variance       0.346909             NaN

Variance explained

The \(R^2\) statistic – proportion of variance explained – is computed the same as before: \[ \frac{\hat{\sigma}_\text{raw}^2 - \frac{n - p}{n - 1}\hat{\sigma}^2}{\hat{\sigma}_\text{raw}^2} \]

rslt.rsquared
0.7827796420988886
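As a sanity check on the formula, \(R^2\) can be reproduced from the two variance estimates; a sketch assuming \(\hat{\sigma}_\text{raw}^2\) is the usual sample variance (with an \(n - 1\) denominator) and that x, y, and rslt from the fit above are in scope:

# R^2 from the variance decomposition
n, p = x.shape
raw_var = y.var()                                  # sample variance of the response (divides by n - 1)
1 - (n - p) / (n - 1) * rslt.scale / raw_var       # should match rslt.rsquared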

Interpretation

While the computations are essentially unchanged, the presence of multiple predictors alters the interpretation of the parameters.

Incrementing, say, HDI by one unit changes the estimated mean response by \(\beta_2\) only if the other variable is held constant: \[ \mathbb{E}(\text{fertility}) + \beta_2 = \beta_0 + \beta_1 (\text{education}) + \beta_2 \left[(\text{HDI}) + 1\right] \]

  • if education also changes, then the estimated change in fertility is no longer \(\beta_2\)

So now the interpretation is: a 0.1 increase in HDI is associated with an estimated decrease in fertility rate of 0.413, after accounting for education.

Inference

Confidence intervals are the same as before. A 95% interval is:

\[ \hat{\beta}_j \pm 2 SE(\hat{\beta}_j) \]

mlr.fit().conf_int().rename(columns = {0: 'lwr', 1: 'upr'})
                          lwr       upr
const                7.475988  8.444753
educ_expected_yrs_f -0.285437 -0.118554
hdi                 -5.477671 -2.787574
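The \(\pm 2\,SE\) rule can be reproduced by hand; a sketch assuming rslt from the fit summary above (conf_int() uses the exact t quantile rather than 2, so the endpoints differ slightly):

# hand-computed approximate 95% intervals
est = rslt.params
se = np.sqrt(np.diag(rslt.cov_params().values))
pd.DataFrame({'lwr': est - 2 * se, 'upr': est + 2 * se})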

Prediction

Ditto predictions.

  • \(\hat{y} = \mathbf{x}'\hat{\beta}\)
  • 95% interval for the mean: \(\widehat{\mathbb{E}y} \pm 2 SE\left(\widehat{\mathbb{E}y}\right)\)
  • 95% interval for a predicted observation: \(\hat{y} \pm 2SE(\hat{y})\)
x_new = np.array([1, 10, 0.5])
mlr.fit().get_prediction(x_new).summary_frame()
       mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  3.874104  0.119361        3.63806       4.110147      2.685664      5.062544

Model visualizations

Visualization gets a bit trickier because of the presence of that second variable. Here I plotted the marginal relationship with HDI for three quantiles of education.

Code
# set coordinates for grid mesh
educ_gridpts = fertility.educ_expected_yrs_f.quantile([0.2, 0.5, 0.8]).values
hdi_gridpts = np.linspace(fertility.hdi.min(), fertility.hdi.max(), num = 100)

# create prediction grid
g1, g2 = np.meshgrid(educ_gridpts, hdi_gridpts)
grid = np.array([g1.ravel(), g2.ravel()]).T
grid_df = pd.DataFrame(grid, columns = ['educ', 'hdi'])

# format for input to get_prediction()
x_pred = sm.tools.add_constant(grid_df)

# compute predictions
pred_df = mlr.fit().get_prediction(x_pred).summary_frame()

# append to grid
pred_df = pd.concat([grid_df, pred_df], axis = 1)

# plot
scatter = alt.Chart(fertility).mark_circle(opacity = 0.5).encode(
    x = alt.X('hdi', 
        scale = alt.Scale(zero = False),
        title = 'Human development index'),
    y = alt.Y('fertility_total',
        title = 'Fertility rate'),
    color = alt.Color('educ_expected_yrs_f',
        title = 'Education')
)

model = alt.Chart(pred_df).mark_line().encode(
    x = 'hdi',
    y = 'mean',
    color = 'educ'
)

(scatter + model).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)

Urban tree cover data

Tree canopy is thought to reduce summer temperatures in urban areas. Consider tree cover and a few other variables measured on a random sample of ~2K census blocks with some canopy cover in San Diego:

Code
trees.head(3)
       census_block_GEOID  tree_cover  mean_summer_temp  mean_income pop_density
20504        6.073020e+13   25.798277         33.859524        66738    very low
20849        6.073000e+13   11.439114         32.150000        63189    very low
10621        6.073020e+13   26.661660         32.404167        35099        high

McDonald RI, Biswas T, Sachar C, Housman I, Boucher TM, Balk D, et al. (2021) The tree cover and temperature disparity in US urbanized areas: Quantifying the association with income across 5,723 communities. PLoS ONE 16(4). doi:10.1371/journal.pone.0249715.

Simple regression analysis

Is higher tree cover associated with lower summer temperatures?

Code
alt.Chart(trees).mark_circle(opacity = 0.3).encode(
    x = alt.X('tree_cover', 
        title = 'Tree canopy cover (percent)'),
    y = alt.Y('mean_summer_temp', 
        scale = alt.Scale(zero = False),
        title = 'Average summer temperature (C)')
).configure_axis(
    titleFontSize = 16,
    labelFontSize = 14
)
  • weak negative association
  • linear approximation seems reasonable
  • lots of census blocks with low (<10%) cover

Simple regression analysis

We can use SLR to quantify the association:

# explanatory and response variables
x = sm.tools.add_constant(trees.tree_cover)
y = trees.mean_summer_temp

# fit temp ~ cover
slr = sm.OLS(endog = y, exog = x)
slr.fit().params
const         34.994336
tree_cover    -0.073114
dtype: float64

Each 13.7% increase in tree canopy cover is associated with an estimated 1-degree decrease in mean summer surface temperatures.

(How did I get 13.7%?)
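One way to get that number (a sketch): the slope is in degrees per percentage point of canopy cover, so the cover change associated with a one-degree change is its reciprocal.

# canopy cover change associated with a 1-degree temperature change
1 / abs(slr.fit().params['tree_cover'])    # approximately 13.7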

Simple regression analysis

Confidence intervals:

Code
slr.fit().conf_int().rename(columns = {0: 'lwr', 1: 'upr'})
                  lwr        upr
const       34.876501  35.112170
tree_cover  -0.083032  -0.063195

Interpretation: with 95% confidence, a 10% increase in tree cover is associated with an estimated decrease in mean summer surface temperatures between 0.6 and 0.8 degrees Celsius.

Considering more variables

Is tree cover associated with affluence and population density?

Code
alt.Chart(trees).mark_circle(opacity = 0.3).encode(
    x = alt.X('mean_income', 
        scale = alt.Scale(type = 'log'),
        title = 'Mean income (USD)'),
    y = alt.Y('tree_cover', 
        scale = alt.Scale(type = 'log'),
        title = 'Tree cover (percent)')
).configure_axis(
    titleFontSize = 16,
    labelFontSize = 14
)
  • maybe with income, not clear about density
  • to answer exactly, want to model tree_cover as a function of mean_income and pop_density

Handling categorical variables

The population density is recorded as a categorical variable:

Code
regdata = trees.loc[:, ['tree_cover', 'mean_income', 'pop_density']]
regdata['log_cover'] = np.log(regdata.tree_cover)
regdata['log_income'] = np.log(regdata.mean_income)
regdata.drop(columns = {'tree_cover', 'mean_income'}, inplace = True)
regdata.head(4)
      pop_density  log_cover  log_income
20504    very low   3.250308   11.108530
20849    very low   2.437039   11.053886
10621        high   3.283227   10.465928
9031          low   2.311344   10.309253

On face value, it might seem we could simply fit this model: \[ \underbrace{\log(\text{cover}_i)}_{y_i} = \beta_0 + \beta_1 \underbrace{\log(\text{income}_i)}_{x_{i1}} + \beta_2 \underbrace{\text{density}_i}_{x_{i2}} + \epsilon_i \qquad i = 1, \dots, 2000 \]

But this doesn’t quite make sense, because the values of \(\text{density}_i\) would be words!

Indicator variable encoding

The solution to this issue is to encode each level of the categorical variable separately using a set of indicator variables.

For instance, to indicate whether a census block is of low population density, we can use the indicator:

\[ I(\text{density $=$ low}) = \begin{cases} 1 \text{ if population density is low} \\ 0 \text{ otherwise}\end{cases} \]

We can encode the levels of pop_density using a collection of indicators:

Code
# create indicator variables
density_encoded = pd.get_dummies(regdata.pop_density, drop_first = True)

# display example values corresponding to categorical variable
pd.concat([regdata[['pop_density']], density_encoded], axis = 1).groupby('pop_density').head(2)
      pop_density  low  medium  high
20504    very low    0       0     0
20849    very low    0       0     0
10621        high    0       0     1
9031          low    1       0     0
1004         high    0       0     1
15939         low    1       0     0
6647       medium    0       1     0
17963      medium    0       1     0

The MLR model with indicators

The model with the encoded population density variable is:

\[ \underbrace{\log(\text{cover}_i)}_{y_i} = \beta_0 + \beta_1 \underbrace{\log(\text{income}_i)}_{x_{i1}} + \beta_2 \underbrace{\text{low}_i}_{x_{i2}} + \beta_3 \underbrace{\text{med}_i}_{x_{i3}} + \beta_4 \underbrace{\text{high}_i}_{x_{i4}} + \epsilon_i \]

The effect is that the model has different intercepts for each population density.

\[ \begin{align*} \text{density $=$ very low} &\quad\Longrightarrow\quad \mathbb{E}\log(\text{cover}_i) = \underbrace{\beta_0}_\text{intercept} + \beta_1\log(\text{income}_i) \\ \text{density $=$ low} &\quad\Longrightarrow\quad \mathbb{E}\log(\text{cover}_i) = \underbrace{\beta_0 + \beta_2}_\text{intercept} + \beta_1\log(\text{income}_i) \\ \text{density $=$ medium} &\quad\Longrightarrow\quad \mathbb{E}\log(\text{cover}_i) = \underbrace{\beta_0 + \beta_3}_\text{intercept} + \beta_1\log(\text{income}_i) \\ \text{density $=$ high} &\quad\Longrightarrow\quad \mathbb{E}\log(\text{cover}_i) = \underbrace{\beta_0 + \beta_4}_\text{intercept} + \beta_1\log(\text{income}_i) \end{align*} \]

In matrix form

The explanatory variable matrix \(\mathbf{X}\) for this model will be of the form:

\[ \mathbf{X} = \left[\begin{array}{c:cccc} 1 &\log(\text{income}_{1}) &\text{low}_1 &\text{med}_1 &\text{high}_1 \\ 1 &\log(\text{income}_{2}) &\text{low}_2 &\text{med}_2 &\text{high}_2 \\ \vdots &\vdots &\vdots &\vdots &\vdots \\ 1 &\log(\text{income}_{2000}) &\text{low}_{2000} &\text{med}_{2000} &\text{high}_{2000} \end{array}\right] \]

Code
# form explanatory variable matrix
x_vars = pd.concat([regdata.log_income, density_encoded], axis = 1)
x = sm.tools.add_constant(x_vars)
x.head(4)
       const  log_income  low  medium  high
20504    1.0   11.108530    0       0     0
20849    1.0   11.053886    0       0     0
10621    1.0   10.465928    0       0     1
9031     1.0   10.309253    1       0     0

Fit summary

The remaining calculations are all the same as before. Here is the model fit summary:

Code
# form explanatory variable matrix
x_vars = pd.concat([regdata.log_income, density_encoded], axis = 1)
x = sm.tools.add_constant(x_vars)
y = regdata.log_cover

# fit model
mlr = sm.OLS(endog = y, exog = x)
rslt = mlr.fit()

# summarize fit
coef_tbl = pd.DataFrame({
    'estimate': rslt.params,
    'standard error': np.sqrt(rslt.cov_params().values.diagonal())
}, index = x.columns.values)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale

coef_tbl
                estimate  standard error
const          -4.751870        0.621673
log_income      0.631469        0.058290
low            -0.414486        0.070483
medium         -0.675626        0.079552
high           -0.606980        0.101708
error variance  1.552578             NaN
  • income is positively associated with tree cover
  • but how do you interpret the coefficients for the indicator variables?

Interpreting indicator coefficients

Each level of the categorical variable has its own intercept:

\[ \begin{align*} \text{very low density:}\quad &\mathbb{E}\log(\text{cover}) = \beta_0 + \beta_1 \log(\text{income}) \\ \text{low density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_2) + \beta_1 \log(\text{income}) \\ \text{medium density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_3) + \beta_1 \log(\text{income}) \\ \text{high density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_4) + \beta_1 \log(\text{income}) \end{align*} \]

  • \(\beta_0\) is the intercept when density is very low – that category is called the reference level
  • \(\beta_2, \beta_3, \beta_4\) are interpreted relative to the reference level:
    • \(\beta_2\) is the difference in expected log cover between low density and very low density blocks, after accounting for income
    • \(\beta_3\) is the difference in expected log cover between medium density and very low density blocks, after accounting for income
    • \(\beta_4\) is the difference in expected log cover between high density and very low density blocks, after accounting for income
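To see the four intercepts numerically, a quick sketch using rslt from the fit summary above:

# implied intercepts by population density level
b = rslt.params
pd.Series({
    'very low': b['const'],                    # reference level
    'low':      b['const'] + b['low'],
    'medium':   b['const'] + b['medium'],
    'high':     b['const'] + b['high']
})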

Interpreting estimates

                estimate  standard error
const          -4.751870        0.621673
log_income      0.631469        0.058290
low            -0.414486        0.070483
medium         -0.675626        0.079552
high           -0.606980        0.101708
error variance  1.552578             NaN
  • each doubling of mean income is associated with an estimated 55% increase in median tree cover, after accounting for population density
  • census blocks with higher population densities are estimated as having a median tree canopy up to 50% lower than census blocks with very low population densities, after accounting for mean income
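These figures come from back-transforming the coefficients off the log scale; a sketch of the arithmetic, using rslt from the fit above:

# back-transforming the log-log / log-linear coefficients
b = rslt.params
print(2 ** b['log_income'] - 1)                    # about 0.55: doubling income -> ~55% higher median cover
print(1 - np.exp(b[['low', 'medium', 'high']]))    # about 0.34, 0.49, 0.45: reduction relative to very low density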

Prediction

Prediction works the same way, but we need to supply the indicator encoding rather than the categorical variable level.

x_new = np.array([1, np.log(115000), 0, 1, 0])
pred = rslt.get_prediction(x_new)
np.exp(pred.summary_frame())
       mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  6.895105  1.102304       5.696153       8.346419      0.594349     79.990893

Check your understanding and fill in the blanks:

  • the median tree cover for a _______ density census block with mean income _______ is estimated to be between _______ and _______ percent
  • the tree cover for a _______ density census block with mean income _______ is estimated to be between _______ and _______ percent

Comments on scope of inference

The data in this case study are from a random sample of census blocks in the San Diego urban area.

They are therefore representative of census blocks in the San Diego urban area (specifically, those with at least some canopy cover).

The model approximates the actual associations between summer temperatures, tree cover, income, and population density in the region.