PSTAT100 - Classification

Announcements

MP2 due today; late deadline is Friday
HW4 due next Wednesday (not Monday); late deadline is Friday
No OH today
lecture cancelled Monday 6/5 for data science event

Data science event 6/5

Loma Pelona Center 1pm – 4pm

1pm: Graduate school presentation and panel discussion with recent students
2pm: capstone project showcase
3pm: Careers in data science panel discussion

From last time

We fit this model to the tree cover data:

Each level of population density has its own intercept:

represent the difference in expected log cover between very low density and low, medium, high density after accounting for income

Interpreting estimates

	estimate	standard error
const	-4.751870	0.621673
log_income	0.631469	0.058290
low	-0.414486	0.070483
medium	-0.675626	0.079552
high	-0.606980	0.101708
error variance	1.552578	NaN

each doubling of mean income is associated with an estimated 55% increase in median tree cover, after accounting for population density
census blocks with higher population densities are estimated as having a median tree canopy up to 50% lower than census blocks with very low population densities, after accounting for mean income

On log-transforming the response

The model is ; so is what’s known as a lognormal random variable.

From the properties of the lognormal distribution:

So when parameters are back-transformed, they should be interpreted in terms of the median response.

Interpretations, again

Median cover increases by a factor of 1.549, i.e., increases by 54.9%:

doubling income increments log income by
gives the associated change in mean log cover
exponentiating the change in mean log cover gives the multiplicative change in median cover

Prediction

x_new = np.array([1, np.log(115000), 0, 1, 0])
pred = rslt.get_prediction(x_new)
np.exp(pred.summary_frame())

	mean	mean_se	mean_ci_lower	mean_ci_upper	obs_ci_lower	obs_ci_upper
0	6.895105	1.102304	5.696153	8.346419	0.594349	79.990893

Fill in the blanks:

the median tree cover for a _______ density census block with mean income _______ is estimated to be between _______ and _______ percent
the tree cover for a _______ density census block with mean income _______ is estimated to be between _______ and _______ percent

Model checking

The linearity and constant variance assumptions can be assessed by plotting residuals against fitted values:

Code

alt.Chart(pd.DataFrame({
    'fitted': rslt.fittedvalues,
    'resid': rslt.resid
})).mark_circle().encode(
    x = 'fitted',
    y = 'resid'
)

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

Should see minimal pattern:

centered at zero
even spread in either direction

Diabetes data

4831 responses from the 2011-2012 National Health and Nutrition Examination Survey (NHANES):

	Gender	Age	BMI	Diabetes
0	male	14	17.3	No
1	female	43	33.3	No
2	male	80	33.9	No
3	male	80	33.9	No

Is BMI a risk factor for diabetes after adjusting for age and sex?

Model sketch

Broadly, we can answer the question by estimating the dependence of diabetes status on age, sex, and BMI.

An additive model might look something like this:

To answer the question, fit the model and examine .

Binary response

Note that the response variable – whether the respondent has diabetes – is categorical.

diabetes.Diabetes.unique()

array(['No', 'Yes'], dtype=object)

We can encode this using an indicator variable, which results in a binary response:

y = (diabetes.Diabetes == 'Yes').astype('int')
y.unique()

array([0, 1])

Remember, a statistical model is a probability distribution, so we need to choose one that’s appropriate for binary outcomes. Ideas?

What not to do

One might think:

But

discrete, not continuous
normal model doesn’t make sense for a binary response

What not to do

Note that you can still fit this model.

Code

# explanatory variable matrix
gender_indicator = pd.get_dummies(diabetes.Gender, drop_first = True)
x_vars = pd.concat([gender_indicator, diabetes.loc[:, ['Age', 'BMI']]], axis = 1)
x = sm.add_constant(x_vars)
y = (diabetes.Diabetes == 'Yes').astype('int')

# fit model
mlr = sm.OLS(endog = y, exog = x)
mlr.fit().params

const   -0.169002
male     0.010837
Age      0.002425
BMI      0.005602
dtype: float64

So you have to discern that it isn’t appropriate. A few ways to tell:

parameter interpretations won’t make sense – e.g. age is associated with a 0.0024 increase in diabetes presence
model may yield predictions that are negative or greater than one
plots will look odd

What not to do

Attempts at model visualization will look something like this:

Code

age_grid = x.Age.median()
male_grid = np.array([1])
bmi_grid = np.linspace(x.BMI.min(), x.BMI.max(), 100)
cx, ax, mx, bx = np.meshgrid(np.array([1]), age_grid, male_grid, bmi_grid)
grid = np.array([cx.ravel(), ax.ravel(), mx.ravel(), bx.ravel()]).T
grid_df = pd.DataFrame(grid, columns = ['const', 'age', 'male', 'bmi'])
pred_df = mlr.fit().get_prediction(grid_df).summary_frame()
pred_df = pd.concat([grid_df, pred_df], axis = 1)
diabetes['diabetes_indicator'] = y

points = alt.Chart(diabetes).mark_circle().encode(
    x = alt.X('BMI', title = 'body mass index'),
    y = alt.Y('diabetes_indicator:Q', title = 'diabetes')
)

line = alt.Chart(pred_df).mark_line().encode(
    x = 'bmi',
    y = 'mean'
)

(points + line).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

Regression with a binary response

For a binary response , we model as a function of the explanatory variable(s) :

Of course, we don’t directly observe – but there are various ways around this.

Logistic regression model

The most common approach to modeling binary responses is logistic regression:

is a vector of explanatory variable(s)
is a vector of parameters

This model holds that the log odds of the outcome of interest is a linear function of the explanatory variable(s)

Logistic regression model

What does the model imply about the probability (rather than log-odds) of the outcome of interest?

Logistic regression model

The logistic function looks like this:

Code

from scipy.stats import norm
vals = pd.DataFrame({
    'x': np.linspace(-5, 5, 200)
})

vals['pr'] = 1/(1 + np.exp(-vals.x))

alt.Chart(vals).mark_line().encode(
    x = alt.X('x', title = 'x*beta'),
    y = alt.Y('pr', title = 'Pr(Y = 1)')
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

Assumptions

The model makes two key assumptions:

the probability of the outcome changes monotonically with each explanatory variable
observations are independent (used to obtain a joint distribution)

Estimation

The model is fit by maximum likelihood: find the parameters for which the observed data are most likely. The likelihood (joint distribution) is constructed from the model and the Bernoulli distribution.

# fit model
model = sm.Logit(endog = y, exog = x)
fit = model.fit()

# parameter estimates
coef_tbl = pd.DataFrame({
    'estimate': fit.params,
    'standard error': np.sqrt(fit.cov_params().values.diagonal())},
    index = x.columns
)
coef_tbl

Optimization terminated successfully.
         Current function value: 0.213931
         Iterations 8

	estimate	standard error
const	-8.199231	0.357565
male	0.270378	0.117799
Age	0.053200	0.003512
BMI	0.100607	0.007862

Parameter interpretations

Similar to linear regression, coefficients give the change in log-odds associated with incremental changes in the explanatory variables.

On the scale of the linear predictor:

A one-unit increase in BMI is associated with an estimated 0.1 increase in log odds of diabetes after adjusting for age and sex

On the scale of the odds:

A one-unit increase in BMI is associated with an estimated 10% increase in the odds of diabetes after adjusting for age and sex

On the probability scale, the increase depends on the starting value of BMI.

Confidence intervals

One can also give confidence intervals. These are based on large-sample approximations.

# confidence intervals
ci = fit.conf_int().rename(columns = {0:'lower',1:'upper'})
np.exp(ci.loc['BMI'])

lower    1.088933
upper    1.123013
Name: BMI, dtype: float64

With 95% confidence, each 1-unit increase in BMI is associated with an estimated increase in odds of diabetes between 8.9% and 12.3% after adjusting for age and sex

Fitted values

The fitted values for logistic regression are fitted probabilities (not outcomes).

statsmodels returns estimated log odds instead of fitted probabilities.

# log odds
fit.fittedvalues

0      -5.443558
1      -2.561428
2      -0.262283
3      -0.262283
4      -5.982723
          ...   
4826   -3.481417
4827   -3.481417
4828   -3.343464
4829   -1.970167
4830   -1.970167
Length: 4831, dtype: float64

Fitted values

To obtain probabilities, one could manually back-transform:

# compute fitted probabilities 'by hand'
fitted_probs = 1/(1 + np.exp(-fit.fittedvalues))
fitted_probs.head(5)

0    0.004305
1    0.071663
2    0.434803
3    0.434803
4    0.002516
dtype: float64

Or use the .predict() method:

# fitted probabilities
fit.predict(x).head(5)

0    0.004305
1    0.071663
2    0.434803
3    0.434803
4    0.002516
dtype: float64

Classification

For each observation (or new observations), probabilities can be computed directly from the fitted model:

But what if we want to classify a person as diabetic or not diabetic? Should we declare a case when…

more probable than not: ?
highly probable, say ?
somewhat probable, say ?

Sensitivity

If we use a low threshold for classification, say:

Then the classifications will be more sensitive to cases – most cases of diabetes will be correctly classified.

Specificity

If we use a high threshold instead, say:

Then the classification will not be very sensitive to cases, but they will be fairly specific – classifications will be correct for most people without diabetes.

Cross-tabulation

For any given classification threshold, we can cross-tabulate the classifications with the observed outcomes:

# confusion matrix
fit.pred_table(0.5)

array([[4446.,   19.],
       [ 358.,    8.]])

rows are observed outcomes
columns are predicted outcomes

Using the more-likely-than-not criterion is very specific (high true negative rate) but not at all sensitive (low true positive rate).

Overall accuracy is misleading

The proportion of correctly classified observations is:

# proportion of correctly classified observations
fit.pred_table().diagonal().sum()/len(y)

0.9219623266404471

This looks really good, but any method that classifies all or most observations as non-diabetic will achieve high accuracy because of the case imbalance in the data.

# proportion of non-diabetic respondents
np.mean(y == 0)

0.9242392879321052

Use class-wise error rates

Examining class-wise error rates reveals how asymmetric the classifications are:

Code

(fit.pred_table(0.5).T/fit.pred_table(0.5).sum(axis = 1)).T

array([[0.99574468, 0.00425532],
       [0.97814208, 0.02185792]])

same layout as confusion matrix, but with entries divided by the total number of outcomes in each class
note 97.8% error rate among diabetes cases

A better classifier

In this case we can do better by choosing a low classification threshold :

Code

fit.pred_table(0.1)

array([[3473.,  992.],
       [ 105.,  261.]])

higher overall error rate
but about 70% accurate within each class (diabetic and non-diabetic)

Class-wise errors:

Code

(fit.pred_table(0.1).T/fit.pred_table(0.1).sum(axis = 1)).T

array([[0.77782755, 0.22217245],
       [0.28688525, 0.71311475]])