
PSTAT100 Spring 2023


- MP2 due today; late deadline is Friday
- HW4 due next *Wednesday* (not Monday); late deadline is Friday
- No OH today
- lecture cancelled Monday 6/5 for data science event

Loma Pelona Center 1pm – 4pm

- 1pm: Graduate school presentation and panel discussion with recent students
- 2pm: Capstone project showcase
- 3pm: Careers in data science panel discussion

We fit this model to the tree cover data:

\[ \log(\text{cover}_i) = \beta_0 + \beta_1 \log(\text{income}_i) + \beta_2 \text{low}_i + \beta_3 \text{med}_i + \beta_4 \text{high}_i + \epsilon_i \]

Each level of population density has its own intercept:

\[ \begin{align*} \text{very low density:}\quad &\mathbb{E}\log(\text{cover}) = \beta_0 + \beta_1 \log(\text{income}) \\ \text{low density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_2) + \beta_1 \log(\text{income}) \\ \text{medium density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_3) + \beta_1 \log(\text{income}) \\ \text{high density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_4) + \beta_1 \log(\text{income}) \end{align*} \]

\(\beta_2, \beta_3, \beta_4\) represent the *differences in expected log cover* between very-low-density blocks and low-, medium-, and high-density blocks, respectively, after accounting for income

| | estimate | standard error |
|---|---|---|
| const | -4.751870 | 0.621673 |
| log_income | 0.631469 | 0.058290 |
| low | -0.414486 | 0.070483 |
| medium | -0.675626 | 0.079552 |
| high | -0.606980 | 0.101708 |
| error variance | 1.552578 | NaN |

- each doubling of mean income is associated with an estimated 55% increase in median tree cover, after accounting for population density
- census blocks with higher population densities are estimated to have a median tree canopy up to 50% lower than census blocks with very low population densities, after accounting for mean income

The model is \(\log (y) \sim N(x\beta, \sigma^2)\); so \(y\) is what’s known as a *lognormal* random variable.

From the properties of the lognormal distribution:

\[ e^{x\beta} = \text{median}(y) \]

So when parameters are back-transformed, they should be interpreted in terms of the *median* response.

\[ 55\% \text{ increase in median tree cover}: e^{\hat{\beta}_1\log(2)} = 1.549 \]

Median cover increases by a factor of 1.549, *i.e.*, increases by 54.9%:

- doubling income increments log income by \(\log(2)\)
- \(\hat{\beta}_1\log(2)\) gives the associated change in mean log cover
- exponentiating the change in mean log cover gives the multiplicative change in median cover
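As a quick numeric check of the 54.9% figure, using the estimate \(\hat{\beta}_1 = 0.631469\) from the coefficient table (pure NumPy):

```python
import numpy as np

beta1_hat = 0.631469                      # log_income coefficient from the fitted model
factor = np.exp(beta1_hat * np.log(2))    # multiplicative change in median cover per doubling of income
# factor ≈ 1.549, i.e. a 54.9% increase in median tree cover
```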

```
# design row in the order of the model matrix: const, log(income), low, medium, high
# i.e. a medium-density census block with mean income $115,000
x_new = np.array([1, np.log(115000), 0, 1, 0])
pred = rslt.get_prediction(x_new)
# back-transform to the original scale (median cover)
np.exp(pred.summary_frame())
```

| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 6.895105 | 1.102304 | 5.696153 | 8.346419 | 0.594349 | 79.990893 |

Fill in the blanks:

- the median tree cover for a _______ density census block with mean income _______ is estimated to be between _______ and _______ percent
- the tree cover for a _______ density census block with mean income _______ is estimated to be between _______ and _______ percent

The linearity and constant variance assumptions can be assessed by plotting residuals against fitted values:

4831 responses from the 2011-2012 National Health and Nutrition Examination Survey (NHANES):

| | Gender | Age | BMI | Diabetes |
|---|---|---|---|---|
| 0 | male | 14 | 17.3 | No |
| 1 | female | 43 | 33.3 | No |
| 2 | male | 80 | 33.9 | No |
| 3 | male | 80 | 33.9 | No |

Is BMI a risk factor for diabetes after adjusting for age and sex?

Broadly, we can answer the question by estimating the dependence of diabetes status on age, sex, and BMI.

An additive model might look something like this: \[ \text{diabetes}_i \longleftarrow \beta_1\text{age}_i\;+\; \beta_2\text{male}_i\;+\;\beta_3\text{BMI}_i \]

To answer the question, fit the model and examine \(\beta_3\).

Note that the response variable – whether the respondent has diabetes – is categorical.

We can encode this using an indicator variable, which results in a binary response:

Remember, a statistical model is a probability distribution, so we need to choose one that’s appropriate for binary outcomes. Ideas?

One might think:

\[ \text{diabetes}_i = \beta_0 + \beta_1\text{age}_i + \beta_2\text{male}_i+\beta_3\text{BMI}_i + \epsilon_i \]

But \(\text{diabetes}_i \not\sim N(x\beta, \sigma^2)\)

- discrete, not continuous
- normal model doesn’t make sense for a binary response

Note that you *can* still fit this model.

```
# explanatory variable matrix
gender_indicator = pd.get_dummies(diabetes.Gender, drop_first = True)
x_vars = pd.concat([gender_indicator, diabetes.loc[:, ['Age', 'BMI']]], axis = 1)
x = sm.add_constant(x_vars)
y = (diabetes.Diabetes == 'Yes').astype('int')
# fit model
mlr = sm.OLS(endog = y, exog = x)
mlr.fit().params
```

```
const -0.169002
male 0.010837
Age 0.002425
BMI 0.005602
dtype: float64
```

So you have to discern that it isn’t appropriate. A few ways to tell:

- parameter interpretations won’t make sense – *e.g.*, age is associated with a 0.0024 increase in diabetes presence
- model may yield predictions that are negative or greater than one
- plots will look odd
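For instance, plugging the first respondent from the data preview into the printed OLS estimates yields a negative "probability" (a pure-NumPy check):

```python
import numpy as np

# OLS estimates printed above, in order: const, male, Age, BMI
beta = np.array([-0.169002, 0.010837, 0.002425, 0.005602])
# first respondent in the data preview: male, age 14, BMI 17.3
x0 = np.array([1, 1, 14, 17.3])
fitted = x0 @ beta
# fitted value is negative -- impossible as a probability
```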

Attempts at model visualization will look something like this:

```
# prediction grid: male respondents at the median age, BMI over its observed range
age_grid = x.Age.median()
male_grid = np.array([1])
bmi_grid = np.linspace(x.BMI.min(), x.BMI.max(), 100)

# grid columns must follow the same order as x: const, male, Age, BMI
cx, mx, agx, bx = np.meshgrid(np.array([1]), male_grid, age_grid, bmi_grid)
grid = np.array([cx.ravel(), mx.ravel(), agx.ravel(), bx.ravel()]).T
grid_df = pd.DataFrame(grid, columns = ['const', 'male', 'Age', 'BMI'])
pred_df = mlr.fit().get_prediction(grid_df).summary_frame()
pred_df = pd.concat([grid_df, pred_df], axis = 1)

# overlay the fitted line on the observed binary outcomes
diabetes['diabetes_indicator'] = y
points = alt.Chart(diabetes).mark_circle().encode(
    x = alt.X('BMI', title = 'body mass index'),
    y = alt.Y('diabetes_indicator:Q', title = 'diabetes')
)
line = alt.Chart(pred_df).mark_line().encode(
    x = 'BMI',
    y = 'mean'
)
(points + line).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
```

For a binary response \(Y \in \{0, 1\}\), we model \(P(Y = 1)\) as a function of the explanatory variable(s) \(x\):

\[ P(Y = 1) = f(x) \]

Of course, we don’t directly observe \(P(Y = 1)\) – but there are various ways around this.

The most common approach to modeling binary responses is *logistic regression*:

\[ \log\left(\frac{p}{1 - p}\right) = x'\beta \]

- \(p = P(Y = 1)\)
- \(x\) is a vector of explanatory variable(s)
- \(\beta\) is a vector of parameters

This model holds that

the log odds of the outcome of interest is a linear function of the explanatory variable(s)

What does the model imply about the probability (rather than log-odds) of the outcome of interest? \[ \log\left(\frac{p}{1 - p}\right) = x'\beta \quad\Longleftrightarrow\quad P(Y = 1) = \;?? \]
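Solving the log-odds equation for \(p\) fills in the blank:

\[ \log\left(\frac{p}{1 - p}\right) = x'\beta \quad\Longleftrightarrow\quad \frac{p}{1 - p} = e^{x'\beta} \quad\Longleftrightarrow\quad p = \frac{e^{x'\beta}}{1 + e^{x'\beta}} = \frac{1}{1 + e^{-x'\beta}} \]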

The logistic function looks like this:
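A minimal NumPy sketch of the logistic (sigmoid) curve over a grid of linear-predictor values:

```python
import numpy as np

z = np.linspace(-6, 6, 121)    # linear predictor values x'beta
p = 1 / (1 + np.exp(-z))       # logistic function
# S-shaped: strictly increasing, bounded in (0, 1), and p = 0.5 at z = 0
```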

The model makes two key assumptions:

- the probability of the outcome changes monotonically with each explanatory variable
- observations are independent (used to obtain a joint distribution)

The model is fit by maximum likelihood: find the parameters for which the observed data are most likely. The likelihood (joint distribution) is constructed from the model and the Bernoulli distribution.
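As a sketch of what "fit by maximum likelihood" means here, the quantity the optimizer minimizes is the average negative Bernoulli log-likelihood (hypothetical helper, pure NumPy; `X` and `y` below are toy values, not the NHANES data):

```python
import numpy as np

def neg_log_lik(beta, X, y):
    """Average negative Bernoulli log-likelihood for logistic regression."""
    p = 1 / (1 + np.exp(-X @ beta))   # model probabilities for each observation
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy data: at beta = 0 every p is 0.5, so the value is log(2) ≈ 0.693
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1, 0, 1])
neg_log_lik(np.zeros(2), X, y)   # ≈ 0.693
```

The "Current function value" printed in the fitting output below is this averaged quantity evaluated at the optimum.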

```
# fit model
model = sm.Logit(endog = y, exog = x)
fit = model.fit()

# parameter estimates
coef_tbl = pd.DataFrame({
    'estimate': fit.params,
    'standard error': np.sqrt(fit.cov_params().values.diagonal())},
    index = x.columns
)
coef_tbl
```

```
Optimization terminated successfully.
Current function value: 0.213931
Iterations 8
```

| | estimate | standard error |
|---|---|---|
| const | -8.199231 | 0.357565 |
| male | 0.270378 | 0.117799 |
| Age | 0.053200 | 0.003512 |
| BMI | 0.100607 | 0.007862 |

Similar to linear regression, coefficients give the change in log-odds associated with incremental changes in the explanatory variables.

On the scale of the linear predictor:

A one-unit increase in BMI is associated with an estimated 0.1 increase in log odds of diabetes after adjusting for age and sex

On the scale of the odds:

A one-unit increase in BMI is associated with an estimated 10% increase in the odds of diabetes after adjusting for age and sex

On the probability scale, the increase depends on the starting value of BMI.
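Back-transforming the BMI coefficient from the table above confirms the odds interpretation (pure NumPy):

```python
import numpy as np

beta_bmi = 0.100607              # BMI coefficient from the fitted model
odds_ratio = np.exp(beta_bmi)    # multiplicative change in odds per one-unit increase in BMI
# odds_ratio ≈ 1.106, i.e. roughly a 10% increase in the odds of diabetes
```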

One can also give confidence intervals. These are based on large-sample approximations.

```
# confidence intervals
ci = fit.conf_int().rename(columns = {0:'lower',1:'upper'})
np.exp(ci.loc['BMI'])
```

```
lower 1.088933
upper 1.123013
Name: BMI, dtype: float64
```

With 95% confidence, each 1-unit increase in BMI is associated with an estimated increase in odds of diabetes between 8.9% and 12.3% after adjusting for age and sex

The fitted values for logistic regression are fitted *probabilities* (not outcomes).

\[ \hat{p}_i = \frac{1}{1 + e^{-x_i'\hat{\beta}}} \]

To obtain probabilities, one could manually back-transform:
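For example, using the estimates from the coefficient table above and a hypothetical respondent (a 50-year-old male with BMI 30; illustrative values, not from the data):

```python
import numpy as np

# estimates from the logistic fit above, in order: const, male, Age, BMI
beta_hat = np.array([-8.199231, 0.270378, 0.053200, 0.100607])
# hypothetical respondent: male, age 50, BMI 30
x_i = np.array([1, 1, 50, 30])
p_hat = 1 / (1 + np.exp(-(x_i @ beta_hat)))
# p_hat ≈ 0.095
```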

For each observation (or for new observations), the same probabilities can also be obtained directly from the fitted model.

But what if we want to classify a person as diabetic or not diabetic? Should we declare a case when…

- more probable than not: \(\hat{p} > 0.5\)?
- highly probable, say \(\hat{p} > 0.8\)?
- somewhat probable, say \(\hat{p} > 0.2\)?

If we use a *low* threshold for classification, say:

\[ \hat{Y} = 1 \quad\Longleftrightarrow\quad \hat{p} > 0.1 \]

Then the classifications will be more *sensitive* to cases – most cases of diabetes will be correctly classified.

If we use a *high* threshold instead, say:

\[ \hat{Y} = 1 \quad\Longleftrightarrow\quad \hat{p} > 0.9 \]

Then the classification will not be very sensitive to cases, but they will be fairly *specific* – classifications will be correct for most people without diabetes.

For any given classification threshold, we can cross-tabulate the classifications with the observed outcomes:

- rows are observed outcomes
- columns are predicted outcomes

Using the more-likely-than-not criterion is very *specific* (high true negative rate) but not at all *sensitive* (low true positive rate).

The proportion of correctly classified observations is:

`0.9219623266404471`

Examining class-wise error rates reveals how asymmetric the classifications are:

```
array([[0.99574468, 0.00425532],
       [0.97814208, 0.02185792]])
```

- same layout as confusion matrix, but with entries divided by the total number of outcomes in each class
- note 97.8% error rate among diabetes cases

In this case we can do better by choosing a low classification threshold \(\hat{p} > 0.1\):

- higher overall error rate \(\frac{1097}{4831} = 0.227\)
- but about 70% accurate within each class (diabetic and non-diabetic)