Loma Pelona Center 1pm – 4pm
We fit this model to the tree cover data:
\[ \log(\text{cover}_i) = \beta_0 + \beta_1 \log(\text{income}_i) + \beta_2 \text{low}_i + \beta_3 \text{med}_i + \beta_4 \text{high}_i + \epsilon_i \]
Each level of population density has its own intercept:
\[ \begin{align*} \text{very low density:}\quad &\mathbb{E}\log(\text{cover}) = \beta_0 + \beta_1 \log(\text{income}) \\ \text{low density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_2) + \beta_1 \log(\text{income}) \\ \text{medium density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_3) + \beta_1 \log(\text{income}) \\ \text{high density:}\quad &\mathbb{E}\log(\text{cover}) = (\beta_0 + \beta_4) + \beta_1 \log(\text{income}) \end{align*} \]
\(\beta_2, \beta_3, \beta_4\) represent the differences in expected log cover between the very low density level and the low, medium, and high density levels, after accounting for income.
| | estimate | standard error |
|---|---|---|
| const | -4.751870 | 0.621673 |
| log_income | 0.631469 | 0.058290 |
| low | -0.414486 | 0.070483 |
| medium | -0.675626 | 0.079552 |
| high | -0.606980 | 0.101708 |
| error variance | 1.552578 | NaN |
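The implied intercept for each density level can be computed from the estimates above; a quick sketch:

```python
import numpy as np

const = -4.751870  # baseline (very low density) intercept from the table
offsets = {'very low': 0.0, 'low': -0.414486, 'medium': -0.675626, 'high': -0.606980}

# each level's intercept is the baseline plus that level's offset
intercepts = {level: const + delta for level, delta in offsets.items()}
print(intercepts)
```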
The model is \(\log (y) \sim N(x\beta, \sigma^2)\); so \(y\) is what’s known as a lognormal random variable.
From the properties of the lognormal distribution:
\[ e^{x\beta} = \text{median}(y) \]
So when parameters are back-transformed, they should be interpreted in terms of the median response.
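The median property is easy to check by simulation; a minimal numpy sketch (the parameter values are arbitrary):

```python
import numpy as np

# simulate a lognormal sample: log(y) ~ N(mu, sigma^2)
rng = np.random.default_rng(42)
mu, sigma = 1.0, 0.8
y = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

# the median of y is exp(mu); the mean is exp(mu + sigma^2 / 2), which is larger
print(np.median(y))  # close to exp(1) = 2.718...
print(y.mean())      # close to exp(1.32) = 3.743...
```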
Doubling income is associated with an estimated 55% increase in median tree cover: \[ e^{\hat{\beta}_1\log(2)} = 2^{\hat{\beta}_1} = 1.549 \]
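This factor can be reproduced directly from the log_income estimate in the table:

```python
import numpy as np

beta_1 = 0.631469  # log_income estimate from the table above

# doubling income multiplies median cover by exp(beta_1 * log(2)) = 2 ** beta_1
factor = np.exp(beta_1 * np.log(2))
print(round(factor, 3))  # 1.549
```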
Median cover increases by a factor of 1.549, i.e., by 54.9%:
# prediction for a medium-density tract with income $115,000
# (columns: const, log_income, low, medium, high)
x_new = np.array([1, np.log(115000), 0, 1, 0])
pred = rslt.get_prediction(x_new)
# back-transform to the original (cover) scale
np.exp(pred.summary_frame())
| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 6.895105 | 1.102304 | 5.696153 | 8.346419 | 0.594349 | 79.990893 |
The linearity and constant variance assumptions can be assessed by plotting residuals against fitted values:
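A sketch of that diagnostic with simulated data (numpy only; in these notes one would draw the plot with Altair):

```python
import numpy as np

# simulate data satisfying the linear model assumptions
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

# fit by least squares; compute fitted values and residuals
X = np.column_stack([np.ones(n), x1])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

# plotting resid against fitted should show a patternless band around zero
# when linearity and constant variance hold; curvature or funneling signals violations
print(resid.mean())  # numerically zero, since the model includes an intercept
```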
4831 responses from the 2011-2012 National Health and Nutrition Examination Survey (NHANES):
| | Gender | Age | BMI | Diabetes |
|---|---|---|---|---|
| 0 | male | 14 | 17.3 | No |
| 1 | female | 43 | 33.3 | No |
| 2 | male | 80 | 33.9 | No |
| 3 | male | 80 | 33.9 | No |
Is BMI a risk factor for diabetes after adjusting for age and sex?
Broadly, we can answer the question by estimating the dependence of diabetes status on age, sex, and BMI.
An additive model might look something like this: \[ \text{diabetes}_i \longleftarrow \beta_1\text{age}_i\;+\; \beta_2\text{male}_i\;+\;\beta_3\text{BMI}_i \]
To answer the question, fit the model and examine \(\beta_3\).
Note that the response variable – whether the respondent has diabetes – is categorical.
We can encode this using an indicator variable, which results in a binary response:
Remember, a statistical model is a probability distribution, so we need to choose one that’s appropriate for binary outcomes. Ideas?
One might think:
\[ \text{diabetes}_i = \beta_0 + \beta_1\text{age}_i + \beta_2\text{male}_i+\beta_3\text{BMI}_i + \epsilon_i \]
But \(\text{diabetes}_i \not\sim N(x\beta, \sigma^2)\)
Note that you can still fit this model.
# explanatory variable matrix
gender_indicator = pd.get_dummies(diabetes.Gender, drop_first = True)
x_vars = pd.concat([gender_indicator, diabetes.loc[:, ['Age', 'BMI']]], axis = 1)
x = sm.add_constant(x_vars)
y = (diabetes.Diabetes == 'Yes').astype('int')
# fit model
mlr = sm.OLS(endog = y, exog = x)
mlr.fit().params
const -0.169002
male 0.010837
Age 0.002425
BMI 0.005602
dtype: float64
So you have to discern that it isn’t appropriate. A few ways to tell:
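One such check, using the OLS estimates above: the fitted value for the first respondent in the data table (male, age 14, BMI 17.3) is negative, which cannot be a probability:

```python
import numpy as np

# OLS estimates in column order: const, male, Age, BMI
beta_ols = np.array([-0.169002, 0.010837, 0.002425, 0.005602])

# first respondent from the data table: male, age 14, BMI 17.3
x_0 = np.array([1, 1, 14, 17.3])

fitted = x_0 @ beta_ols
print(round(fitted, 4))  # negative, so it cannot be a probability
```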
Attempts at model visualization will look something like this:
# grid of BMI values for a male respondent of median age
age_grid = x.Age.median()
male_grid = np.array([1])
bmi_grid = np.linspace(x.BMI.min(), x.BMI.max(), 100)
cx, mx, ax, bx = np.meshgrid(np.array([1]), male_grid, age_grid, bmi_grid)
# column order must match the fitted model's explanatory variables:
# const, male, Age, BMI
grid = np.array([cx.ravel(), mx.ravel(), ax.ravel(), bx.ravel()]).T
grid_df = pd.DataFrame(grid, columns = ['const', 'male', 'Age', 'BMI'])
pred_df = mlr.fit().get_prediction(grid_df).summary_frame()
pred_df = pd.concat([grid_df, pred_df], axis = 1)
diabetes['diabetes_indicator'] = y
points = alt.Chart(diabetes).mark_circle().encode(
    x = alt.X('BMI', title = 'body mass index'),
    y = alt.Y('diabetes_indicator:Q', title = 'diabetes')
)
line = alt.Chart(pred_df).mark_line().encode(
    x = 'BMI',
    y = 'mean'
)
(points + line).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
For a binary response \(Y \in \{0, 1\}\), we model \(P(Y = 1)\) as a function of the explanatory variable(s) \(x\):
\[ P(Y = 1) = f(x) \]
Of course, we don’t directly observe \(P(Y = 1)\) – but there are various ways around this.
The most common approach to modeling binary responses is logistic regression:
\[ \log\left(\frac{p}{1 - p}\right) = x'\beta \]
This model holds that the log odds of the outcome of interest is a linear function of the explanatory variable(s)
What does the model imply about the probability (rather than log-odds) of the outcome of interest? \[ \log\left(\frac{p}{1 - p}\right) = x'\beta \quad\Longleftrightarrow\quad P(Y = 1) = \;?? \]
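Solving the log-odds relation for \(p\) gives the logistic (inverse-logit) function; a quick numpy check that the two transforms are inverses:

```python
import numpy as np

def logit(p):
    """Log-odds transform: (0, 1) -> R."""
    return np.log(p / (1 - p))

def logistic(z):
    """Inverse of logit: R -> (0, 1)."""
    return 1 / (1 + np.exp(-z))

z = np.linspace(-4, 4, 9)
print(np.allclose(logit(logistic(z)), z))  # True
print(logistic(0.0))                       # 0.5
```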
The logistic function \(f(z) = \frac{1}{1 + e^{-z}}\) looks like this:
The model makes two key assumptions:
The model is fit by maximum likelihood: find the parameters for which the observed data are most likely. The likelihood (joint distribution) is constructed from the model and the Bernoulli distribution.
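The likelihood construction can be sketched with numpy (toy data, not the NHANES sample); fitting the model amounts to minimizing this function over \(\beta\):

```python
import numpy as np

def neg_log_likelihood(beta, X, y):
    """Bernoulli negative log-likelihood under the logistic model."""
    p = 1 / (1 + np.exp(-X @ beta))  # fitted probabilities
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy data for illustration
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.integers(0, 2, size=50)

print(neg_log_likelihood(np.zeros(2), X, y))  # 50 * log(2) at beta = 0
```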
# fit model
model = sm.Logit(endog = y, exog = x)
fit = model.fit()
# parameter estimates
coef_tbl = pd.DataFrame({
'estimate': fit.params,
'standard error': np.sqrt(fit.cov_params().values.diagonal())},
index = x.columns
)
coef_tbl
Optimization terminated successfully.
Current function value: 0.213931
Iterations 8
| | estimate | standard error |
|---|---|---|
| const | -8.199231 | 0.357565 |
| male | 0.270378 | 0.117799 |
| Age | 0.053200 | 0.003512 |
| BMI | 0.100607 | 0.007862 |
Similar to linear regression, coefficients give the change in log-odds associated with incremental changes in the explanatory variables.
On the scale of the linear predictor:
A one-unit increase in BMI is associated with an estimated 0.1 increase in log odds of diabetes after adjusting for age and sex
On the scale of the odds:
A one-unit increase in BMI is associated with an estimated 10% increase in the odds of diabetes after adjusting for age and sex
On the probability scale, the increase depends on the starting value of BMI.
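These statements follow from exponentiating the BMI coefficient; a sketch using the estimate from the table:

```python
import numpy as np

beta_bmi = 0.100607  # BMI estimate from the table above

# odds multiplier per one-unit increase in BMI
odds_ratio = np.exp(beta_bmi)
print(round(odds_ratio, 3))  # 1.106, i.e., about a 10% increase in odds

# on the probability scale, the change depends on the baseline log-odds
p = lambda eta: 1 / (1 + np.exp(-eta))
print(p(-4 + beta_bmi) - p(-4))  # small change at low baseline risk
print(p(0 + beta_bmi) - p(0))    # larger change near p = 0.5
```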
One can also give confidence intervals. These are based on large-sample approximations.
# confidence intervals
ci = fit.conf_int().rename(columns = {0:'lower',1:'upper'})
np.exp(ci.loc['BMI'])
lower 1.088933
upper 1.123013
Name: BMI, dtype: float64
With 95% confidence, each 1-unit increase in BMI is associated with an estimated increase in odds of diabetes between 8.9% and 12.3% after adjusting for age and sex
The fitted values for logistic regression are fitted probabilities (not outcomes).
\[ \hat{p}_i = \frac{1}{1 + e^{-x_i'\hat{\beta}}} \]
To obtain probabilities, one could manually back-transform the linear predictor; alternatively, for each observation (or for new observations), they can be computed directly from the fitted model.
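A sketch of the back-transformation using the reported coefficient estimates (the covariate values below are hypothetical, chosen for illustration):

```python
import numpy as np

# reported estimates, in the model's column order: const, male, Age, BMI
beta_hat = np.array([-8.199231, 0.270378, 0.053200, 0.100607])

# hypothetical respondent: male, age 50, BMI 30
x_i = np.array([1, 1, 50, 30])

# back-transform the linear predictor to a probability
p_hat = 1 / (1 + np.exp(-x_i @ beta_hat))
print(round(p_hat, 3))  # about 0.095
```

In statsmodels, `fit.predict(x)` performs this back-transformation and returns the fitted probabilities directly.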
But what if we want to classify a person as diabetic or not diabetic? Should we declare a case when…
If we use a low threshold for classification, say:
\[ \hat{Y} = 1 \quad\Longleftrightarrow\quad \hat{p} > 0.1 \]
Then the classifications will be more sensitive to cases – most cases of diabetes will be correctly classified.
If we use a high threshold instead, say:
\[ \hat{Y} = 1 \quad\Longleftrightarrow\quad \hat{p} > 0.9 \]
Then the classifications will not be very sensitive to cases, but they will be fairly specific – most people without diabetes will be correctly classified.
For any given classification threshold, we can cross-tabulate the classifications with the observed outcomes:
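A sketch of such a cross-tabulation with hypothetical outcomes and fitted probabilities (with pandas, `pd.crosstab(y_obs, y_hat)` would give the same table):

```python
import numpy as np

y_obs = np.array([0, 0, 0, 1, 1, 0, 1, 0])                   # hypothetical outcomes
p_hat = np.array([0.1, 0.4, 0.2, 0.8, 0.3, 0.6, 0.9, 0.05])  # hypothetical fitted probabilities
y_hat = (p_hat > 0.5).astype(int)                             # classify at threshold 0.5

# rows index the observed class, columns the predicted class
confusion = np.zeros((2, 2), dtype=int)
for obs, pred in zip(y_obs, y_hat):
    confusion[obs, pred] += 1
print(confusion)
```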
Using the more-likely-than-not criterion is very specific (high true negative rate) but not at all sensitive (low true positive rate).
The proportion of correctly classified observations is:
0.9219623266404471
Examining class-wise error rates reveals how asymmetric the classifications are:
array([[0.99574468, 0.00425532],
[0.97814208, 0.02185792]])
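The row-normalization that produces these rates can be sketched as follows; the counts below are chosen to be consistent with the rates and overall accuracy reported above:

```python
import numpy as np

# confusion-matrix counts (rows: observed class, columns: predicted class),
# chosen to reproduce the class-wise rates shown above
counts = np.array([[4446, 19],
                   [ 358,  8]])

# class-wise rates: normalize each row by its total
# row 0: true negative rate, false positive rate
# row 1: false negative rate, true positive rate
rates = counts / counts.sum(axis=1, keepdims=True)
print(rates.round(8))

# overall accuracy: diagonal counts over the total
print(counts.trace() / counts.sum())
```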
In this case we can do better by choosing a low classification threshold \(\hat{p} > 0.1\):