Announcements
Lab 7 optional, turn in by end of day Friday 6/9
HW4 due Wednesday 6/7 with late submissions by Friday 6/9
Course project due Friday 6/16; groups preferred, max of 3
Extra OH Tuesday 1-3pm (during the scheduled final exam time)?
Last time
We fit the logistic regression model:
\[
\log\left(\frac{Pr(\text{diabetic}_i)}{1 - Pr(\text{diabetic}_i)}\right) = \beta_0 + \beta_1\text{age}_i + \beta_2\text{male}_i+\beta_3\text{BMI}_i
\]
And discussed:
model specification
parameter estimation
parameter interpretation
Last time
The logistic regression model can be employed as a classifier by specifying a classification rule:
\[
\text{classify as diabetic}
\quad\Longleftrightarrow\quad
\widehat{Pr}(\text{diabetic}) > c
\]
where \(\widehat{Pr}(\text{diabetic})\) is computed from the fitted logistic regression model.
Any classifier has some:
sensitivity (true positive rate)
specificity (true negative rate)
These rates vary depending on \(c\).
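As a concrete illustration, sensitivity and specificity can be computed for any particular choice of \(c\). A minimal sketch, assuming the labels `y` (0/1) and fitted probabilities `fitted_probs` used in the ROC code that follows; the threshold value is purely illustrative:
# sensitivity and specificity at a single illustrative threshold
import numpy as np
c = 0.5
pred = (fitted_probs > c).astype(int)
sensitivity = np.mean(pred[y == 1] == 1)   # true positive rate
specificity = np.mean(pred[y == 0] == 0)   # true negative rate
sensitivity, specificity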
ROC Curves
A plot of sensitivity (true positive rate) against 1 - specificity (false positive rate) across all unique classification thresholds is known as a receiver operating characteristic (ROC) curve.
Code
# imports (these may already be loaded from earlier in the lecture)
import numpy as np
import pandas as pd
import altair as alt
from sklearn import metrics

# compute ROC curve coordinates across all classification thresholds
fpr, tpr, thresh = metrics.roc_curve(y, fitted_probs)
roc = pd.DataFrame({
    'fpr': fpr,
    'tpr': tpr,
    'thresh': thresh
})

# two candidate thresholds: maximize tpr - fpr (Youden's J) and minimize distance to the upper left corner
roc_opt_ix = [(roc.tpr - roc.fpr).argmax(), ((1 - roc.tpr)**2 + roc.fpr**2).argmin()]
roc_opt = roc.loc[roc_opt_ix]

# plot the curve and highlight the two candidate points
roc_plot = alt.Chart(roc).mark_line().encode(
    x = alt.X('fpr', title = '1 - specificity'),
    y = alt.Y('tpr', title = 'sensitivity')
)
roc_pts = alt.Chart(roc_opt).mark_circle(
    fill = 'red',
    size = 100
).encode(
    x = 'fpr',
    y = 'tpr'
)
(roc_plot + roc_pts).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
closer to the upper left corner \(\longrightarrow\) less trade-off \(\longrightarrow\) better classifier
area under the curve (AUC) is often used as an accuracy metric (see the sketch after this list)
two common choices for \(c\) are highlighted:
the point closest to the upper left corner
the point that maximizes sensitivity + specificity (sometimes called Youden’s J statistic)
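The AUC mentioned above can be computed directly from the labels and fitted probabilities. A minimal sketch with scikit-learn, assuming `y`, `fitted_probs`, and the `metrics` import from the ROC code:
# area under the ROC curve: 1 is a perfect classifier, 0.5 is no better than chance
metrics.roc_auc_score(y, fitted_probs)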
Probit regression model
Logistic regression is not the only regression model for binary data. One common alternative is probit regression, where:
\[
P(Y = 1) = \Phi\left(x'\beta\right)
\]
\(\Phi\) is the standard normal CDF
similar assumptions to logistic regression – independence and monotonicity
Probit v. logit
The logistic function has heavier tails.
Code
from scipy.stats import norm

# grid of values
vals = pd.DataFrame({
    'x': np.linspace(-5, 5, 200)
})

# compute probit and logit
vals['logit'] = 1 / (1 + np.exp(-vals.x))
vals['probit'] = norm.cdf(vals.x)
vals = vals.melt(id_vars = 'x', var_name = 'model', value_name = 'pr')

# plot
alt.Chart(vals).mark_line().encode(
    x = alt.X('x', title = 'x*beta'),
    y = alt.Y('pr', title = 'Pr(Y = 1)'),
    color = 'model',
    stroke = 'model'
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)
Probit fit to diabetes data
Here are the estimates from the probit model:
Code
import statsmodels.api as sm

# fit probit model (y and x as in the logistic regression fit)
model_probit = sm.Probit(endog = y, exog = x)
fit_probit = model_probit.fit()

# parameter estimates and standard errors
coef_tbl_probit = pd.DataFrame({
    'estimate': fit_probit.params,
    'standard error': np.sqrt(fit_probit.cov_params().values.diagonal())},
    index = x.columns
)
coef_tbl_probit
Optimization terminated successfully.
         Current function value: 0.214265
         Iterations 8

        estimate  standard error
const  -4.215812        0.172291
male    0.116145        0.061036
Age     0.025876        0.001733
BMI     0.051143        0.004201
Coefficients are trickier to interpret directly, since \(\Phi^{-1}(p)\) is not a natural quantity (unlike the log-odds in logistic regression)
Challenges of interpretation
The effect of incremental changes in explanatory variables on predicted probabilities depends on your starting point.
# means of explanatory variables
x1 = x.mean()

# increment BMI twice
x2 = x1 + np.array([0, 0, 0, 1])
x3 = x2 + np.array([0, 0, 0, 1])
x_pred = pd.concat([x1, x2, x3], axis = 1).T
x_pred['male'] = np.array([0, 0, 0])

# compute predictions and differences in probability
preds = fit_probit.predict(x_pred)
preds.diff()
0         NaN
1    0.003587
2    0.003941
dtype: float64
a one-unit increase in BMI from 26.45 (sample mean) for a 37.6-year-old woman is associated with an estimated change in probability of diabetes of 0.0036
a one-unit increase in BMI from 27.45 (sample mean plus one) for a 37.6-year-old woman is associated with an estimated change in probability of diabetes of 0.0039
Centering for interpretability
If explanatory variables are centered, then the change in estimated probability associated with a 1-unit change from the mean (and reference level(s)) is:
\[
\Phi\left(\hat{\beta}_0 + \hat{\beta}_j\right) - \Phi\left(\hat{\beta}_0\right)
\]
Refitting the model after centering age and BMI and computing the above yields:
Code
# center explanatory variables (age and BMI)
x_vars_ctr = x_vars - x.mean()
x_ctr = pd.concat([x.loc[:, ['const', 'male']], x_vars_ctr.loc[:, ['Age', 'BMI']]], axis = 1)

# fit model
model_probit_ctr = sm.Probit(endog = y, exog = x_ctr)
fit_probit_ctr = model_probit_ctr.fit()

# baseline
probit_baseline = norm.cdf(fit_probit_ctr.params[0])

# changes in estimated probabilities associated with one-unit change from mean, keeping other variables at mean/reference
prob_diffs = norm.cdf(fit_probit_ctr.params[1:4] + fit_probit_ctr.params[0]) - probit_baseline

# print
pd.DataFrame({'change in probability': prob_diffs}, index = np.array(['male', 'age', 'BMI']))
Optimization terminated successfully.
         Current function value: 0.214265
         Iterations 8

      change in probability
male               0.008659
age                0.001772
BMI                0.003587
Centering for interpretation
Now the coefficient interpretations are:
the estimated probability that a woman of average age and BMI has diabetes is 0.029
among people of average age and BMI, men are more likely than women to be diabetic with an estimated difference in probability of 0.009
\(0.008659 = \Phi(\hat{\beta}_0 + \hat{\beta}_1) - \Phi(\hat{\beta}_0)\)
a one-year increase from the average age is associated with a change in the estimated probability that a woman of average BMI is diabetic of 0.002
\(0.001772 = \Phi(\hat{\beta}_0 + \hat{\beta}_2) - \Phi(\hat{\beta}_0)\)
a one-unit increase in BMI from the average is associated with a change in the estimated probability that a woman of average age is diabetic of 0.004
\(0.003587 = \Phi(\hat{\beta}_0 + \hat{\beta}_3) - \Phi(\hat{\beta}_0)\)
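As a quick check, the baseline probability and the male/female difference quoted above can be reproduced directly from the centered fit (a minimal sketch, assuming `fit_probit_ctr` and `norm` from the code above):
# baseline: estimated probability for a woman of average age and BMI (~0.029)
norm.cdf(fit_probit_ctr.params[0])
# difference for men vs. women at average age and BMI (~0.008659)
norm.cdf(fit_probit_ctr.params[0] + fit_probit_ctr.params[1]) - norm.cdf(fit_probit_ctr.params[0])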
Un/Supervised problems
Regression and classification are known as ‘supervised’ problems:
the response variable/outcome is observed
the modeling of the data is guided by the observed outcomes
By contrast, in ‘unsupervised’ problems:
the response variable/outcome is not observed
no ground truth to guide/supervise the modeling process
Clustering
Clustering is the unsupervised version of classification:
Can we classify observations into two or more groups based on \(p\) variables without knowing the true grouping structure?
can think of this as modeling an unobserved response
however, it is not necessary that distinct subpopulations exist in the data – clustering is often useful as an exploratory technique for examining multimodal distributions
Voting records, 116th House
Roll call votes of the 116th House of Representatives on bills and resolutions:
Code
members = pd.read_csv('data/members.csv').set_index('name_id')
votes = pd.read_csv('data/votes-clean.csv').set_index('name_id')
vote_info = pd.read_csv('data/votes-info.csv').set_index('rollcall_id')
votes.head(3)
name_id
A000374   -1  -1  -1  -1  -1   0  -1   1   0   1  ...   1   0   1   0   1   1  -1   1  -1  -1
A000370    1   1   1   1   1   0   1   1   1   1  ...   1   1   1   1   1   1   1   1   1   1
A000055   -1  -1  -1  -1  -1  -1  -1   1   1   1  ...   1   1   1   1   1   1  -1   1   1  -1

3 rows × 144 columns
each column is a roll call (\(p = 144\) total)
each row is a representative (\(n = 430\) total)
1 is a “yes” vote; 0 is an abstention; -1 is a “no” vote
Clustering voting data
Question: Can we identify groups of representatives that voted similarly?
Can cluster the representatives according to roll call votes
But how many clusters to expect?
EDA with PCA
Projecting the data onto the first few principal components provides a way to visualize the data:
Code
pca = sm.PCA(votes)
alt.Chart(pca.scores).mark_circle(opacity = 0.5).encode(
    x = alt.X('comp_000', title = 'PC1'),
    y = alt.Y('comp_001', title = 'PC2')
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)
can see at least two clusters
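To gauge how much of the variation in the votes the two plotted components capture, the PCA eigenvalues can be converted to proportions of variance explained. A minimal sketch, assuming the statsmodels `pca` object above exposes an `eigenvals` attribute:
# proportion of variance explained by each component
var_ratios = pca.eigenvals / pca.eigenvals.sum()
var_ratios[:2]   # share captured by PC1 and PC2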
Clustering with \(K\)-means
The most widely used clustering method is known as \(K\)-means.
cluster labels are based on shortest Euclidean distance to one of \(K\) centers
centers are chosen to minimize the within-cluster sum of squared distances (the variance within each cluster)
Clustering with \(K\)-means
Technically, given \(n\) observations of \(p\) variables \(\mathbf{X} = \{x_{ij}\}\), the \(K\)-means problem is:
\[
\text{minimize}_{C_1, \dots, C_K} \left\{
\sum_{k = 1}^K |C_k|^{-1} \sum_{i, i' \in C_k} \sum_{j = 1}^p (x_{ij} - x_{i'j})^2
\right\}
\]
A local solution is found by starting with random cluster assignments and then iteratively (a sketch of one such iteration follows this list):
Update the cluster centers
Reassign the cluster labels
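A minimal NumPy sketch of the two steps above, run on synthetic data rather than the voting data; this is illustrative only, not the implementation used by scikit-learn:
import numpy as np

def lloyd_step(X, labels, K):
    # update step: each center is the mean of the points currently assigned to it
    centers = np.vstack([X[labels == k].mean(axis = 0) for k in range(K)])
    # reassignment step: each observation moves to its nearest center (Euclidean distance)
    dists = ((X[:, None, :] - centers[None, :, :])**2).sum(axis = 2)
    return dists.argmin(axis = 1), centers

# illustrative run on synthetic data with two well-separated groups
rng = np.random.default_rng(60623)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = rng.integers(0, 2, size = X.shape[0])   # random initial assignment
for _ in range(10):
    new_labels, centers = lloyd_step(X, labels, K = 2)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels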
Clustering voting data
The method is very easy to implement:
from sklearn.cluster import KMeans

# seed for reproducibility (cluster initialization is random)
np.random.seed(60623)

# fit K-means with two clusters and retrieve a cluster label for each representative
clust = KMeans(n_clusters = 2)
clust.fit(votes)
clust_labels = clust.predict(votes)
Cluster labels will be returned in the same order as the rows input to .predict()
Initialization is random, so solutions may differ from run to run (usually just permutes labels).
Clustering voting data
We could again visualize the clusters using PCA:
Code
plot_df = pca.scores.iloc[:, 0:2].copy()
plot_df['cluster'] = clust_labels
alt.Chart(plot_df).mark_circle(opacity = 0.5).encode(
    x = alt.X('comp_000', title = 'PC1'),
    y = alt.Y('comp_001', title = 'PC2'),
    color = 'cluster:N'
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)
Cluster composition by party
We could also cross-tabulate the cluster labels with party affiliations:
Code
label_df = pd.DataFrame({'cluster': clust_labels}, index = votes.index)
pd.merge(
    members, label_df, left_index = True, right_index = True
).groupby(
    ['current_party', 'cluster']
).size().reset_index().pivot(
    columns = 'current_party', index = 'cluster'
)
current_party  Democratic  Independent  Republican
cluster
0                   232.0          NaN         3.0
1                     NaN          1.0       194.0
Cluster composition by party
Who are the three Republicans that vote with the Democrats?
Code
members_labeled = pd.merge(members, label_df, left_index = True, right_index = True)
members_labeled[(members_labeled.cluster == 0) & (members_labeled.current_party == 'Republican')]
name_id
F000466   brian-fitzpatrick   Pennsylvania   https://www.congress.gov/member/brian-fitzpatr...   House   Republican   ['Foreign Affairs', 'Transportation and Infras...   0
K000386   john-katko          New York       https://www.congress.gov/member/john-katko/K00...    House   Republican   ['Homeland Security', 'Transportation and Infr...   0
S000522   christopher-smith   New Jersey     https://www.congress.gov/member/christopher-sm...    House   Republican   ['Foreign Affairs']   0
Further possible questions
We could use the same technique to explore a variety of additional questions:
identify voting blocs by issue or policy area (use a subset of columns)
find within-party voting blocs (increase \(K\); a sketch follows this list)
identify representatives that don’t tend to vote with others (assign a score based on how ‘quickly’ a representative is isolated)
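For instance, the second idea above could be sketched by restricting the vote matrix to one party and re-running \(K\)-means with a larger \(K\). This is a minimal, illustrative sketch assuming the `members`, `votes`, and `KMeans` objects from above; the choice of three clusters is arbitrary:
# restrict to Democrats and look for within-party blocs (illustrative)
dem_ids = members[members.current_party == 'Democratic'].index
dem_votes = votes.loc[votes.index.intersection(dem_ids)]
subclust = KMeans(n_clusters = 3, random_state = 60623)
sub_labels = subclust.fit_predict(dem_votes)
# size of each within-party bloc
pd.Series(sub_labels).value_counts()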