Clustering

PSTAT100 Spring 2023

Announcements

  • Lab 7 optional, turn in by end of day Friday 6/9
  • HW4 due Wednesday 6/7 with late submissions by Friday 6/9
  • Course project due Friday 6/16; groups preferred, max of 3
  • Extra OH Tuesday 1-3pm (scheduled final time)?

Last time

We fit the logistic regression model:

\[ \log\left(\frac{Pr(\text{diabetic}_i)}{1 - Pr(\text{diabetic}_i)}\right) = \beta_0 + \beta_1\text{age}_i + \beta_2\text{male}_i+\beta_3\text{BMI}_i \]

And discussed:

  • model specification
  • parameter estimation
  • parameter interpretation

Last time

The logistic regression model can be employed as a classifier by articulating a rule:

\[ \text{classify as diabetic} \quad\Longleftrightarrow\quad \widehat{Pr}(\text{diabetic}) > c \]

where \(\widehat{Pr}(\text{diabetic})\) is computed from the fitted logistic regression model.

Any classifier has some:

  • sensitivity (true positive rate)
  • specificity (true negative rate)

These rates vary depending on \(c\).
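A minimal sketch of computing both rates at a single cutoff, assuming observed labels y and fitted probabilities fitted_probs as in the ROC code below (c = 0.5 is an arbitrary illustration):

# classify at one cutoff
c = 0.5
y_hat = (fitted_probs > c).astype(int)

# sensitivity: proportion of actual positives classified as positive
sensitivity = y_hat[y == 1].mean()

# specificity: proportion of actual negatives classified as negative
specificity = 1 - y_hat[y == 0].mean()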

ROC Curves

A plot of sensitivity (true positive rate) against 1 - specificity (false positive rate) across all unique classification thresholds is known as a receiver operating characteristic (ROC) curve.

Code
# compute false positive rate and true positive rate at every threshold
from sklearn import metrics
fpr, tpr, thresh = metrics.roc_curve(y, fitted_probs)

roc = pd.DataFrame({
    'fpr': fpr,
    'tpr': tpr,
    'thresh': thresh
})

# two candidate thresholds: maximum of tpr - fpr (Youden's J) and point closest to (0, 1)
roc_opt_ix = [(roc.tpr - roc.fpr).argmax(), ((1 - roc.tpr)**2 + roc.fpr**2).argmin()]
roc_opt = roc.loc[roc_opt_ix]

roc_plot = alt.Chart(roc).mark_line().encode(
    x = alt.X('fpr', title = '1 - specificity'),
    y = alt.Y('tpr', title = 'sensitivity')
)

roc_pts = alt.Chart(roc_opt).mark_circle(
    fill = 'red',
    size = 100
).encode(
    x = 'fpr',
    y = 'tpr'
)

(roc_plot + roc_pts).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
  • closer to the upper left corner \(\longrightarrow\) less trade-off \(\longrightarrow\) better classifier
  • area under the curve (AUC) often used as an accuracy metric (see the sketch after this list)
  • two common choices for \(c\) highlighted:
    • the point closest to the upper left corner
    • the point that maximizes sensitivity + specificity (sometimes called Youden’s J statistic)
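For reference, the AUC can be computed from the same inputs used for the ROC curve; a sketch using sklearn's built-in roc_auc_score:

from sklearn import metrics

# area under the ROC curve: 1 is perfect, 0.5 is chance-level
auc = metrics.roc_auc_score(y, fitted_probs)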

Probit regression model

Logistic regression is not the only regression model for binary data. One common alternative is probit regression, where:

\[ P(Y = 1) = \Phi\left(x'\beta\right) \]

  • \(\Phi\) is the standard normal CDF
  • similar assumptions to logistic regression – independence and monotonicity

Probit v. logit

The logistic function has heavier tails.

Code
from scipy.stats import norm
# grid of values
vals = pd.DataFrame({
    'x': np.linspace(-5, 5, 200)
})

# compute probit and logit
vals['logit'] = 1/(1 + np.exp(-vals.x))
vals['probit'] = norm.cdf(vals.x)
vals = vals.melt(id_vars = 'x', var_name = 'model', value_name = 'pr')

# plot
alt.Chart(vals).mark_line().encode(
    x = alt.X('x', title = 'x*beta'),
    y = alt.Y('pr', title = 'Pr(Y = 1)'),
    color = 'model',
    stroke = 'model'
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)

Probit fit to diabetes data

Here are the estimates from the probit model:

Code
# fit model
model_probit = sm.Probit(endog = y, exog = x)
fit_probit = model_probit.fit()

# parameter estimates
coef_tbl_probit = pd.DataFrame({
    'estimate': fit_probit.params,
    'standard error': np.sqrt(fit_probit.cov_params().values.diagonal())},
    index = x.columns
)
coef_tbl_probit
Optimization terminated successfully.
         Current function value: 0.214265
         Iterations 8
           estimate  standard error
const     -4.215812        0.172291
male       0.116145        0.061036
Age        0.025876        0.001733
BMI        0.051143        0.004201

The coefficients are trickier to interpret directly, since \(\Phi^{-1}(p)\) is not a natural quantity.

Challenges of interpretation

The effect of incremental changes in explanatory variables on predicted probabilities depends on your starting point.

# means of explanatory variables
x1 = x.mean()

# increment BMI twice (column order: const, male, Age, BMI)
x2 = x1 + np.array([0, 0, 0, 1])
x3 = x2 + np.array([0, 0, 0, 1])
x_pred = pd.concat([x1, x2, x3], axis = 1).T

# set male = 0 so predictions are for women
x_pred['male'] = np.array([0, 0, 0])

# compute predictions and differences in probability
preds = fit_probit.predict(x_pred)
preds.diff()
0         NaN
1    0.003587
2    0.003941
dtype: float64
  • a one-unit increase in BMI from 26.45 (the sample mean) for a 37.6-year-old woman is associated with an estimated change in the probability of diabetes of 0.0036
  • a one-unit increase in BMI from 27.45 (the sample mean plus one) for a 37.6-year-old woman is associated with an estimated change in the probability of diabetes of 0.0039

Centering for interpretability

If explanatory variables are centered, then the change in estimated probability associated with a 1-unit change from the mean (and reference level(s)) is:

\[ \Phi\left(\hat{\beta}_0 + \hat{\beta}_j\right) - \Phi\left(\hat{\beta}_0\right) \]

Refitting the model after centering age and BMI and computing the above yields:

Code
# center the quantitative explanatory variables (age and BMI)
x_ctr = x.copy()
x_ctr[['Age', 'BMI']] = x[['Age', 'BMI']] - x[['Age', 'BMI']].mean()

# fit model
model_probit_ctr = sm.Probit(endog = y, exog = x_ctr)
fit_probit_ctr = model_probit_ctr.fit()

# baseline: estimated probability for a woman of average age and BMI
probit_baseline = norm.cdf(fit_probit_ctr.params.iloc[0])

# changes in estimated probabilities associated with a one-unit change from the mean, keeping other variables at mean/reference
prob_diffs = norm.cdf(fit_probit_ctr.params.iloc[1:4] + fit_probit_ctr.params.iloc[0]) - probit_baseline

# print
pd.DataFrame({'change in probability': prob_diffs}, index = np.array(['male', 'age', 'BMI']))
Optimization terminated successfully.
         Current function value: 0.214265
         Iterations 8
      change in probability
male               0.008659
age                0.001772
BMI                0.003587

Centering for interpretation

Now the coefficient interpretations are:

  • the estimated probability that a woman of average age and BMI has diabetes is 0.029
    • \(\Phi(\hat{\beta}_0)\)
  • among people of average age and BMI, men are more likely than women to be diabetic with an estimated difference in probability of 0.009
    • \(0.008659 = \Phi(\hat{\beta}_0 + \hat{\beta}_1) - \Phi(\hat{\beta}_0)\)
  • a one-year increase from the average age is associated with a change in the estimated probability that a woman of average BMI is diabetic of 0.002
    • \(0.001772 = \Phi(\hat{\beta}_0 + \hat{\beta}_2) - \Phi(\hat{\beta}_0)\)
  • a one-unit increase in BMI from the average is associated with a change in the estimated probability that a woman of average age is diabetic of 0.004
    • \(0.003587 = \Phi(\hat{\beta}_0 + \hat{\beta}_3) - \Phi(\hat{\beta}_0)\)

Un/Supervised problems

Regression and classification are known as ‘supervised’ problems:

  • the response variable/outcome is observed
  • model fitting is guided by the observed responses

By contrast, in ‘unsupervised’ problems:

  • the response variable/outcome is not observed
  • no ground truth to guide/supervise the modeling process

Clustering

Clustering is the unsupervised version of classification:

Can we classify observations into two or more groups based on \(p\) variables without knowing the true grouping structure?

  • can think of this as modeling an unobserved response
  • however, it is not necessary that distinct subpopulations actually exist in the data – clustering is often useful as an exploratory technique for detecting multimodal distributions

Voting records, 116th House

Roll call votes of the 116th House of Representatives on bills and resolutions:

Code
members = pd.read_csv('data/members.csv').set_index('name_id')
votes = pd.read_csv('data/votes-clean.csv').set_index('name_id')
vote_info = pd.read_csv('data/votes-info.csv').set_index('rollcall_id')

votes.head(3)
          2019:118  2019:217  2019:240  2019:134  ...  2019:626  2019:592  2019:624
name_id
A000374         -1        -1        -1        -1  ...         1        -1        -1
A000370          1         1         1         1  ...         1         1         1
A000055         -1        -1        -1        -1  ...         1         1        -1

3 rows × 144 columns

  • each column is a roll call (\(p = 144\) total)
  • each row is a representative (\(n = 430\) total)
  • 1 is a “yes” vote; 0 is an abstention; -1 is a “no” vote (tallied below as a check)
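A quick way to verify this coding, assuming votes as loaded above:

# flatten the vote matrix to one long vector and count each code
votes.stack().value_counts()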

Clustering voting data

Question: Can we identify groups of representatives that voted similarly?

  • Can cluster the representatives according to roll call votes
  • But how many clusters to expect?

EDA with PCA

Projecting the data onto the first few principal components provides a way to visualize the data:

Code
pca = sm.PCA(votes)
alt.Chart(pca.scores).mark_circle(opacity = 0.5).encode(
    x = alt.X('comp_000', title = 'PC1'),
    y = alt.Y('comp_001', title = 'PC2')
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)
  • can see at least two clusters

Clustering with \(K\)-means

The most widely used clustering method is known as \(K\)-means.

  • cluster labels are based on shortest Euclidean distance to one of \(K\) centers
  • centers are found by minimizing the variance within each cluster

Clustering with \(K\)-means

Technically, given \(n\) observations of \(p\) variables \(\mathbf{X} = \{x_{ij}\}\), the \(k\)-means problem is:

\[ \text{minimize}_{C_1, \dots, C_K} \left\{ \sum_{k = 1}^K |C_k|^{-1} \sum_{i, i' \in C_k} \sum_{j = 1}^p (x_{ij} - x_{i'j})^2 \right\} \]

A local solution is found by starting with random cluster assignments and then iterating between two steps (see the sketch after this list):

  1. Update the cluster centers
  2. Reassign the cluster labels
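A minimal sketch of this iteration in plain numpy (a hypothetical helper for illustration, not the implementation used on the next slide; assumes a data matrix X with observations in rows and ignores the possibility of empty clusters):

import numpy as np

def kmeans_sketch(X, K, n_iter = 20, seed = 0):
    rng = np.random.default_rng(seed)
    # start from random cluster assignments
    labels = rng.integers(K, size = X.shape[0])
    for _ in range(n_iter):
        # 1. update the cluster centers (mean of each cluster)
        centers = np.stack([X[labels == k].mean(axis = 0) for k in range(K)])
        # 2. reassign labels by shortest Euclidean distance to a center
        dists = ((X[:, None, :] - centers[None, :, :])**2).sum(axis = 2)
        labels = dists.argmin(axis = 1)
    return labels, centers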

Clustering voting data

The method is very easy to implement:

from sklearn.cluster import KMeans
np.random.seed(60623)

clust = KMeans(n_clusters = 2)
clust.fit(votes)
clust_labels = clust.predict(votes)

Cluster labels are returned in the same order as the rows passed to .predict().

Initialization is random, so solutions may differ from run to run (usually just permutes labels).
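To make a single run reproducible without relying on the global numpy seed, the seed can be passed to the estimator itself (a minor variation on the call above; random_state and n_init are standard KMeans parameters):

clust = KMeans(n_clusters = 2, random_state = 60623, n_init = 10)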

Clustering voting data

We could again visualize the clusters using PCA:

Code
plot_df = pca.scores.iloc[:, 0:2].copy()
plot_df['cluster'] = clust_labels

alt.Chart(plot_df).mark_circle(opacity = 0.5).encode(
    x = alt.X('comp_000', title = 'PC1'),
    y = alt.Y('comp_001', title = 'PC2'),
    color = 'cluster:N'
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)

Cluster composition by party

We could also cross-tabulate the cluster labels with party affiliations:

Code
label_df = pd.DataFrame({'cluster': clust_labels}, index = votes.index)

(
    pd.merge(members, label_df, left_index = True, right_index = True)
    .groupby(['current_party', 'cluster'])
    .size()
    .reset_index()
    .pivot(columns = 'current_party', index = 'cluster')
)
current_party  Democratic  Independent  Republican
cluster
0                   232.0          NaN         3.0
1                     NaN          1.0       194.0

Cluster composition by party

Who are the three Republicans who voted with the Democrats?

Code
members_labeled = pd.merge(members, label_df, left_index = True, right_index = True)

members_labeled[(members_labeled.cluster == 0) & (members_labeled.current_party == 'Republican')]
                      name         state                                                 url chamber current_party                              committee_assignments  cluster
name_id
F000466  brian-fitzpatrick  Pennsylvania  https://www.congress.gov/member/brian-fitzpatr...   House    Republican  ['Foreign Affairs', 'Transportation and Infras...        0
K000386         john-katko      New York  https://www.congress.gov/member/john-katko/K00...   House    Republican  ['Homeland Security', 'Transportation and Infr...        0
S000522  christopher-smith    New Jersey  https://www.congress.gov/member/christopher-sm...   House    Republican                                ['Foreign Affairs']        0

Further possible questions

We could use the same technique to explore a variety of additional questions:

  • identify voting blocs by issue or policy area (use a subset of columns)
  • find within-party voting blocs (increase \(K\))
  • identify representatives that don’t tend to vote with others (e.g., score each representative by how ‘quickly’ they become isolated in their own cluster as \(K\) increases)