a one-unit increase in BMI from 26.45 (sample mean) for a 37.6 year old woman is associated with an estimated change in probability of diabetes of 0.0036
a one-unit increase in BMI from 27.45 (sample mean plus one) for a 37.6 year old woman is associated with an estimated change in probability of diabetes of 0.0039
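The two changes above differ because the probit link is nonlinear: the same one-unit step in BMI moves the estimated probability by a different amount depending on the starting point. A minimal sketch of this effect follows. The coefficients are made up for illustration (they are not the fitted estimates), and the standard library's `NormalDist` stands in for `scipy.stats.norm`.

```python
from statistics import NormalDist

# standard normal CDF; plays the same role as scipy.stats.norm.cdf
Phi = NormalDist().cdf

# hypothetical coefficients for illustration only (NOT the fitted estimates)
b0, b_age, b_bmi = -4.0, 0.02, 0.05

def prob(age, bmi):
    """Probability for a woman (male = 0) under the made-up coefficients."""
    return Phi(b0 + b_age * age + b_bmi * bmi)

# change in probability for a one-unit BMI increase from two starting points
diff_at_mean = prob(37.6, 27.45) - prob(37.6, 26.45)
diff_at_mean_plus_one = prob(37.6, 28.45) - prob(37.6, 27.45)

print(diff_at_mean, diff_at_mean_plus_one)
```

Because the linear predictor sits well below zero here, the normal density is still rising, so the second difference is slightly larger than the first, which matches the pattern in the two statements above.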
Centering for interpretability
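If explanatory variables are centered, then the change in estimated probability associated with a one-unit increase in variable $j$ from its mean, with the other variables held at their means (or reference levels), is:
$$\Phi(\hat{\beta}_0 + \hat{\beta}_j) - \Phi(\hat{\beta}_0)$$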
Refitting the model after centering age and BMI and computing the above yields:
Code
# center explanatory variables
x_vars_ctr = x_vars - x.mean()
x_ctr = pd.concat([x.loc[:, ['const', 'male']], x_vars_ctr.loc[:, ['Age', 'BMI']]], axis = 1)

# fit model
model_probit_ctr = sm.Probit(endog = y, exog = x_ctr)
fit_probit_ctr = model_probit_ctr.fit()

# baseline (positional access via .iloc, since params is indexed by variable name)
probit_baseline = norm.cdf(fit_probit_ctr.params.iloc[0])

# changes in estimated probabilities associated with one-unit change from mean, keeping other variables at mean/reference
prob_diffs = norm.cdf(fit_probit_ctr.params.iloc[1:4] + fit_probit_ctr.params.iloc[0]) - probit_baseline

# print
pd.DataFrame({'change in probability': prob_diffs}, index = np.array(['male', 'age', 'BMI']))
Optimization terminated successfully.
Current function value: 0.214265
Iterations 8
      change in probability
male               0.008659
age                0.001772
BMI                0.003587
Centering for interpretation
Now the coefficient interpretations are:
the estimated probability that a woman of average age and BMI has diabetes is 0.029
$\Phi(\hat{\beta}_0)$
among people of average age and BMI, men are more likely than women to be diabetic with an estimated difference in probability of 0.009
$0.008659 = \Phi(\hat{\beta}_0 + \hat{\beta}_1) - \Phi(\hat{\beta}_0)$
a one-year increase from the average age is associated with a change in the estimated probability that a woman of average BMI is diabetic of 0.002
$0.001772 = \Phi(\hat{\beta}_0 + \hat{\beta}_2) - \Phi(\hat{\beta}_0)$
a one-unit increase in BMI from the average is associated with a change in the estimated probability that a woman of average age is diabetic of 0.004
$0.003587 = \Phi(\hat{\beta}_0 + \hat{\beta}_3) - \Phi(\hat{\beta}_0)$
Un/Supervised problems
Regression and classification are known as ‘supervised’ problems:
the response variable/outcome is observed
the modeling of data is guided by observation
By contrast, in ‘unsupervised’ problems:
the response variable/outcome is not observed
no ground truth to guide/supervise the modeling process
Clustering
Clustering is the unsupervised version of classification:
Can we classify observations into two or more groups based on p variables without knowing the true grouping structure?
can think of this as modeling an unobserved response
however, the data need not contain true subpopulations; clustering is often useful simply as an exploratory technique for multimodal distributions
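To make "classifying without knowing the true grouping" concrete, here is a minimal sketch using k-means (Lloyd's algorithm). The text above does not name a particular clustering method, so k-means is only one common choice picked for illustration. The data are synthetic and the function `kmeans` is written here for the example, not taken from a library.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: two well-separated groups in p = 2 dimensions;
# the group membership is never passed to the algorithm
group_a = rng.normal(loc=[-3.0, -3.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
data = np.vstack([group_a, group_b])

def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm: alternate nearest-center assignment and center updates."""
    # start from k distinct observations chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every center, shape (n, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its points (keep it if the cluster is empty)
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

labels, centers = kmeans(data, k=2)
print(np.bincount(labels))
```

The cluster labels themselves (0 or 1) are arbitrary; only the grouping they induce is meaningful.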
Voting records, 116th House
Roll call votes of the 116th House of Representatives on bills and resolutions:
Code
members = pd.read_csv('data/members.csv').set_index('name_id')
votes = pd.read_csv('data/votes-clean.csv').set_index('name_id')
vote_info = pd.read_csv('data/votes-info.csv').set_index('rollcall_id')
votes.head(3)
name_id  2019:118  2019:217  2019:240  2019:134  2019:099  2019:184  2019:011  2019:067  2019:168  2019:557  ...  2019:610  2019:535  2019:627  2019:538  2019:569  2019:570  2019:583  2019:626  2019:592  2019:624
A000374        -1        -1        -1        -1        -1         0        -1         1         0         1  ...         1         0         1         0         1         1        -1         1        -1        -1
A000370         1         1         1         1         1         0         1         1         1         1  ...         1         1         1         1         1         1         1         1         1         1
A000055        -1        -1        -1        -1        -1        -1        -1         1         1         1  ...         1         1         1         1         1         1        -1         1         1        -1
3 rows × 144 columns
each column is a roll call (p=144 total)
each row is a representative (n=430 total)
1 is a “yes” vote; 0 is an abstention; -1 is a “no” vote
Clustering voting data
Question: Can we identify groups of representatives that voted similarly?
Can cluster the representatives according to roll call votes
But how many clusters to expect?
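One way to make "voted similarly" concrete is the share of roll calls on which two members cast the same vote. The sketch below computes this on a made-up vote matrix with the same 1 / 0 / -1 coding. The rows and votes are hypothetical, not taken from the House data, and this agreement measure is just one possible notion of similarity.

```python
import numpy as np

# hypothetical votes: 4 members (rows) by 5 roll calls (columns)
toy_votes = np.array([
    [ 1,  1, -1,  1,  0],   # member 0
    [ 1,  1, -1,  1,  1],   # member 1
    [-1, -1,  1, -1,  0],   # member 2
    [-1, -1,  1,  1, -1],   # member 3
])

# agreement[i, j] = share of roll calls on which members i and j cast the same vote
same_vote = toy_votes[:, None, :] == toy_votes[None, :, :]
agreement = same_vote.mean(axis=2)

print(agreement.round(2))
```

Members 0 and 1 agree on most roll calls, as do members 2 and 3, while agreement across those two pairs is low; this is the kind of structure a clustering method would try to recover.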
EDA with PCA
Projecting the data onto the first few principal components provides a way to visualize the data:
Code
pca = sm.PCA(votes)

alt.Chart(pca.scores).mark_circle(opacity = 0.5).encode(
    x = alt.X('comp_000', title = 'PC1'),
    y = alt.Y('comp_001', title = 'PC2')
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)
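For intuition about what the plotted scores are, the projection can be sketched with a plain SVD. This is an illustration on a made-up matrix, not a reproduction of `sm.PCA`, whose default preprocessing (for example, standardizing columns) may differ from the centering-only version here.

```python
import numpy as np

# hypothetical votes: 6 members by 4 roll calls, coded 1 / 0 / -1;
# the first three rows and the last three rows form two opposing blocs
toy = np.array([
    [ 1,  1, -1,  1],
    [ 1,  1, -1,  0],
    [ 1,  0, -1,  1],
    [-1, -1,  1, -1],
    [-1, -1,  1,  0],
    [-1,  0,  1, -1],
], dtype=float)

# center each column, then take the SVD of the centered matrix
centered = toy - toy.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# scores: coordinates of each row along the principal axes (rows of Vt)
scores = centered @ Vt.T
pc1, pc2 = scores[:, 0], scores[:, 1]

# share of total variance captured by each component
var_share = s**2 / np.sum(s**2)

print(pc1.round(2))
print(var_share.round(3))
```

In this toy example the two blocs land on opposite sides of zero along PC1, which is the kind of separation a scatterplot of PC1 against PC2 can reveal before any clustering is run.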