# Initialize Otter
import otter
grader = otter.Notebook("lab5-pca.ipynb")
import numpy as np
import pandas as pd
import altair as alt
from scipy import linalg
from statsmodels.multivariate.pca import PCA
# disable row limit for plotting
alt.data_transformers.disable_max_rows()

# uncomment to ensure graphics display with pdf export
# alt.renderers.enable('mimetype')
Lab 5: Principal components
Principal components analysis (PCA) is a widely-used multivariate analysis technique. Depending on the application, PCA is variously described as:
- a dimension reduction method
- an approximation method
- a latent factor model
- a filtering or compression method
The core technique of PCA is finding linear data transformations that preserve variance.
What does it mean to say that ‘principal components are linear data transformations’? Suppose you have a dataset with \(n\) observations and \(p\) variables. We can represent the values as a data matrix \(\mathbf{X}\) with \(n\) rows and \(p\) columns:
\[ \mathbf{X} = \underbrace{\left[\begin{array}{cccc} \mathbf{x}_1 &\mathbf{x}_2 &\cdots &\mathbf{x}_p \end{array}\right]}_{\text{column vectors}} = \left[\begin{array}{cccc} x_{11} &x_{12} &\cdots &x_{1p} \\ x_{21} &x_{22} &\cdots &x_{2p} \\ \vdots &\vdots &\ddots &\vdots \\ x_{n1} &x_{n2} &\cdots &x_{np} \end{array}\right] \]
To say that the principal components are linear data transformations means that each principal component is of the form:
\[ \text{PC} = \mathbf{Xv} = v_1 \mathbf{x}_1 + v_2 \mathbf{x}_2 + \cdots + v_p \mathbf{x}_p \]
for some vector \(\mathbf{v}\). In PCA, the following terminology is used:
- linear combination coefficients \(v_j\) are known as loadings
- values of the linear combinations are known as scores
- the vector of loadings \(\mathbf{v}\) is known as a principal axis
As discussed in lecture, the values of the loadings are found by decomposing the correlation structure.
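To make the linear combination concrete, here is a minimal sketch with a tiny made-up data matrix and an arbitrary weight vector (purely illustrative; an actual principal axis would be derived from the correlation structure, as discussed in lecture):
# tiny made-up data matrix: n = 3 observations, p = 2 variables
X_toy = np.array([[1.0, 2.0],
                  [3.0, 5.0],
                  [4.0, 1.0]])

# an arbitrary weight vector standing in for a principal axis
v_toy = np.array([0.6, 0.8])

# the linear combination X v returns one value (a score) per observation
X_toy @ v_toy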
Objectives
In this lab, you’ll focus on computing and interpreting principal components:
- finding the loadings (linear combination coefficients) for each PC;
- quantifying the variation captured by each PC;
- visualization-based techniques for selecting a number of PCs to analyze;
- plotting and interpreting loadings.
You’ll work with a selection of county summaries from the 2010 U.S. census. The first few rows of the dataset are shown below:
# import tidy county-level 2010 census data
census = pd.read_csv('data/census2010.csv', encoding = 'latin1')
census.head()
The observational units are U.S. counties, and each row is an observation on one county. The values are, for the most part, percentages of the county population. You can find variable descriptions in the metadata file census2010metadata.csv in the data directory.
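If you prefer to view the variable descriptions without leaving the notebook, you can read the metadata file in directly. This is just a convenience sketch; the exact column layout of that file may differ:
# peek at the variable descriptions (same data directory as above)
# add encoding = 'latin1' if a decoding error occurs, as with the data file
pd.read_csv('data/census2010metadata.csv').head()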
Correlations
PCA identifies variable combinations that capture covariation by decomposing the correlation matrix. So, to start with, let’s examine the correlation matrix for the 2010 county-level census data to get a sense of which variables tend to vary together.
The correlation matrix is a matrix of all pairwise correlations between variables. If \(x_{ij}\) denotes the value for the \(i\)th observation of variable \(j\), then the entry at row \(j\) and column \(k\) of the correlation matrix \(\mathbf{R}\) is:
\[r_{jk} = \frac{\sum_i (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{S_j S_k}\]
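As a quick sanity check of this formula (a side calculation, not part of the lab), the sketch below computes \(r_{jk}\) by hand for two census variables and compares the result with pandas; here \(S_j\) is interpreted as the root of the summed squared deviations, so no additional \(1/(n-1)\) factor is needed:
# correlation by hand for two columns, compared with pandas
xj = census['Employed'] - census['Employed'].mean()
xk = census['IncomePerCap'] - census['IncomePerCap'].mean()
r_jk = (xj*xk).sum()/np.sqrt((xj**2).sum()*(xk**2).sum())
print(r_jk, census['Employed'].corr(census['IncomePerCap']))  # the two values should agree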
In the census data, the State and County columns indicate the geographic region for each observation; essentially, they are a row index. So we'll drop them before computing the matrix \(\mathbf{R}\):
# store quantitative variables separately
x_mx = census.drop(columns = ['State', 'County'])
From here, the matrix is simple to compute in pandas using .corr():
# correlation matrix
corr_mx = x_mx.corr()
The matrix can be inspected directly to determine which variables vary together. For example, we could look at the correlations between employment rate and every other variable in the dataset by extracting the Employed column from the correlation matrix and sorting the correlations:
# correlation between employment rate and other variables
corr_mx.loc[:, 'Employed'].sort_values()
Recall that correlation is a number in the interval [-1, 1] whose magnitude indicates the strength of the linear relationship between variables:
- correlations near -1 are strongly negative, and mean that the variables tend to vary in opposition
- correlations near 1 are strongly positive, and mean that the variables tend to vary together
From examining the output above, it can be seen that the percentage of the county population that is employed is:
- strongly negatively correlated with child poverty, poverty, and unemployment, meaning it tends to vary in opposition with these variables
- strongly positively correlated with income per capita, meaning it tends to vary together with this variable
If instead we wanted to look up the correlation between just two variables, we could retrieve the relevant entry directly using .loc[...] with the variable names:
# correlation between employment and income per capita
corr_mx.loc['Employed', 'IncomePerCap']
So across U.S. counties employment is, perhaps unsurprisingly, strongly and positively correlated with income per capita, meaning that higher employment rates tend to coincide with higher incomes per capita.
Question 1
Find the correlation between the poverty rate and demographic minority rate and store the value as pov_dem_rate. Interpret the value in context.
Type your answer here, replacing this text.
# correlation between poverty and percent minority
pov_dem_rate = ...
# print
pov_dem_rate
"q1") grader.check(
While direct inspection is useful, it can be cumbersome to check correlations for a large number of variables this way. A heatmap – a colored image of the matrix – provides a (sometimes) convenient way to see what's going on without having to examine the numerical values directly. The cell below shows one way of constructing this plot. Notice the diverging color scale; a diverging scale should always be used when plotting correlations, since they range from negative to positive through a meaningful zero.
# melt corr_mx
corr_mx_long = corr_mx.reset_index().rename(
    columns = {'index': 'row'}
).melt(
    id_vars = 'row',
    var_name = 'col',
    value_name = 'Correlation'
)

# construct plot
alt.Chart(corr_mx_long).mark_rect().encode(
    x = alt.X('col', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
    y = alt.Y('row', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
    color = alt.Color('Correlation',
                      scale = alt.Scale(scheme = 'blueorange', # diverging gradient
                                        domain = (-1, 1), # ensure white = 0
                                        type = 'sqrt'), # adjust gradient scale
                      legend = alt.Legend(tickCount = 5)) # add ticks to colorbar at 0.5 for reference
).properties(width = 300, height = 300)
Question 2
Which variable is the self-employment rate most positively correlated with? Refer to the heatmap.
Type your answer here, replacing this text.
Computing principal components
Each principal component is of the form:
\[\text{PC}_{i} = \sum_j v_j x_{ij} \quad(\text{PC score for observation } i)\]
The loadings \(v_j\) for each component indicate which variables are most influential (heavily weighted) on that principal axis, and thus offer an indirect picture of which variables drive variation and covariation in the original data.
Loadings and scores
In statsmodels, the module multivariate.pca contains an easy-to-use implementation.
# compute principal components
pca = PCA(data = x_mx, standardize = True)
Most quantities you might want to use in PCA can be retrieved as attributes of pca. In particular:
- .loadings contains the loadings
- .scores contains the scores
- .eigenvals contains the variances along each principal axis (see lecture notes)
Examine the loadings below. Each column gives the loadings for one principal component; components are ordered from largest to smallest variance.
# inspect loadings
pca.loadings
Similarly, inspect the scores below and check your understanding; each row is an observation and the columns give the scores on each principal axis.
# inspect scores
pca.scores
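To connect the scores back to the formula \(\text{PC}_i = \sum_j v_j x_{ij}\), here is a hedged side check (not part of the lab) that each score column is, up to the rescaling discussed next and an arbitrary sign, a linear combination of the standardized variables weighted by the loadings:
# standardize the data (center and scale), then form the first linear combination by hand
x_std = (x_mx - x_mx.mean())/x_mx.std()
manual_pc1 = x_std @ pca.loadings.iloc[:, 0]

# compare with the first score column; expect a correlation of +1 or -1 (signs of axes are arbitrary)
np.corrcoef(manual_pc1, pca.scores.iloc[:, 0])[0, 1]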
Importantly, statsmodels rescales the scores so that they have unit inner product; in other words, so that the variances are all \(\frac{1}{n - 1}\).
# variance of scores
pca.scores.var()
# for comparison
1/(x_mx.shape[0] - 1)
To change this behavior, set normalize = False when computing the principal components.
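If you want to see the difference, the sketch below refits the PCA with normalize = False (a side comparison, not required for the lab); the score variances then differ from component to component instead of all equaling \(\frac{1}{n-1}\):
# refit without rescaling the scores
pca_unnorm = PCA(data = x_mx, standardize = True, normalize = False)

# score variances now vary by component
pca_unnorm.scores.var().head(3)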
Question 3
Check your understanding. Which variable contributes most to the sixth principal component? Store the variable name exactly as it appears among the original column names as pc6_most_influential_variable, and store the corresponding loading as pc6_most_influential_variable_loading. Print the variable name.
# find most influential variable
pc6_most_influential_variable = ...

# find loading
pc6_most_influential_variable_loading = ...

# print
...

grader.check("q3")
Variance ratios
The variance ratios indicate the proportions of total variance in the data captured by each principal axis. You may recall from lecture that the variance ratios are computed from the eigenvalues of the correlation (or covariance, if data are not standardized) matrix.
When using statsmodels, these need to be computed manually.
# compute variance ratios
var_ratios = pca.eigenvals/pca.eigenvals.sum()
# print
var_ratios
Note again that the principal components have been computed in order of decreasing variance.
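As a side check on the connection to the correlation matrix mentioned above (not part of the lab), the same ratios can be recovered by decomposing corr_mx directly with scipy.linalg, which was imported at the top of the notebook:
# eigenvalues of the correlation matrix, sorted from largest to smallest
eig_corr = np.sort(linalg.eigvalsh(corr_mx.values))[::-1]

# the ratios should match var_ratios up to numerical precision, since ratios are scale-invariant
np.allclose(eig_corr/eig_corr.sum(), np.asarray(var_ratios))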
Question 4
Check your understanding. What proportion of variance is captured jointly by the first three components taken together? Provide a calculation to justify your answer.
Type your answer here, replacing this text.
...
Selecting a subset of PCs
PCA generally consists of choosing a small subset of components. The basic strategy for selecting this subset is to determine how many are needed to capture some analyst-chosen minimum portion of total variance in the original data.
Most often this assessment is made graphically by inspecting the variance ratios and their cumulative sum, i.e., the amount of total variation captured jointly by subsets of successive components. We’ll store these quantities in a data frame.
# store proportion of variance explained as a dataframe
pca_var_explained = pd.DataFrame({
    'Component': np.arange(1, 23),
    'Proportion of variance explained': var_ratios})

# add cumulative sum
pca_var_explained['Cumulative variance explained'] = var_ratios.cumsum()
# print
pca_var_explained.head()
Now we'll make a dual-axis plot showing, on one side, the proportion of variance explained (y) as a function of component (x), and on the other side, the cumulative variance explained (y), also as a function of component (x). Make sure the cells above have run successfully before running the next cell.
# encode component axis only as base layer
base = alt.Chart(pca_var_explained).encode(
    x = 'Component')

# make a base layer for the proportion of variance explained
prop_var_base = base.encode(
    y = alt.Y('Proportion of variance explained',
              axis = alt.Axis(titleColor = '#57A44C'))
)

# make a base layer for the cumulative variance explained
cum_var_base = base.encode(
    y = alt.Y('Cumulative variance explained', axis = alt.Axis(titleColor = '#5276A7'))
)

# add points and lines to each base layer
prop_var = prop_var_base.mark_line(stroke = '#57A44C') + prop_var_base.mark_point(color = '#57A44C')
cum_var = cum_var_base.mark_line() + cum_var_base.mark_point()

# layer the layers
var_explained_plot = alt.layer(prop_var, cum_var).resolve_scale(y = 'independent')

# display
var_explained_plot
The purpose of making this plot is to quickly determine the smallest number of principal components that capture a considerable portion of variation and covariation. 'Considerable' here is a bit subjective.
Question 5
How many principal components explain more than 6% of total variation individually? Store this number as num_pc, and store the proportion of variation that they capture jointly as var_explained.
# number of selected components
num_pc = ...

# variance explained
var_explained = ...

# print
print('number selected: ', num_pc)
print('proportion of variance captured: ', var_explained)

grader.check("q5")
Interpreting loadings
Now that you've chosen the number of components to work with, the next step is to examine the loadings to see which variables each component weights most heavily.
We'll store the loadings for the components you selected as a dataframe.
# subset loadings
loading_df = pca.loadings.iloc[:, 0:num_pc]

# rename columns
loading_df = loading_df.rename(columns = dict(zip(loading_df.columns, ['PC' + str(i) for i in range(1, num_pc + 1)])))
# print
loading_df.head()
Again, the loadings are the weights with which the variables are combined to form the principal components. For example, the PC1 column tells us that this component is equal to:
\[(-0.020055\times\text{women}) + (0.289614\times\text{white}) + (0.050698\times\text{citizen}) + \dots\]
Since the components together capture over half the total variation, the heavily weighted variables in the selected components are the ones that drive variation in the original data.
By visualizing the loadings, we can see which variables are most influential for each component, and thereby also which variables seem to drive total variation in the data.
# melt from wide to long
loading_plot_df = loading_df.reset_index().melt(
    id_vars = 'index',
    var_name = 'Principal Component',
    value_name = 'Loading'
).rename(columns = {'index': 'Variable'})

# add a column of zeros to encode for x = 0 line to plot
loading_plot_df['zero'] = np.repeat(0, len(loading_plot_df))

# create base layer
base = alt.Chart(loading_plot_df)

# create lines + points for loadings
loadings = base.mark_line(point = True).encode(
    y = alt.Y('Variable', title = ''),
    x = 'Loading',
    color = 'Principal Component'
)

# create line at zero
rule = base.mark_rule().encode(x = alt.X('zero', title = 'Loading'), size = alt.value(0.05))

# layer
loading_plot = (loadings + rule).properties(width = 120)

# show
loading_plot.facet(column = alt.Column('Principal Component', title = ''))
Look first at PC1: the variables with the largest loadings (points farthest in either direction from the zero line) are Child Poverty (positive), Employed (negative), Income per capita (negative), Poverty (positive), and Unemployment (positive). We know from exploring the correlation matrix that employment rate, unemployment rate, and income per capita are all related, and similarly child poverty rate and poverty rate are related. Therefore, the positively-loaded variables are all measuring more or less the same thing, and likewise for the negatively-loaded variables.
Essentially, then, PC1 is predominantly (but not entirely) a representation of income and poverty. In particular, counties have a higher value for PC1 if they have lower-than-average income per capita and higher-than-average poverty rates, and a smaller value for PC1 if they have higher-than-average income per capita and lower-than-average poverty rates.
A system for loading interpretation
Often interpreting principal components can be difficult, and sometimes there’s no clear interpretation available! That said, it helps to have a system instead of staring at the plot and scratching our heads. Here is a semi-systematic approach to interpreting loadings:
- Divert your attention away from the zero line.
- Find the largest positive loading, and list all variables with similar loadings.
- Find the largest negative loading, and list all variables with similar loadings.
- The principal component represents the difference between the average of the first set and the average of the second set.
- Try to come up with a description of fewer than four words.
This system is based on the following ideas:
- a high loading value (negative or positive) indicates that a variable strongly influences the principal component;
- a negative loading value indicates that
  - increases in the value of a variable decrease the value of the principal component
  - and decreases in the value of a variable increase the value of the principal component;
- a positive loading value indicates that
  - increases in the value of a variable increase the value of the principal component
  - and decreases in the value of a variable decrease the value of the principal component;
- similar loadings between two or more variables indicate that the principal component reflects their average;
- divergent loadings between two sets of variables indicate that the principal component reflects their difference.
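One way to carry out steps 2 and 3 without reading values off the plot is to sort the loadings for the component of interest; a small sketch using PC1 from loading_df above:
# largest negative loadings first, largest positive loadings last
loading_df['PC1'].sort_values()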
Question 6
Work with your neighbor to interpret PC2. Come up with a name for the component and explain which variables are most influential.
Type your answer here, replacing this text.
Standardization
Data are typically standardized because otherwise the variables on the largest scales tend to dominate the principal components, and most of the time PC1 will capture the majority of the variation. However, that dominance is an artifact of the measurement scales rather than of genuine covariation. In the census data, income per capita has the largest magnitudes and thus the highest variance.
# three largest variances
x_mx.var().sort_values(ascending = False).head(3)
When PCs are computed without normalization, the total variation is mostly just the variance of income per capita because it is orders of magnitude larger than the variance of any other variable. But that’s just because of the scale of the variable – incomes per capita are large numbers – not a reflection that it varies more or less than the other variables.
Run the cell below to see what happens to the variance ratios if the data are not normalized.
# recompute pcs without normalization
pca_unscaled = PCA(data = x_mx, standardize = False)

# show variance ratios for first three pcs
pca_unscaled.eigenvals[0:3]/pca_unscaled.eigenvals.sum()
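For comparison, the share of total raw variance contributed by income per capita alone can be computed directly (a quick side calculation on the unstandardized data):
# proportion of total (unstandardized) variance due to income per capita
x_mx.var()['IncomePerCap']/x_mx.var().sum()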
Further, let’s look at the loadings when data are not standardized:
# subset loadings
unscaled_loading_df = pca_unscaled.loadings.iloc[:, 0:2]

# rename columns
unscaled_loading_df = unscaled_loading_df.rename(
    columns = dict(zip(unscaled_loading_df.columns, ['PC' + str(i) for i in range(1, 3)]))
)

# melt from wide to long
unscaled_loading_plot_df = unscaled_loading_df.reset_index().melt(
    id_vars = 'index',
    var_name = 'Principal Component',
    value_name = 'Loading'
).rename(
    columns = {'index': 'Variable'}
)

# add a column of zeros to encode for x = 0 line to plot
unscaled_loading_plot_df['zero'] = np.repeat(0, len(unscaled_loading_plot_df))

# create base layer
base = alt.Chart(unscaled_loading_plot_df)

# create lines + points for loadings
loadings = base.mark_line(point = True).encode(
    y = alt.Y('Variable', title = ''),
    x = 'Loading',
    color = 'Principal Component'
)

# create line at zero
rule = base.mark_rule().encode(x = alt.X('zero', title = 'Loading'), size = alt.value(0.05))

# layer
loading_plot = (loadings + rule).properties(width = 120, title = 'Loadings from unscaled PCA')

# show
loading_plot.facet(column = alt.Column('Principal Component', title = ''))
Notice that the variables with nonzero loadings in unscaled PCA are simply the three variables with the largest variances.
# three largest variances
x_mx.var().sort_values(ascending = False).head(3)
Exploratory analysis based on PCA
Now that we have the principal components, we can use them for exploratory data visualizations. To this end, let’s retrieve the scores from the components you selected:
# subset scores
score_df = pca.scores.iloc[:, 0:num_pc]

# rename columns
score_df = score_df.rename(
    columns = dict(zip(score_df.columns, ['PC' + str(i) for i in range(1, num_pc + 1)]))
)

# add state and county
score_df[['State', 'County']] = census[['State', 'County']]
# print
score_df.head()
The PCs can be used to construct scatterplots of the data and search for patterns. We'll illustrate that by identifying some outliers. The cell below plots PC2 (employment type) against PC4 (carpooling?):
# base chart
base = alt.Chart(score_df)

# data scatter
scatter = base.mark_point(opacity = 0.2).encode(
    x = alt.X('PC2:Q', title = 'Self-employment PC'),
    y = alt.Y('PC4:Q', title = 'Carpooling PC')
)
# show
scatter
Notice that there are a handful of outlying points in the upper right region away from the dense scatter. What are those?
In order to inspect the outlying counties, we first need to figure out how to identify them. The outlying values have a large sum of PC2 and PC4. We can distinguish them by finding a cutoff value for the sum; a simple quantile will do.
# find cutoff value
pc2_pc4_sum = score_df.PC2 + score_df.PC4
cutoff = pc2_pc4_sum.quantile(0.99999)

# store outlying rows using cutoff
outliers = score_df[(score_df.PC2 + score_df.PC4) > cutoff]
# plot outliers in red
pts = alt.Chart(outliers).mark_circle(
    color = 'red',
    opacity = 0.3
).encode(
    x = 'PC2',
    y = 'PC4'
)
# layer
scatter + pts
Notice that almost all the outlying counties are remote regions of Alaska:
outliers
What sets them apart? The cell below retrieves the normalized data and county name for the outlying rows, and then plots the standardized values of each variable for all 9 counties as vertical ticks, along with a point indicating the mean for the outlying counties. This plot can be used to determine which variables are above or below the national average for the outlying counties by locating means that are far from zero in either direction.
# standardize the full dataset (center and scale each variable)
x_ctr = (x_mx - x_mx.mean())/x_mx.std()

# retrieve normalized data for outlying rows
outlier_data = x_ctr.loc[outliers.index.values].join(
    census.loc[outliers.index.values, ['County']]
)
# melt to long format for plotting
outlier_plot_df = outlier_data.melt(
    id_vars = 'County',
    var_name = 'Variable',
    value_name = 'Standardized value'
)
# plot ticks for values (x) for each variable (y)
ticks = alt.Chart(outlier_plot_df).mark_tick().encode(
    x = 'Standardized value',
    y = 'Variable'
)
# shade out region within 3SD of mean
grey = alt.Chart(
    pd.DataFrame({'Variable': x_ctr.columns,
                  'upr': np.repeat(3, 22),
                  'lwr': np.repeat(-3, 22)}
    )
).mark_area(opacity = 0.2, color = 'gray').encode(
    y = 'Variable',
    x = alt.X('upr', title = 'Standardized value'),
    x2 = 'lwr'
)
# compute means of each variable across counties
means = alt.Chart(outlier_plot_df).transform_aggregate(
    group_mean = 'mean(Standardized value)',
    groupby = ['Variable']
).transform_calculate(
    large = 'abs(datum.group_mean) > 3'
).mark_circle(size = 80).encode(
    x = 'group_mean:Q',
    y = 'Variable',
    color = alt.Color('large:N', legend = None)
)
# layer
ticks + grey + means
Question 7
The two variables that clearly set the outlying counties apart from the nation are the percentage of the population using alternative transportation (extremely above average) and the percentage that drive to work (extremely below average). What about those counties explains this?
(Hint: take a peek at the Wikipedia page on transportation in Alaska.)
Type your answer here, replacing this text.
Submission
- Save the notebook.
- Restart the kernel and run all cells. (CAUTION: if your notebook is not saved, you will lose your work.)
- Carefully look through your notebook and verify that all computations execute correctly and all graphics are displayed clearly. You should see no errors; if there are any errors, make sure to correct them before you submit the notebook.
- Download the notebook as an .ipynb file. This is your backup copy.
- Export the notebook as PDF and upload to Gradescope.
To double-check your work, the cell below will rerun all of the autograder tests.
grader.check_all()