```
# Initialize Otter
import otter
= otter.Notebook("lab5-pca.ipynb") grader
```

# Lab 5: Principal components

```
import numpy as np
import pandas as pd
import altair as alt
from scipy import linalg
from statsmodels.multivariate.pca import PCA
# disable row limit for plotting
alt.data_transformers.disable_max_rows()# uncomment to ensure graphics display with pdf export
# alt.renderers.enable('mimetype')
```

Principal components analysis (PCA) is a widely-used multivariate analysis technique. Depending on the application, PCA is variously described as:

- a dimension reduction method
- a an approximation method
- a latent factor model
- a filtering or compression method

The core technique of PCA is *finding linear data transformations that preserve variance*.

What does it mean to say that ‘principal components are linear data transformations’? Suppose you have a dataset with \(n\) observations and \(p\) variables. We can represent the values as a data matrix \(\mathbf{X}\) with \(n\) rows and \(p\) columns:

\[ \mathbf{X} = \underbrace{\left[\begin{array}{cccc} \mathbf{x}_1 &\mathbf{x}_2 &\cdots &\mathbf{x}_p \end{array}\right]}_{\text{column vectors}} = \left[\begin{array}{cccc} x_{11} &x_{12} &\cdots &x_{1p} \\ x_{21} &x_{22} &\cdots &x_{2p} \\ \vdots &\vdots &\ddots &\vdots \\ x_{n1} &x_{n2} &\cdots &x_{np} \end{array}\right] \]

To say that the principal components are linear data transformations means that each principal component is of the form:

\[ \text{PC} = \mathbf{Xv} = v_1 \mathbf{x}_1 + v_2 \mathbf{x}_2 + \cdots + v_p \mathbf{x}_p \]

for some vector \(\mathbf{v}\). In PCA, the following terminology is used:

- linear combination coefficients \(v_j\) are known as
*loadings* - values of the linear combinations are known as
*scores* - the vector of loadings \(\mathbf{v}\) is known as a
*principal axis*

As discussed in lecture, the values of the loadings are found by decomposing the correlation structure.

**Objectives**

In this lab, you’ll focus on computing and interpreting principal components:

- finding the loadings (linear combination coefficients) for each PC;
- quantifying the variation captured by each PC;
- visualization-based techniques for selecting a number of PC’s to A(nalyze);
- plotting and interpreting loadings.

You’ll work with a selection of county summaries from the 2010 U.S. census. The first few rows of the dataset are shown below:

```
# import tidy county-level 2010 census data
= pd.read_csv('data/census2010.csv', encoding = 'latin1')
census census.head()
```

The observational units are U.S. counties, and each row is an observation on one county. The values are, for the most part, percentages of the county population. You can find variable descriptions in the metadata file `census2010metadata.csv`

in the data directory.

# Correlations

PCA identifies variable combinations that capture covariation by decomposing the correlation matrix. So, to start with, let’s examine the correlation matrix for the 2010 county-level census data to get a sense of which variables tend to vary together.

The correlation matrix is a matrix of all pairwise correlations between variables. If \(x_ij\) denotes the value for the \(i\)th observation of variable \(j\), then the entry at row \(j\) and column \(k\) of the correlation matrix \(\mathbf{R}\) is:

\[r_{jk} = \frac{\sum_i (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{S_j S_k}\]

In the census data, the `State`

and `County`

columns indicate the geographic region for each observation; essentially, they are a row index. So we’ll drop them before computing the matrix \(\mathbf{R}\):

```
# store quantitative variables separately
= census.drop(columns = ['State', 'County']) x_mx
```

From here, the matrix is simple to compute in pandas using `.corr()`

:

```
# correlation matrix
= x_mx.corr() corr_mx
```

The matrix can be inspected directly to determine which variables vary together. For example, we could look at the correlations between employment rate and every other variable in the dataset by extracting the `Employed`

column from the correlation matrix and sorting the correlations:

```
# correlation between employment rate and other variables
'Employed'].sort_values() corr_mx.loc[:,
```

Recall that correlation is a number in the interval [-1, 1] whose magnitude indicates the strength of the linear relationship between variables:

- correlations near -1 are
*strongly negative*, and mean that the variables*tend to vary in opposition* - correlations near 1 are
*strongly positive*, and mean that the variables*tend to vary together*

From examining the output above, it can be seen that the percentage of the county population that is employed is:

- strongly
*negatively*correlated with child poverty, poverty, and unemployment, meaning it*tends to vary in opposition*with these variables - strongly
*positively*correlated with income per capita, meaning it*tends to vary together*with this variable

If instead we wanted to look up the correlation between just two variables, we could retrieve the relevant entry directly using `.loc[...]`

with the variable names:

```
# correlation between employment and income per capita
'Employed', 'IncomePerCap'] corr_mx.loc[
```

So across U.S. counties employment is, perhaps unsurprisingly, strongly and positively correlated with income per capita, meaning that higher employment rates tend to coincide with higher incomes per capita.

### Question 1

Find the correlation between the poverty rate and demographic minority rate and store the value as `pov_dem_rate`

. Interpret the value in context.

*Type your answer here, replacing this text.*

```
# correlation between poverty and percent minority
= ...
pov_dem_rate
# print
pov_dem_rate
```

`"q1") grader.check(`

While direct inspection is useful, it can be cumbersome to check correlations for a large number of variables this way. A heatmap – a colored image of the matrix – provides a (sometimes) convenient way to see what’s going on without having to examine the numerical values directly. The cell below shows one way of constructing this plot. Notice the diverging color scale; this should always be used.

```
# melt corr_mx
= corr_mx.reset_index().rename(
corr_mx_long = {'index': 'row'}
columns
).melt(= 'row',
id_vars = 'col',
var_name = 'Correlation'
value_name
)
# construct plot
alt.Chart(corr_mx_long).mark_rect().encode(= alt.X('col', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
x = alt.Y('row', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
y = alt.Color('Correlation',
color = alt.Scale(scheme = 'blueorange', # diverging gradient
scale = (-1, 1), # ensure white = 0
domain type = 'sqrt'), # adjust gradient scale
= alt.Legend(tickCount = 5)) # add ticks to colorbar at 0.5 for reference
legend = 300, height = 300) ).properties(width
```

### Question 2

Which variable is self employment rate most *positively* correlated with? Refer to the heatmap.

*Type your answer here, replacing this text.*

# Computing principal components

Each principal component is of the form:

\[\text{PC}_{i} = \sum_j v_j x_{ij} \quad(\text{PC score for observation } i)\]

The loading \(v_j\) for each component indicate which variables are most influential (heavily weighted) on that principal axis, and thus offer an indirect picture of which variables are driving variation and covariation in the original data.

## Loadings and scores

In `statsmodels`

, the module `multivariate.pca`

contains an easy-to-use implementation.

```
# compute principal components
= PCA(data = x_mx, standardize = True) pca
```

Most quantities you might want to use in PCA can be retrieved as attributes of `pca`

. In particular:

`.loadings`

contains the loadings`.scores`

contains the scores`.eigenvals`

contains the variances along each principal axis (see lecture notes)

Examine the loadings below. Each column gives the loadings for one principal component; components are ordered from largest to smallest variance.

```
# inspect loadings
pca.loadings
```

Similarly, inspect the scores below and check your understanding; each row is an observation and the columns give the scores on each principal axis.

```
# inspect scores
pca.scores
```

Importantly, `statsmodels`

rescales the scores so that they have unit inner product; in other words, so that the variances are all \(\frac{1}{n - 1}\).

```
# variance of scores
pca.scores.var()
```

```
# for comparison
1/(x_mx.shape[0] - 1)
```

To change this behavior, set `normalize = False`

when computing the principal components.

### Question 3

Check your understanding. Which variable contributes most to the sixth principal component? Store the variable name exactly as it appears among the original column names as `pc6_most_influential_variable`

, and store the corresponding loading as `pc6_most_influential_variable_loading`

. Print the variable name.

```
# find most influential variable
= ...
pc6_most_influential_variable
# find loading
= ...
pc6_most_influential_variable_loading
# print
...
```

`"q3") grader.check(`

## Variance ratios

The *variance ratios* indicate the proportions of total variance in the data captured by each principal axis. You may recall from lecture that the variance ratios are computed from the eigenvalues of the correlation (or covariance, if data are not standardized) matrix.

When using `statsmodels`

, these need to be computed manually.

```
# compute variance ratios
= pca.eigenvals/pca.eigenvals.sum()
var_ratios
# print
var_ratios
```

Note again that the principal components have been computed in order of *decreasing* variance.

### Question 4

Check your understanding. What proportion of variance is captured *jointly* by the first three components taken together? Provide a calculation to justify your answer.

*Type your answer here, replacing this text.*

` ...`

## Selecting a subset of PCs

PCA generally consists of choosing a small subset of components. The basic strategy for selecting this subset is to determine how many are needed to capture some analyst-chosen minimum portion of total variance in the original data.

Most often this assessment is made graphically by inspecting the variance ratios and their cumulative sum, *i.e.*, the amount of total variation captured jointly by subsets of successive components. We’ll store these quantities in a data frame.

```
# store proportion of variance explained as a dataframe
= pd.DataFrame({
pca_var_explained 'Component': np.arange(1, 23),
'Proportion of variance explained': var_ratios})
# add cumulative sum
'Cumulative variance explained'] = var_ratios.cumsum()
pca_var_explained[
# print
pca_var_explained.head()
```

Now we’ll make a dual-axis plot showing, on one side, the proportion of variance explained (y) as a function of component (x), and on the other side, the cumulative variance explained (y) also as a function of component (x). Make sure that you’ve completed Q1(a) before running the next cell.

```
# encode component axis only as base layer
= alt.Chart(pca_var_explained).encode(
base = 'Component')
x
# make a base layer for the proportion of variance explained
= base.encode(
prop_var_base = alt.Y('Proportion of variance explained',
y = alt.Axis(titleColor = '#57A44C'))
axis
)
# make a base layer for the cumulative variance explained
= base.encode(
cum_var_base = alt.Y('Cumulative variance explained', axis = alt.Axis(titleColor = '#5276A7'))
y
)
# add points and lines to each base layer
= prop_var_base.mark_line(stroke = '#57A44C') + prop_var_base.mark_point(color = '#57A44C')
prop_var = cum_var_base.mark_line() + cum_var_base.mark_point()
cum_var
# layer the layers
= alt.layer(prop_var, cum_var).resolve_scale(y = 'independent')
var_explained_plot
# display
var_explained_plot
```

The purpose of making this plot is to quickly determine the fewest number of principal components that capture a considerable portion of variation and covariation. ‘Considerable’ here is a bit subjective.

### Question 5

How many principal components explain more than 6% of total variation individually? Store this number as `num_pc`

, and store the proportion of variation that they capture jointly as `var_explained`

.

```
# number of selected components
= ...
num_pc
# variance explained
= ...
var_explained
#print
print('number selected: ', num_pc)
print('proportion of variance captured: ', var_explained)
```

`"q5") grader.check(`

## Interpreting loadings

Now that you’ve chosen the number of components to work with, the next step is to examine loadings to understand just *which* variables the components combine with significant weight.

We’ll store the scores for the components you selected as a dataframe.

```
# subset loadings
= pca.loadings.iloc[:, 0:num_pc]
loading_df
# rename columns
= loading_df.rename(columns = dict(zip(loading_df.columns, ['PC' + str(i) for i in range(1, num_pc + 1)])))
loading_df
# print
loading_df.head()
```

Again, the loadings are the *weights* with which the variables are combined to form the principal components. For example, the `PC1`

column tells us that this component is equal to:

\[(-0.020055\times\text{women}) + (0.289614\times\text{white}) + (0.050698\times\text{citizen}) + \dots\]

Since the components together capture over half the total variation, the heavily weighted variables in the selected components are the ones that drive variation in the original data.

By visualizing the loadings, we can see which variables are most influential for each component, and thereby also which variables seem to drive total variation in the data.

```
# melt from wide to long
= loading_df.reset_index().melt(
loading_plot_df = 'index',
id_vars = 'Principal Component',
var_name = 'Loading'
value_name = {'index': 'Variable'})
).rename(columns
# add a column of zeros to encode for x = 0 line to plot
'zero'] = np.repeat(0, len(loading_plot_df))
loading_plot_df[
# create base layer
= alt.Chart(loading_plot_df)
base
# create lines + points for loadings
= base.mark_line(point = True).encode(
loadings = alt.X('Variable', title = ''),
y = 'Loading',
x = 'Principal Component'
color
)
# create line at zero
= base.mark_rule().encode(x = alt.X('zero', title = 'Loading'), size = alt.value(0.05))
rule
# layer
= (loadings + rule).properties(width = 120)
loading_plot
# show
= alt.Column('Principal Component', title = '')) loading_plot.facet(column
```

Look first at PC1: the variables with the largest loadings (points farthest in either direction from the zero line) are Child Poverty (positive), Employed (negative), Income per capita (negative), Poverty (positive), and Unemployment (positive). We know from exploring the correlation matrix that employment rate, unemployment rate, and income per capita are all related, and similarly child poverty rate and poverty rate are related. Therefore, the positively-loaded variables are all measuring more or less the same thing, and likewise for the negatively-loaded variables.

Essentially, then, PC1 is predominantly (but not entirely) a representation of income and poverty. In particular, counties have a higher value for PC1 if they have lower-than-average income per capita and higher-than-average poverty rates, and a smaller value for PC1 if they have higher-than-average income per capita and lower-than-average poverty rates.

### A system for loading interpretation

Often interpreting principal components can be difficult, and sometimes there’s no clear interpretation available! That said, it helps to have a system instead of staring at the plot and scratching our heads. Here is a semi-systematic approach to interpreting loadings:

- Divert your attention away from the zero line.
- Find the largest positive loading, and list all variables with similar loadings.
- Find the largest negative loading, and list all variables with similar loadings.
- The principal component represents the difference between the average of the first set and the average of the second set.
- Try to come up with a description of less than 4 words.

This system is based on the following ideas: * a high loading value (negative or positive) indicates that a variable strongly influences the principal component; * a negative loading value indicates that + increases in the value of a variable *decrease* the value of the principal component + and decreases in the value of a variable *increase* the value of the principal component; * a positive loading value indicates that + increases in the value of a variable *increase* the value of the principal component + and decreases in the value of a variable *decrease* the value of the principal component; * similar loadings between two or more variables indicate that the principal component reflects their *average*; * divergent loadings between two sets of variables indicates that the principal component reflects their *difference*.

### Question 6

Work with your neighbor to interpret PC2. Come up with a name for the component and explain which variables are most influential.

*Type your answer here, replacing this text.*

## Standardization

Data are typically standardized because otherwise the variables on the largest scales tend to dominate the principal components, and most of the time PC1 will capture the majority of the variation. However, that is artificial. In the census data, income per capita has the largest magnitudes, and thus, the highest variance.

```
# three largest variances
= False).head(3) x_mx.var().sort_values(ascending
```

When PCs are computed without normalization, the total variation is mostly just the variance of income per capita because it is orders of magnitude larger than the variance of any other variable. But that’s just because of the *scale* of the variable – incomes per capita are large numbers – not a reflection that it varies more or less than the other variables.

Run the cell below to see what happens to the variance ratios if the data are not normalized.

```
# recompute pcs without normalization
= PCA(data = x_mx, standardize = False)
pca_unscaled
# show variance ratios for first three pcs
0:3]/pca_unscaled.eigenvals.sum() pca_unscaled.eigenvals[
```

Further, let’s look at the loadings when data are not standardized:

```
# subset loadings
= pca_unscaled.loadings.iloc[:, 0:2]
unscaled_loading_df
# rename columns
= unscaled_loading_df.rename(
unscaled_loading_df = dict(zip(unscaled_loading_df.columns, ['PC' + str(i) for i in range(1, 3)]))
columns
)
# melt from wide to long
= unscaled_loading_df.reset_index().melt(
unscaled_loading_plot_df = 'index',
id_vars = 'Principal Component',
var_name = 'Loading'
value_name
).rename(= {'index': 'Variable'}
columns
)
# add a column of zeros to encode for x = 0 line to plot
'zero'] = np.repeat(0, len(unscaled_loading_plot_df))
unscaled_loading_plot_df[
# create base layer
= alt.Chart(unscaled_loading_plot_df)
base
# create lines + points for loadings
= base.mark_line(point = True).encode(
loadings = alt.X('Variable', title = ''),
y = 'Loading',
x = 'Principal Component'
color
)
# create line at zero
= base.mark_rule().encode(x = alt.X('zero', title = 'Loading'), size = alt.value(0.05))
rule
# layer
= (loadings + rule).properties(width = 120, title = 'Loadings from unscaled PCA')
loading_plot
# show
= alt.Column('Principal Component', title = '')) loading_plot.facet(column
```

Notice that the variables with nonzero loadings in unscaled PCA are simply the three variables with the largest variances.

```
# three largest variances
= False).head(3) x_mx.var().sort_values(ascending
```

# Exploratory analysis based on PCA

Now that we have the principal components, we can use them for exploratory data visualizations. To this end, let’s retrieve the scores from the components you selected:

```
# subset scores
= pca.scores.iloc[:, 0:num_pc]
score_df
# rename columns
= score_df.rename(
score_df = dict(zip(score_df.columns, ['PC' + str(i) for i in range(1, num_pc + 1)]))
columns
)
# add state and county
'State', 'County']] = census[['State', 'County']]
score_df[[
# print
score_df.head()
```

The PC’s can be used to construct scatterplots of the data and search for patterns. We’ll illustrate that by identifying some outliers. The cell below plots PC2 (employment type) against PC4 (carpooling?):

```
# base chart
= alt.Chart(score_df)
base
# data scatter
= base.mark_point(opacity = 0.2).encode(
scatter = alt.X('PC2:Q', title = 'Self-employment PC'),
x = alt.Y('PC4:Q', title = 'Carpooling PC')
y
)
# show
scatter
```

Notice that there are a handful of outlying points in the upper right region away from the dense scatter. What are those?

In order to inspect the outlying counties, we first need to figure out how to identify them. The outlying values have a large *sum* of PC2 and PC4. We can distinguish them by finding a cutoff value for the sum; a simple quantile will do.

```
# find cutoff value
= (score_df.PC2 + score_df.PC4)
pc2_pc4_sum = pc2_pc4_sum.quantile(0.99999)
cutoff
# store outlying rows using cutoff
= score_df[(-score_df.PC2 + score_df.PC4) > cutoff]
outliers
# plot outliers in red
= alt.Chart(outliers).mark_circle(
pts = 'red',
color = 0.3
opacity
).encode(= 'PC2',
x = 'PC4'
y
)
# layer
+ pts scatter
```

Notice that almost all the outlying counties are remote regions of Alaska:

` outliers`

What sets them apart? The cell below retrieves the normalized data and county name for the outlying rows, and then plots the Standardized values of each variable for all 9 counties as vertical ticks, along with a point indicating the mean for the outlying counties. This plot can be used to determine which variables are over- or under-average for the outlying counties relative to the nation by simply locating means that are far from zero in either direction.

```
= (x_mx - x_mx.mean())/x_mx.std()
x_ctr
# retrieve normalized data for outlying rows
= x_ctr.loc[outliers.index.values].join(
outlier_data 'County']]
census.loc[outliers.index.values, [
)
# melt to long format for plotting
= outlier_data.melt(
outlier_plot_df = 'County',
id_vars = 'Variable',
var_name = 'Standardized value'
value_name
)
# plot ticks for values (x) for each variable (y)
= alt.Chart(outlier_plot_df).mark_tick().encode(
ticks = 'Standardized value',
x = 'Variable'
y
)
# shade out region within 3SD of mean
= alt.Chart(
grey
pd.DataFrame('Variable': x_ctr.columns,
{'upr': np.repeat(3, 22),
'lwr': np.repeat(-3, 22)}
)= 0.2, color = 'gray').encode(
).mark_area(opacity = 'Variable',
y = alt.X('upr', title = 'Standardized value'),
x = 'lwr'
x2
)
# compute means of each variable across counties
= alt.Chart(outlier_plot_df).transform_aggregate(
means = 'mean(Standardized value)',
group_mean = ['Variable']
groupby
).transform_calculate(= 'abs(datum.group_mean) > 3'
large = 80).encode(
).mark_circle(size = 'group_mean:Q',
x = 'Variable',
y = alt.Color('large:N', legend = None)
color
)
# layer
+ grey + means ticks
```

### Question 7

The two variables that clearly set the outlying counties apart from the nation are the percentage of the population using alternative transportation (extremely above average) and the percentage that drive to work (extremely below average). What about those counties explains this?

(*Hint:* take a peek at the Wikipedia page on transportation in Alaska.)

*Type your answer here, replacing this text.*

# Submission

- Save the notebook.
- Restart the kernel and run all cells. (
**CAUTION**: if your notebook is not saved, you will lose your work.) - Carefully look through your notebook and verify that all computations execute correctly and all graphics are displayed clearly. You should see
**no errors**; if there are any errors, make sure to correct them before you submit the notebook. - Download the notebook as an
`.ipynb`

file. This is your backup copy. - Export the notebook as PDF and upload to Gradescope.

To double-check your work, the cell below will rerun all of the autograder tests.

` grader.check_all()`