Principal components

PSTAT100 Spring 2023


From last time

Facts:

  • correlation is a measure of (linear) dependence
  • correlation matrices – arrays of pairwise correlations – are a common summary for multivariate data
  • a basis can be thought of as an alternative coordinate system
  • the eigendecomposition of a \(p \times p\) correlation matrix yields a basis for \(\mathbb{R}^p\) on which the standardized data are uncorrelated

Principal components

Let \(\mathbf{X}\) be an \(n \times p\) data matrix and \(\mathbf{V}\) be the matrix of eigenvectors of \(\text{corr}(\mathbf{X})\).

The principal components (of \(\mathbf{X}\)) are the coordinates of each observation on the eigenbasis \(\mathbf{V}\), computed from the standardized data \(\mathbf{Z}\):

\[ \text{principal components of }\mathbf{X} = \mathbf{ZV} \]

The name comes from the observation that \(\mathbf{V}\) gives the ‘main directions’ of variability in the data.
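
A minimal computational sketch of this definition (assuming a numeric DataFrame `x_mx` like the indices used in the example below):

import numpy as np

# standardize, since the correlation matrix is being decomposed
z = (x_mx - x_mx.mean()) / x_mx.std()

# eigendecomposition of the correlation matrix (eigh: for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(z.corr().values)

# reorder so the first direction has the largest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# principal components: coordinates of the standardized observations on the eigenbasis
pcs = z.values @ eigenvectors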

A remark

The eigenbasis from the correlation matrix can also be recovered from the singular value decomposition of \(\mathbf{Z}\).

\[ \mathbf{Z} = \mathbf{UDV'} \quad\Longrightarrow\quad \mathbf{Z'Z} = \mathbf{V}\underbrace{(\mathbf{D'U'UD})}_{\mathbf{\Lambda}}\mathbf{V'} \]

Most implementations use SVD, so don’t be surprised if you see this more often than eigendecomposition.
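
A quick numerical check of this relationship, as a sketch on simulated data (here `Z` is any centered and scaled matrix):

import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size = (100, 3))
Z = (Z - Z.mean(axis = 0)) / Z.std(axis = 0, ddof = 1)

# eigendecomposition of Z'Z and SVD of Z
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
U, d, Vt = np.linalg.svd(Z, full_matrices = False)

# the squared singular values match the eigenvalues of Z'Z
print(np.allclose(np.sort(d**2), np.sort(eigvals)))

The right singular vectors in Vt match the eigenvectors up to sign and ordering.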

PCA in the low-dimensional setting

Let’s work through finding the principal components for \(p = 2\) variables. Consider: \[ \mathbf{X} = [\mathbf{x}_1 \; \mathbf{x}_2] \qquad\text{where}\qquad \mathbf{x}_1 = \text{social index} \;\text{and}\; \mathbf{x}_2 = \text{economic index} \]

To get the correlation matrix, first center and scale each column: \(\mathbf{Z} = \{z_{ij}\}\) where \(z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}\).
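
In pandas this is a one-liner (the same standardization used later with scikit-learn):

# center and scale each column of the data
z_mx = (x_mx - x_mx.mean())/x_mx.std()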

Code
# plotting and data-handling imports (assumed loaded earlier in the notebook)
import altair as alt
import numpy as np
import pandas as pd

# scatterplot of unscaled data
raw = alt.Chart(x_mx).mark_point(opacity = 0.1, color = 'black').encode(
    x = alt.X('Social_Domain', scale = alt.Scale(domain = [0.1, 0.8])),
    y = alt.Y('Econ_Domain', scale = alt.Scale(domain = [0.1, 0.8]))
).properties(width = 200, height = 200, title = 'original data (X)')

# mean vector
mean = alt.Chart(
    pd.DataFrame(x_mx.mean()).transpose()
).mark_circle(color = 'red', size = 100).encode(
    x = alt.X('Social_Domain'),
    y = alt.Y('Econ_Domain')
)

# scatterplot of centered and scaled data
centered = alt.Chart(z_mx).mark_point(opacity = 0.1, color = 'black').encode(
    x = alt.X('Social_Domain'),
    y = alt.Y('Econ_Domain')
).properties(width = 200, height = 200, title = 'centered and scaled (Z)')

# mean vector
mean_ctr = alt.Chart(
    pd.DataFrame(z_mx.mean()).transpose()
).mark_circle(color = 'red', size = 100).encode(
    x = alt.X('Social_Domain'),
    y = alt.Y('Econ_Domain')
)

# lines at zero
axbase = alt.Chart(
    pd.DataFrame({'Social_Domain': 0, 'Econ_Domain': 0}, index = [0])
)
ax1 = axbase.mark_rule().encode(x = 'Social_Domain')
ax2 = axbase.mark_rule().encode(y = 'Econ_Domain')

# layer and concatenate the two panels
fig1 = (raw + mean) | (centered + mean_ctr + ax1 + ax2)

fig1.configure_axis(
    domain = False,
    labelFontSize = 14,
    titleFontSize = 16
    ).configure_title(
        fontSize = 16
    )

PCA in the low-dimensional setting

Now we’ll compute the eigendecomposition.

Code
x_mx = x_mx.iloc[:, 0:2]
z_mx = z_mx.iloc[:, 0:2]

# compute correlation mx
r_mx = z_mx.transpose().dot(z_mx)/(len(z_mx) - 1)

# eigendecomposition (linalg assumed to be scipy.linalg; numpy.linalg.eig behaves similarly)
from scipy import linalg
r_eig = linalg.eig(r_mx.values)

# show PC directions
directions = pd.DataFrame(r_eig[1], columns = ['PC1_Direction', 'PC2_Direction'], index = r_mx.columns)
directions
PC1_Direction PC2_Direction
Econ_Domain 0.707107 -0.707107
Social_Domain 0.707107 0.707107

The ‘principal component directions’ are simply the eigenvectors.

PCA in the low-dimensional setting

Let’s plot the principal component directions on the centered and scaled data.

Code
# store directions as dataframe for plotting
direction_df = pd.DataFrame(
    np.vstack([np.zeros(2), r_eig[1][:, 0], np.zeros(2), r_eig[1][:, 1]]),
    columns = ['Econ_Domain', 'Social_Domain']
).join(
    pd.Series(np.repeat(['PC1', 'PC2'], 2), name = 'PC direction')
)

# plot directions as vectors
eigenbasis = alt.Chart(direction_df).mark_line(order = False).encode(
    x = 'Social_Domain', 
    y = 'Econ_Domain', 
    color = alt.Color('PC direction', scale = alt.Scale(scheme = 'dark2'))
)

# combine with scatter
centered_plot = (centered.properties(width = 300, height = 300) + mean_ctr + ax1 + ax2)

(centered_plot + eigenbasis).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_title(
    fontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)

Where do these directions point relative to the scatter?

Geometry of PCA

Now scale the directions by the corresponding eigenvalues (plotting \(\lambda_1\mathbf{v}_1\) and \(\lambda_2\mathbf{v}_2\) instead of \(\mathbf{v}_1\) and \(\mathbf{v}_2\)).

Code
# scale directions by eigenvalues
direction_df = pd.DataFrame(
    np.vstack([np.zeros(2), r_eig[1][:, 0]*r_eig[0][0].real, np.zeros(2), r_eig[1][:, 1]*r_eig[0][1].real]),
    columns = ['Econ_Domain', 'Social_Domain']
).join(
    pd.Series(np.repeat(['PC1', 'PC2'], 2), name = 'PC direction')
)

# repeat plot
eigenbasis = alt.Chart(direction_df).mark_line(order = False).encode(
    x = 'Social_Domain', 
    y = 'Econ_Domain', 
    color = alt.Color('PC direction', scale = alt.Scale(scheme = 'dark2'))
)

(centered_plot + eigenbasis).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_title(
    fontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)

The principal component directions are the axes along which the data vary most, and the eigenvalues give the magnitude of that variation.

Projection

So if we wanted a single quantity that captures variability in both dimensions, we could:

  • project the data onto the first principal component direction

  • treat the projected values as a new, derived variable

Code
# project data onto pc directions
pcdata = z_mx.dot(r_eig[1]).rename(columns = {0: 'PC1', 1: 'PC2'})

# scatterplot
projection = alt.Chart(pcdata).mark_point(opacity = 0.1, color = 'black').encode(
    x = alt.X('PC1', title = '', axis = None),
    y = 'PC2'
).properties(
    width = 300, 
    height = 200,
    title = 'Projected data'
)

# layer with univariate tick plot of pc1 values
pc1 = alt.Chart(pcdata).mark_tick(color = '#1B9E77').encode(x = 'PC1').properties(width = 300)

(projection & pc1).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_title(fontSize = 16)

Is it a problem to just drop PC2?

Quantifying variation capture/loss

To figure out how much variation/covariation is captured and lost, we need to know how much is available in the first place.

One measure of the total variation in \(\mathbf{X}\) is given by the trace of the correlation matrix \(\mathbf{R}\), which equals the sum of its eigenvalues:

\[ \text{total variation} = \text{tr}\left(\mathbf{R}\right) = \sum_{i = 1}^p \lambda_i \]

Now let \(\mathbf{y}_j = \mathbf{Zv}_j\) be the \(j\)th principal component. Its variance is:

\[ \frac{\mathbf{y}_j'\mathbf{y}_j}{n - 1} = \frac{\mathbf{v}_j'\mathbf{Z'Zv}_j}{n - 1} = \mathbf{v}_j'\mathbf{V\Lambda V'v}_j = \mathbf{e}_j'\mathbf{\Lambda e}_j = \lambda_j \]
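
As a quick sanity check in the running example (a sketch reusing `z_mx` and `r_eig` from the code above), the sample variances of the projected data should match the eigenvalues:

# project the standardized data onto the eigenbasis
pcs = z_mx.dot(r_eig[1])

# sample variances of the PCs (ddof = 1) vs. the eigenvalues
print(pcs.var(axis = 0).values)
print(r_eig[0].real)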

Quantifying variation capture/loss

So the total variation is the sum of the eigenvalues, and the variance of each PC is the corresponding eigenvalue.

We can therefore define the proportion of total variation explained by the \(j\)th principal component as:

\[ \frac{\lambda_j}{\sum_{i = 1}^p \lambda_i} \]

This is sometimes also called the variance ratio for the \(j\)th principal component.

Quantifying variation capture/loss

So in our example, the variance ratios are:

# compute correlation mx
r_mx = z_mx.transpose().dot(z_mx)/(len(z_mx) - 1)

# eigendecomposition
r_eig = linalg.eig(r_mx.values)

# store eigenvalues as real array
eigenvalues = r_eig[0].real 

# variance ratios
eigenvalues/eigenvalues.sum() 
array([0.76574532, 0.23425468])
  • the first principal component captures 77% of total variation
  • the second captures 23% of total variation

Interpreting principal components

So we have obtained a single derived variable that captures about three quarters (77%) of the total variation.

But what is the meaning of the derived variable? What does it represent in the context of the data?

The values of the first principal component (green ticks) are given by projecting each standardized observation onto \(\mathbf{v}_1\): \[ \text{PC1}_i = \mathbf{z}_i'\mathbf{v}_1 = 0.7071 \times\text{economic}_i + 0.7071 \times\text{social}_i \] (where the economic and social indices are centered and scaled).

So the principal component is a linear combination of the underlying variables.

  • the coefficients \((0.7071, 0.7071)\) are known as loadings
  • the values are known as principal component scores

In this case, the PC1 loadings are equal, so this principal component is (up to a constant factor) simply the average of the social and economic indices.
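
A quick numerical check of this interpretation (a sketch reusing `z_mx` and `r_eig`, and assuming the first eigenvector comes out with positive signs, as in the output above):

import numpy as np

# PC1 scores vs. the common loading times the sum of the standardized indices
pc1 = z_mx.dot(r_eig[1][:, 0])
scaled_avg = 0.70710678 * (z_mx['Econ_Domain'] + z_mx['Social_Domain'])

print(np.allclose(pc1, scaled_avg))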

Interpreting principal components

The loadings for the second component are:

r_eig[1][:, 1]
array([-0.70710678,  0.70710678])

\[ \text{PC2}_i = \mathbf{z}_i'\mathbf{v}_2 = 0.7071 \times\text{social}_i - 0.7071 \times\text{economic}_i \]

How would you interpret this quantity?

Packages for PCA

In scikit-learn, one must preprocess the data by centering and scaling:

from sklearn.decomposition import PCA

# center and scale data
z_mx = (x_mx - x_mx.mean())/x_mx.std()

# compute principal components
pca = PCA(n_components = 2)
pca.fit(z_mx)

Check that the results match (the loadings agree with those above up to sign, which is arbitrary):

# loadings
pca.components_
array([[-0.70710678, -0.70710678],
       [ 0.70710678, -0.70710678]])

The variance ratios are stored as an attribute:

# variance ratios
pca.explained_variance_ratio_
array([0.76574532, 0.23425468])

Packages for PCA

In scikit-learn, to retrieve the scores – the coordinates on the principal axes – one must ‘transform’ input data:

# get scores
pca.transform(z_mx)
array([[-1.29591927,  0.32681298],
       [ 0.98357823,  0.34852919],
       [ 0.93478265,  1.17632266],
       ...,
       [ 0.61009199, -0.73354446],
       [-0.13641772, -0.34803313],
       [ 3.44303576, -0.86984466]])
  • useful if you want to compute PC’s from one dataset and then apply the transformation to a new collection of observations (a sketch follows below)
  • inconvenient if all you want are the scores of the input data
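
For example, here is a sketch of that workflow; `new_obs` is a hypothetical pair of new observations, which must be standardized using the means and standard deviations of the original data before transforming:

import pandas as pd

# hypothetical new observations on the original scale
new_obs = pd.DataFrame({'Econ_Domain': [0.50, 0.45], 'Social_Domain': [0.55, 0.50]})

# center and scale with the statistics of the data used to fit the PCA
new_z = (new_obs - x_mx.mean()) / x_mx.std()

# scores of the new observations on the already-fitted principal axes
pca.transform(new_z)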

Packages for PCA

In statsmodels, the implementation is a bit more streamlined:

from statsmodels.multivariate.pca import PCA
pca = PCA(x_mx, normalize = False, standardize = True)
pca.loadings
comp_0 comp_1
Econ_Domain -0.707107 -0.707107
Social_Domain -0.707107 0.707107
  • input here is x_mx – no need to preprocess
  • standardize = True will center and scale the input data x_mx
  • normalize = True rescales the scores (not the input data)
  • see documentation for additional control arguments

Packages for PCA

For this implementation the scores of the input data are retained as an attribute .scores:

pca.scores.head(3)
comp_0 comp_1
0 -1.296614 -0.326988
1 0.984106 -0.348716
2 0.935284 -1.176954

The variance ratios, however, must be calculated manually:

pca.eigenvals/pca.eigenvals.sum()
0    0.765745
1    0.234255
Name: eigenvals, dtype: float64

Reconstruction and approximation

Interestingly, statsmodels has a .project() method, which will compute a reconstruction of the original data from a specified number of PC’s.

If we use both PC’s…

pca.project(ncomp = 2).head(3)
Econ_Domain Social_Domain
0 0.565264 0.591259
1 0.427671 0.520744
2 0.481092 0.496874

We get the input data back again:

x_mx.head(3)
Econ_Domain Social_Domain
0 0.565264 0.591259
1 0.427671 0.520744
2 0.481092 0.496874

Reconstruction and approximation

If we use only one PC, we get a low-dimensional approximation:

pca.project(ncomp = 1).head(3)
Econ_Domain Social_Domain
0 0.545347 0.601274
1 0.406431 0.531424
2 0.409404 0.532919

Compare with input data:

x_mx.head(3)
Econ_Domain Social_Domain
0 0.565264 0.591259
1 0.427671 0.520744
2 0.481092 0.496874
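
Under the hood, the rank-\(k\) reconstruction is just the scores on the first \(k\) components mapped back through the corresponding loadings and then un-standardized. A sketch of the same computation done by hand (values may differ slightly from .project() depending on the scaling convention statsmodels uses internally):

# reconstruct the standardized data from the first component only
z_hat = pca.scores.iloc[:, :1] @ pca.loadings.iloc[:, :1].T

# undo the centering and scaling to return to the original scale
(z_hat * x_mx.std() + x_mx.mean()).head(3)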

Effect of standardization

If the data are not standardized, then the covariance matrix is decomposed instead of the correlation matrix.

# refit without standardization
pca_unscaled = PCA(x_mx, standardize = False, normalize = False)

# examine
print('variance ratios: ', pca_unscaled.eigenvals.values/pca_unscaled.eigenvals.sum())
pca_unscaled.loadings
variance ratios:  [0.86663761 0.13336239]
comp_0 comp_1
Econ_Domain -0.952189 -0.305509
Social_Domain -0.305509 0.952189
  • The economic index gets upweighted significantly
  • The first component captures more variance

Effect of standardization

When the data are not standardized, the method is susceptible to problems of scale.

In our example, the economic index became more ‘important’, but this is only because it has a larger variance:

x_mx.cov()
Econ_Domain Social_Domain
Econ_Domain 0.007428 0.001985
Social_Domain 0.001985 0.001878

So when the covariance (rather than the correlation) matrix is decomposed, economic sustainability accounts for a larger share of the total variation simply because it dominates the variance-covariance matrix, and it does so only because of its scale.

Effect of standardization

Visually, here is the difference.

Code
direction_df = pd.DataFrame(
    np.vstack([np.zeros(2), r_eig[1][:, 0]*r_eig[0][0].real, np.zeros(2), r_eig[1][:, 1]*r_eig[0][1].real]),
    columns = ['Econ_Domain', 'Social_Domain']
).join(
    pd.Series(np.repeat(['PC1', 'PC2'], 2), name = 'PC direction')
)

# repeat plot
eigenbasis = alt.Chart(direction_df).mark_line(order = False).encode(
    x = 'Social_Domain', 
    y = 'Econ_Domain', 
    color = alt.Color('PC direction', scale = alt.Scale(scheme = 'dark2'))
)

c_eig = linalg.eig(x_mx.cov())
c_eigenval_std = c_eig[0].real/c_eig[0].real.sum()

direction_df_cov = pd.DataFrame(
    30*np.vstack([np.zeros(2), c_eig[1][:, 0]*c_eig[0][0].real, np.zeros(2), c_eig[1][:, 1]*c_eig[0][1].real]) + x_mx.mean().values,
    columns = ['Econ_Domain', 'Social_Domain']
).join(
    pd.Series(np.repeat(['PC1', 'PC2'], 2), name = 'PC direction')
)

eigenbasis_cov = alt.Chart(direction_df_cov).mark_line(order = False).encode(
    x = 'Social_Domain', 
    y = 'Econ_Domain', 
    color = alt.Color('PC direction', scale = alt.Scale(scheme = 'dark2'))
)

(raw + eigenbasis_cov | centered + eigenbasis).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_title(
    fontSize = 16
).configure_legend(
    labelFontSize = 14,
    titleFontSize = 16
)
  • Principal axes are sensitive to scale
  • If not standardized, the principal components will upweight variables with larger variances

Terminology

A quick review of PCA terminology:

  • the principal components are the eigenbasis vectors, also called the principal axes or principal directions

  • the loadings are the entries of the eigenvectors: the coefficient of each variable in each principal direction

  • the scores are the data values projected onto the principal axes

    • some refer to scores as principal components instead
  • the variance ratios are the proportions of total variance explained by each component/direction

PCA in higher dimensions

PCA is rarely used for bivariate data; it is a multivariate technique.

The technique is this:

  • compute all principal components
  • examine the variance ratios
  • select a subset of components that collectively explain ‘enough’ variance (see the sketch below)
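
As a generic sketch of that workflow (here `data` stands for any numeric DataFrame, and the 80% threshold is an arbitrary choice for illustration):

import numpy as np
from sklearn.decomposition import PCA

# center and scale, then compute all p components
z = (data - data.mean()) / data.std()
pca = PCA(n_components = z.shape[1]).fit(z)

# variance ratios and their cumulative sum
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# keep enough components to explain 'enough' (here 80%) of total variation
n_keep = int(np.argmax(cumulative >= 0.8) + 1)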

PCA in higher dimensions

Now that we’ve reviewed the basic technique, we can consider cases with a larger number of variables.

This is really where PCA is most useful.

  • It can help answer the question: which variables are driving variation in the data?

  • It can help reduce data dimensions by finding combinations of variables that preserve variation.

  • It can provide a means of visualizing high-dimensional data.

Example application

So, let’s look at a different example: world development indicators.

Code
wdi = pd.read_csv('data/wdi-data.csv').iloc[:, 2:].set_index('country')
wdi.head(3)
inequality_adjusted_life inequality_adjusted_education maternal_mortality parliament_pct_women labor_participation_women labor_participation_men total_pop urban_pct_pop pop_under5 pop_15to64 ... total_unemployment youth_unemployment prison_5yr trade foreign_investment net_migration immigrant_pop tourists_2018 internet_users_2018 mobile_users_2018
country
Norway 0.931 0.908 2 40.8 60.4 67.2 5.4 82.6 0.055556 0.648148 ... 3.3 9.3 80 72.1 0.4 5.3 16.1 5688 96.5 107.2
Ireland 0.926 0.892 5 24.3 56.0 68.4 4.9 63.4 0.061224 0.653061 ... 4.9 13.1 82 239.2 -20.4 4.9 17.1 10926 84.5 103.2
Switzerland 0.947 0.883 5 38.6 62.9 73.8 8.6 73.8 0.046512 0.662791 ... 4.6 7.4 77 119.4 -2.6 6.1 29.9 10362 89.7 129.6

3 rows × 31 columns

Computation via sklearn

Recall that this implementation requires centering and scaling the data first ‘by hand’.

from sklearn.decomposition import PCA

# center and scale
wdi_ctr = (wdi - wdi.mean())/wdi.std()

# compute principal components
pca = PCA(31)
pca.fit(wdi_ctr)

Variance ratios

Selecting the number of principal components to use is somewhat subjective, but always based on the variance ratios and their cumulative sum:

Code
# store proportion of variance explained as a dataframe
pcvars = pd.DataFrame({'Proportion of variance explained': pca.explained_variance_ratio_})

# add component number as a new column
pcvars['Component'] = np.arange(1, 32)

# add cumulative variance explained as a new column
pcvars['Cumulative variance explained'] = pcvars.iloc[:, 0].cumsum(axis = 0)

# encode component axis only as base layer
base = alt.Chart(pcvars).encode(
    x = 'Component')

# make a base layer for the proportion of variance explained
prop_var_base = base.encode(
    y = alt.Y('Proportion of variance explained',
              axis = alt.Axis(titleColor = '#57A44C'))
)

# make a base layer for the cumulative variance explained
cum_var_base = base.encode(
    y = alt.Y('Cumulative variance explained', axis = alt.Axis(titleColor = '#5276A7'))
)

# add points and lines to each base layer
line = alt.Chart(pd.DataFrame({'Component': [2.5]})).mark_rule(opacity = 0.3, color = 'red').encode(x = 'Component')
prop_var = prop_var_base.mark_line(stroke = '#57A44C') + prop_var_base.mark_point(color = '#57A44C') + line
cum_var = cum_var_base.mark_line() + cum_var_base.mark_point() + line

# layer the layers
variance_plot = alt.layer(prop_var, cum_var).resolve_scale(y = 'independent') 

variance_plot.properties(height = 200, width = 400).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)

In this case, the variance ratios drop off after 2 components. These first two capture about half of total variation in the data.
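
The ‘about half’ figure is just the cumulative sum of the first two variance ratios:

# proportion of total variation captured by the first two components
pca.explained_variance_ratio_[:2].sum()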

Loading plots

Examining the loadings graphically can help to interpret the components.

Code
# store the loadings as a data frame with appropriate names
loading_df = pd.DataFrame(pca.components_).transpose().rename(
    columns = {0: 'PC1', 1: 'PC2'}
).loc[:, ['PC1', 'PC2']]

# add a column with the taxon names
loading_df['indicator'] = wdi_ctr.columns.values

# melt from wide to long
loading_plot_df = loading_df.melt(
    id_vars = 'indicator',
    var_name = 'PC',
    value_name = 'Loading'
)

# create base layer with encoding
base = alt.Chart(loading_plot_df).encode(
    y = alt.Y('indicator', title = ''),
    x = 'Loading',
    color = 'PC'
)

# store horizontal line at zero
rule = alt.Chart(pd.DataFrame({'Loading': 0}, index = [0])).mark_rule().encode(x = 'Loading', size = alt.value(2))

# layer points + lines + rule to construct loading plot
loading_plot = base.mark_point() + base.mark_line() + rule

loading_plot.properties(width = 300, height = 400).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)

Visualization

Finally, we might want to use the first two principal components to visualize the data.

Code
# project pcdata onto first two components; store as data frame
projected_data = pd.DataFrame(pca.fit_transform(wdi_ctr)).iloc[:, 0:2].rename(columns = {0: 'PC1', 1: 'PC2'})

# add index and reset
projected_data.index = wdi_ctr.index
projected_data = projected_data.reset_index()

# append one of the original variables
projected_data['gdppc'] = wdi.gdp_percapita.values
projected_data['pop'] = wdi.total_pop.values

# base chart
base = alt.Chart(projected_data)

# data scatter
scatter = base.mark_point().encode(
    x = alt.X('PC1:Q', title = 'Mortality'),
    y = alt.Y('PC2:Q', title = 'Labor'),
    color = alt.Color('gdppc', 
                      bin = alt.Bin(maxbins = 6), 
                      scale = alt.Scale(scheme = 'blues'), 
                      title = 'GDP per capita'),
    size = alt.Size('pop',
                   scale = alt.Scale(type = 'sqrt'),
                   title = 'Population')
).properties(width = 400, height = 400)

# text labels
label = projected_data.sort_values(by = 'gdppc', ascending = False).head(4)

text = alt.Chart(label).mark_text(align = 'left', dx = 3).encode(
    x = alt.X('PC1:Q', title = 'Mortality'),
    y = alt.Y('PC2:Q', title = 'Labor'),
    text = 'country'
)

(scatter + text).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
  • Often it’s helpful to merge the principal components with the original data and apply visualization techniques you already know to search for interesting patterns.

Perspectives on PCA

Since in practice only a subset of the principal components is retained, PCA amounts to projecting multivariate data onto a lower-dimensional subspace.

There are two main perspectives on the meaning of this:

  1. Low-rank approximation of the correlation structure
    • This is the perspective we’ve taken in this class
  2. Latent regression or optimization problem
    • Find the axes that maximize the variance of the projected data (formalized below)
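
The second perspective can be stated as a constrained optimization: each principal axis maximizes the variance of the projected (standardized) data, subject to unit length and orthogonality with the earlier axes,

\[ \mathbf{v}_1 = \arg\max_{\|\mathbf{v}\| = 1} \text{var}(\mathbf{Zv}) = \arg\max_{\|\mathbf{v}\| = 1} \mathbf{v}'\mathbf{Rv} \qquad\text{and}\qquad \mathbf{v}_j = \arg\max_{\substack{\|\mathbf{v}\| = 1 \\ \mathbf{v} \perp \mathbf{v}_1, \dots, \mathbf{v}_{j-1}}} \mathbf{v}'\mathbf{Rv}. \]

The eigenvectors of \(\mathbf{R}\) solve exactly this problem, which is why the two perspectives coincide.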

Other applications of PCA

PCA can also be used as a filtering technique for image data.

Noisy handwritten digits

Reconstruction from 12 PC’s computed from the pixel values.
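
A sketch of this kind of filtering using scikit-learn’s built-in digits data (the noise level and the use of the 8×8 digits dataset are assumptions for illustration; the slide’s example uses 12 components):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 handwritten digit images, flattened to 64 pixel values each
pixels = load_digits().data

# corrupt the images with Gaussian noise
rng = np.random.default_rng(0)
noisy = pixels + rng.normal(scale = 4, size = pixels.shape)

# project onto 12 principal components, then map the scores back to pixel space
pca = PCA(n_components = 12).fit(noisy)
filtered = pca.inverse_transform(pca.transform(noisy))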

Other applications of PCA

In a similar vein, it can be used as a compression technique.

  • Full-size images are roughly 3K pixels
  • Projecting pixel values onto 150 PC’s and then reconstructing the data from this subset yields a heavily compressed image