
PSTAT100 Spring 2023


**Covariance and correlation**

- Definitions
- Covariance and correlation matrices

**Eigendecomposition**

- The eigenvalue problem
- Geometric interpretation
- Computations

**Principal components analysis**

- PCA in the low-dimensional setting
- Variation capture and loss
- Interpreting principal components
- Dimension reduction

We’ll use the dataset on the sustainability of U.S. cities introduced last time:

| | GEOID_MSA | Name | Econ_Domain | Social_Domain | Env_Domain | Sustain_Index |
|---|---|---|---|---|---|---|
| 0 | 310M300US10100 | Aberdeen, SD Micro Area | 0.565264 | 0.591259 | 0.444472 | 1.600995 |
| 1 | 310M300US10140 | Aberdeen, WA Micro Area | 0.427671 | 0.520744 | 0.429274 | 1.377689 |

For each **M**etropolitan **S**tatistical **A**rea (MSA), a sustainability index is calculated based on economic, social, and environmental indicators (also indices):

\[\text{sustainability index} = \text{economic} + \text{social} + \text{environmental}\]
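The additive definition can be checked directly against the two rows shown above (a quick sketch; the full `city_sust` frame isn't reproduced here, so the sample frame below is built by hand):

```
import pandas as pd

# the two sample rows from the table above
city_sust_sample = pd.DataFrame({
    'Econ_Domain': [0.565264, 0.427671],
    'Social_Domain': [0.591259, 0.520744],
    'Env_Domain': [0.444472, 0.429274],
    'Sustain_Index': [1.600995, 1.377689],
})

# the index should equal the sum of the three domain indices
domain_sum = city_sust_sample[['Econ_Domain', 'Social_Domain', 'Env_Domain']].sum(axis = 1)
print((domain_sum - city_sust_sample['Sustain_Index']).abs().max())
```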

The domain indices are computed from a large number of development indicator variables.

*If you’re interested,* you can dig deeper on the Sustainable Development Report website, which provides detailed data reports related to the U.N.’s 2030 sustainable development goals.

**Covariation** refers to *the tendency of two variables to change together across observations*. Covariation is about relationships.

```
import altair as alt
import numpy as np
import pandas as pd

# scatterplot of social vs. economic indices (city_sust loaded previously)
econ_social = alt.Chart(city_sust).mark_point().encode(
    x = alt.X('Econ_Domain', scale = alt.Scale(zero = False)),
    y = alt.Y('Social_Domain', scale = alt.Scale(zero = False))
).properties(
    width = 350,
    height = 200
)
econ_social.configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
```

The social and economic indices do seem to *vary together*: higher values of the economic index coincide with higher values of the social index. That’s all there is to it.

Let \((x_1, y_1), \dots, (x_n, y_n)\) denote \(n\) values of two variables, \(X\) and \(Y\).

If \(X\) and \(Y\) tend to vary together, then whenever \(X\) is far from its mean, so is \(Y\): in other words, *their deviations coincide*.

This coincidence (or lack thereof) is measured quantitatively by the (sample) **covariance**:

\[ \text{cov}(\mathbf{x}, \mathbf{y}) = \frac{1}{n - 1}\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y}) \]

Note \(\text{cov}(\mathbf{x}, \mathbf{x}) = \text{var}(\mathbf{x})\).

The sum can be written as an inner product. First, ‘center’ \(\mathbf{x}\) and \(\mathbf{y}\):

\[ \tilde{\mathbf{x}} = \left[\begin{array}{c} x_1 - \bar{x} \\ \vdots \\ x_n - \bar{x} \end{array}\right] \quad \tilde{\mathbf{y}} = \left[\begin{array}{c} y_1 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{array}\right] \]

Then, the sample covariance is:

\[ \text{cov}(\mathbf{x}, \mathbf{y}) = \frac{\tilde{\mathbf{x}}^T \tilde{\mathbf{y}}}{n - 1} \]
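The inner-product form is easy to verify numerically; a small sketch on simulated data (variable names here are illustrative):

```
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size = 50)
y = 0.5 * x + rng.normal(size = 50)

# center both variables
x_tilde = x - x.mean()
y_tilde = y - y.mean()

# covariance as an inner product of the centered vectors
cov_inner = x_tilde @ y_tilde / (len(x) - 1)

# np.cov uses the same n - 1 divisor, so the two should agree
print(np.isclose(cov_inner, np.cov(x, y)[0, 1]))
```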

Covariance is a little tricky to interpret. Is 0.00199 large or small?

It is more useful to compute the (sample) **correlation**:

\[ \text{corr}(\mathbf{x}, \mathbf{y}) = \frac{\text{cov}(\mathbf{x}, \mathbf{y})}{S_x S_y} \]

This is simply a standardized covariance measure.

- \(\text{corr}(\mathbf{x}, \mathbf{y}) = 1, -1\) are the strongest possible correlations
- \(\text{corr}(\mathbf{x}, \mathbf{y}) = 0\) is the weakest possible correlation
- the sign indicates whether \(X\) and \(Y\) vary together or in opposition
- \(\text{corr}(\mathbf{x}, \mathbf{x}) = 1\), since any variable’s deviations coincide perfectly with themselves
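These properties can be checked numerically against `np.corrcoef`; a sketch on simulated data:

```
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size = 100)
y = 2 * x + rng.normal(size = 100)

# correlation: covariance standardized by the two sample SDs
corr = np.cov(x, y)[0, 1] / (x.std(ddof = 1) * y.std(ddof = 1))

# agrees with numpy's built-in correlation matrix
print(np.isclose(corr, np.corrcoef(x, y)[0, 1]))
# any variable is perfectly correlated with itself
print(np.isclose(np.corrcoef(x, x)[0, 1], 1.0))
```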

Standardizing the covariance makes it more interpretable: the correlation indicates that the social and economic indices vary together (positive sign) moderately strongly (about halfway from zero to one).

*This is just a number that quantifies what you already knew from the graphic*: there is a positive relationship.

What we will call ‘correlation’ in this class is known as the *Pearson correlation coefficient*.

There are other correlation measures:

- Spearman correlation: Pearson correlation between *ranks* of observations
- Kendall rank correlation
- Distribution-specific dependence measures (*e.g.*, for circular data)
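For instance, the Spearman correlation is just the Pearson correlation applied to ranks, which pandas can verify directly; a sketch on simulated monotone-but-nonlinear data:

```
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = rng.uniform(size = 100)
# monotone but nonlinear relationship
y = np.exp(3 * x) + rng.normal(scale = 0.5, size = 100)
s = pd.DataFrame({'x': x, 'y': y})

# Spearman correlation = Pearson correlation of the ranks
by_hand = s.rank().corr().loc['x', 'y']
builtin = s.corr(method = 'spearman').loc['x', 'y']
print(np.isclose(by_hand, builtin))
```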

No correlation does **not** imply no relationship – symmetry can produce strongly related but uncorrelated data.

```
np.random.seed(50323)
# simulate observations of x
n = 100
x = np.random.uniform(low = 0, high = 1, size = n)
sim_df = pd.DataFrame({'x': x})
# parameters: parabola roots a, b and scale c
a, b, c = 0.5, 0.5, 3
# noise
noise_sd = 0.1
noise = np.random.normal(loc = 0, scale = noise_sd, size = n)
# simulate observations of y
sim_df['y'] = c*(x - a)*(x - b) + noise
# plot
scatter = alt.Chart(sim_df).mark_point().encode(
    x = 'x',
    y = 'y'
)
# compute correlation
print('correlation: ', sim_df.corr().loc['x', 'y'])
scatter.configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
```

`correlation: 0.07227477481863404`

```
np.random.seed(50323)
# simulate observations of x
n = 100
x = np.random.uniform(low = 0, high = 1, size = n)
sim_df = pd.DataFrame({'x': x})
# noise
noise_sd = 0.2
noise = np.random.normal(loc = 0, scale = noise_sd, size = n)
# simulate observations of y: periodic in x
sim_df['y'] = np.cos(4*np.pi*x) + noise
# plot
scatter = alt.Chart(sim_df).mark_point().encode(
    x = 'x',
    y = 'y'
)
# compute correlation
print('correlation: ', sim_df.corr().loc['x', 'y'])
scatter.configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
```

`correlation: 0.04482962033696271`

Correlation measures the strength of *linear* relationships. But what does “strength” mean exactly?

- \(\text{corr}(\mathbf{x}, \mathbf{y}) = 1\) implies the data lie exactly on a line with positive slope
- \(\text{corr}(\mathbf{x}, \mathbf{y}) = -1\) implies the data lie exactly on a line with negative slope
- \(\text{corr}(\mathbf{x}, \mathbf{y}) = 0\) implies that the best linear fit to the data is a horizontal line
- values near \(1\) or \(-1\) imply the data lie *near* a line with nonzero slope
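The "best linear fit" statements reflect the identity that the least-squares slope equals \(r \cdot s_y / s_x\), so zero correlation forces a horizontal fitted line. A quick numerical check on simulated data (names illustrative):

```
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size = 200)
y = 1 + 0.8 * x + rng.normal(size = 200)

r = np.corrcoef(x, y)[0, 1]
# least-squares slope of y on x
slope = np.polyfit(x, y, 1)[0]

# slope = r * (s_y / s_x)
print(np.isclose(slope, r * y.std(ddof = 1) / x.std(ddof = 1)))
```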

Correlations are affected by outliers – low correlation does **not** imply no relationship.

```
np.random.seed(50423)
# intercept, slope
a, b = 1, -1
# noise
noise_sd = 0.1
noise = np.random.normal(loc = 0, scale = noise_sd, size = n)
# simulate y, then add a single outlier
sim_df['y'] = a + b*x + noise
sim_df.loc[100] = [3, 3]
# plot
scatter = alt.Chart(sim_df).mark_point().encode(
    x = 'x',
    y = 'y'
)
# compute correlation (with the outlier included)
print('correlation: ', sim_df.corr().loc['x', 'y'])
# drop the outlier again for the later examples
sim_df = sim_df.loc[0:99].copy()
scatter.configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
```

`correlation: -0.057700019900550986`

A strong correlation does **not** imply a *meaningful* relationship – it could be practically insignificant.

```
np.random.seed(50423)
# intercept, slope
a, b = -0.002, 0.005
# noise
noise_sd = 0.001
noise = np.random.normal(loc = 0, scale = noise_sd, size = 100)
# simulate y
sim_df['y'] = a + b*x + noise
# plot
scatter = alt.Chart(sim_df).mark_point().encode(
    x = alt.X('x', title = 'rate'),
    y = alt.Y('y', title = 'earnings (USD)')
)
# compute correlation
print('correlation: ', sim_df.corr().loc['x', 'y'])
scatter.configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
```

`correlation: 0.794145106831753`

A weak correlation does **not** imply no linear relationship – it could just be really noisy.

```
np.random.seed(50423)
# intercept, slope
a, b = 1, -3
# noise
noise_sd = 4
noise = np.random.normal(loc = 0, scale = noise_sd, size = 100)
# simulate y
sim_df['y'] = a + b*x + noise
# plot, with a fitted regression line overlaid
scatter = alt.Chart(sim_df).mark_point().encode(
    x = 'x',
    y = 'y'
)
trend = scatter.transform_regression('x', 'y').mark_line()
# compute correlation
print('correlation: ', sim_df.corr().loc['x', 'y'])
(scatter + trend).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)
```

`correlation: -0.2552753897590402`

It helps to have a number to quantify the strength of a relationship.

For instance, which pair is most related? Are some pairs more related than others?

```
# extract the three domain indices
x_mx = city_sust.iloc[:, 2:5]
# long-form dataframe for a scatterplot panel
scatter_df = x_mx.melt(
    var_name = 'row',
    value_name = 'row_index'
).join(
    pd.concat([x_mx, x_mx, x_mx], axis = 0).reset_index()
).drop(
    columns = 'index'
).melt(
    id_vars = ['row', 'row_index'],
    var_name = 'col',
    value_name = 'col_index'
)
# panel of pairwise scatterplots
alt.Chart(scatter_df).mark_point(opacity = 0.4).encode(
    x = alt.X('row_index', scale = alt.Scale(zero = False), title = ''),
    y = alt.Y('col_index', scale = alt.Scale(zero = False), title = '')
).properties(
    width = 150,
    height = 75
).facet(
    column = alt.Column('col', title = ''),
    row = alt.Row('row', title = '')
).resolve_scale(x = 'independent', y = 'independent')
```
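The panel shows the relationships visually; pairwise correlations put a number on each one. Since the real `city_sust` frame isn't reproduced here, a sketch on synthetic stand-in columns:

```
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# synthetic stand-ins for the three domain indices
econ = rng.uniform(0.3, 0.7, size = 200)
social = 0.5 * econ + rng.normal(scale = 0.05, size = 200)
env = rng.uniform(0.3, 0.6, size = 200)
x_mx = pd.DataFrame({'Econ_Domain': econ, 'Social_Domain': social, 'Env_Domain': env})

# pairwise (Pearson) correlation matrix: one number per pair of variables
corr_mx = x_mx.corr()
print(corr_mx.round(3))
```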