Smoothing activity

import os
os.environ["OMP_NUM_THREADS"] = "1"
import numpy as np
import pandas as pd
import altair as alt
import statsmodels.api as sm
from statsmodels.nonparametric.kde import kernel_switch
from sklearn import mixture
alt.data_transformers.disable_max_rows()

We’ll work with sustainability index data for US cities to explore density estimation further. These data are imported below:

sust = pd.read_csv('sustainability.csv')
sust.head()
GEOID_MSA Name Econ_Domain Social_Domain Env_Domain Sustain_Index
0 310M300US10100 Aberdeen, SD Micro Area 0.565264 0.591259 0.444472 1.600995
1 310M300US10140 Aberdeen, WA Micro Area 0.427671 0.520744 0.429274 1.377689
2 310M300US10180 Abilene, TX Metro Area 0.481092 0.496874 0.454192 1.432157
3 310M300US10220 Ada, OK Micro Area 0.466566 0.526206 0.425964 1.418737
4 310M300US10300 Adrian, MI Micro Area 0.497908 0.602440 0.282499 1.382847

There are 933 distinct cities; each row in the dataset is an observation for one city.

n, p = sust.shape
n
933
sust.Name.unique().shape
(933,)

Univariate density estimation

Let’s use the environmental sustainability index (Env_Domain) initially. The distribution of environmental sustainability across cities is shown below.

hist = alt.Chart(
    sust
).transform_bin(
    'esi',
    field = 'Env_Domain',
    bin = alt.Bin(step = 0.02)
).transform_aggregate(
    count = 'count()',
    groupby = ['esi']
).transform_calculate(
    Density = 'datum.count/(0.02*933)',
    ESI = 'datum.esi + 0.01'
).mark_bar(size = 8).encode(
    x = 'ESI:Q',
    y = 'Density:Q'
)

hist

In lab, you used Altair’s built-in density estimation method. This computes and plots a Gaussian KDE. A common choice for bandwidth is Scott’s rule: \(b = 1.06 \times s \times n^{-1/5}\), where \(s\) is the sample standard deviation. (Altair’s default behavior if no bandwidth is specified is close, but not exactly the same.)

# Scott's rule bandwidth
sigma_hat = sust.Env_Domain.std()
bw_scott = 1.06*sigma_hat*n**(-1/5)
bw_scott
0.0146178133632948
smooth = alt.Chart(
    sust
).transform_density(
  'Env_Domain',
  as_ = ['Environmental sustainability index', 'Density'],
  extent = [0.1, 0.8],
  bandwidth = bw_scott
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = 'Density:Q'
)

hist + smooth

Let’s explore alternative kernels. To do this, we’ll need to compute the density estimate separately before plotting. The statsmodels package has KDE implementations with multiple kernels.

To start, let’s recompute the Gaussian KDE with the same bandwidth just to confirm we obtain the same result.

# compute the gaussian KDE using statsmodels
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(bw = bw_scott)

Executing this cell produces an object kde with the density estimate and other quantities of interest stored as attributes. We’ll arrange these into a dataframe:

kde_df = pd.DataFrame({'Environmental sustainability index': kde.support, 
                       'Density': kde.density})
kde_df.head()
Environmental sustainability index Density
0 0.088614 0.000609
1 0.089334 0.000621
2 0.090054 0.000645
3 0.090774 0.000682
4 0.091494 0.000731

Plotting the KDE we just computed on top of the Altair version reveals they are identical.

smooth2 = alt.Chart(
    kde_df
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = alt.Y('Density:Q', scale = alt.Scale(domain = [0, 12]))
)

smooth + smooth2

Now let’s try varying the kernel. Here are our options:

kernel_switch.keys()
dict_keys(['gau', 'epa', 'uni', 'tri', 'biw', 'triw', 'cos', 'cos2', 'tric'])

The ‘local histogram’ introduced last lecture is a KDE with a uniform kernel. We can compute and plot it as follows:

# compute density estimate
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(kernel = 'uni', fft = False, bw = 0.02)

# arrange as dataframe
kde_df = pd.DataFrame({'Environmental sustainability index': kde.support, 'Density': kde.density})

# plot
smooth2 = alt.Chart(
    kde_df
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = alt.Y('Density:Q', scale = alt.Scale(domain = [0, 12]))
)

hist + smooth2 + smooth2.mark_area(opacity = 0.3)

Exploration

Use the cell below to experiment and answer the following; a starter sketch comparing two kernels follows the cell.

  • How does the KDE differ if a parabolic (epa) kernel is used in place of a Gaussian (gau) kernel while the bandwidth is held constant?
  • What effect does a triangular kernel (tri) have on how local peaks appear?
  • Pick two kernels. What happens to the KDE under each as the bandwidth grows large?
  • Which kernel seems to do the best job at capturing the shape closely without under-smoothing?
# compute density estimate
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(kernel = 'uni', fft = False, bw = 0.02)

# arrange as dataframe
kde_df = pd.DataFrame({'Environmental sustainability index': kde.support, 'Density': kde.density})

# plot
smooth2 = alt.Chart(
    kde_df
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = alt.Y('Density:Q', scale = alt.Scale(domain = [0, 12]))
)

hist + smooth2 + smooth2.mark_area(opacity = 0.3)
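As a starting point, here is a minimal sketch that overlays KDEs computed with two different kernels at a common, arbitrarily chosen bandwidth (0.02):

# overlay gaussian and parabolic kernels at the same bandwidth
layers = []
for kern, color in [('gau', 'black'), ('epa', 'red')]:
    kde_k = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
    kde_k.fit(kernel = kern, fft = (kern == 'gau'), bw = 0.02)  # fft is only supported for the gaussian kernel
    kde_k_df = pd.DataFrame({'Environmental sustainability index': kde_k.support, 'Density': kde_k.density})
    layers.append(alt.Chart(kde_k_df).mark_line(color = color).encode(
        x = 'Environmental sustainability index:Q',
        y = 'Density:Q'
    ))

hist + layers[0] + layers[1]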

Record your observations here.

Multivariate KDE

We’ll use the same data to illustrate multivariate density estimation. For this we’ll consider the joint distribution of environmental and economic sustainability indices. The next cell generates a plot of the values.

alt.Chart(
    sust
).mark_point().encode(
    x = alt.X('Env_Domain', title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', title = 'Economic sustainability')
)

There are a few options for displaying a 2-D histogram. One is to bin and plot a heatmap:

alt.Chart(
    sust
).mark_rect().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    color = alt.Color('count()', scale = alt.Scale(scheme = 'bluepurple'), title = 'Number of U.S. cities')
)

Another option is to make a bubble chart with the size of the bubble proportional to the count of observations in the corresponding bin.

alt.Chart(
    sust
).mark_circle().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    size = alt.Size('count()')
)

The cell below computes a multivariate Gaussian KDE. Bandwidth control is coarser here – the bw argument selects among a few automatic estimation methods rather than taking a single numerical input. We’ll use the default (normal reference) for the purposes of illustration.

# retrieve data as 2-d array (num observations, num variables)
fit_ary = sust.loc[:, ['Env_Domain', 'Econ_Domain']].values

# compute KDE
kde = sm.nonparametric.KDEMultivariate(fit_ary, 'cc')
kde
KDE instance
Number of variables: k_vars = 2
Number of samples:   nobs = 933
Variable types:      cc
BW selection method: normal_reference
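If the default is too smooth or too wiggly, the selection method can be swapped; for instance, here is a sketch using least-squares cross-validation (this can be slow for 933 observations):

# refit with least-squares cross-validation bandwidth selection (slower than the default)
kde_cv = sm.nonparametric.KDEMultivariate(fit_ary, 'cc', bw = 'cv_ls')
kde_cv.bw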

To visualize this estimate, a bit of work is required. kde.pdf([x1, x2, ...]) will compute the estimated density at the point x1, x2, ...; so we will make a fine grid spanning the range of the data, compute the estimated density for each grid point, and then construct a graphic.

# resolution of grid (number of points to use along each axis)
grid_res = 100

# find grid point coordinates along each axis
x1 = np.linspace(start = sust.Env_Domain.min(), stop = sust.Env_Domain.max(), num = grid_res)
x2 = np.linspace(start = sust.Econ_Domain.min(), stop = sust.Econ_Domain.max(), num = grid_res)

# generate a mesh from the coordinates
grid1, grid2 = np.meshgrid(x1, x2, indexing = 'ij')
grid_ary = np.array([grid1.ravel(), grid2.ravel()]).T

# compute the density at each grid point
f_hat = kde.pdf(grid_ary)

# rearrange as a dataframe
grid_df = pd.DataFrame({'env': grid1.reshape(grid_res**2), 
                        'econ': grid2.reshape(grid_res**2),
                        'density': f_hat})

# preview, for understanding
grid_df.head()
env econ density
0 0.132467 0.143811 1.325548e-17
1 0.132467 0.150278 6.367762e-17
2 0.132467 0.156745 2.929104e-16
3 0.132467 0.163212 1.290785e-15
4 0.132467 0.169679 5.451826e-15

Since Altair doesn’t have the option to make contour plots, we’ll make a heatmap. To compare this with the data, let’s overlay the heatmap on the bubble chart.

# kde
kde_smooth = alt.Chart(
    grid_df, title = 'Gaussian KDE'
).mark_rect(opacity = 0.8).encode(
    x = alt.X('env', bin = alt.Bin(maxbins = grid_res), title = 'Environmental sustainability'),
    y = alt.Y('econ', bin = alt.Bin(maxbins = grid_res), title = 'Economic sustainability'),
    color = alt.Color('mean(density)', # a bit hacky, but works
                      scale = alt.Scale(scheme = 'bluepurple', type = 'sqrt'),
                      title = 'Density')
)

# histogram
bubble = alt.Chart(
    sust
).mark_circle().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    size = alt.Size('count()')
)

# layer
bubble + kde_smooth

Mixture models

Mixture models provide one alternative to kernel density estimation. These parametrize a distribution as a mix of (usually) Gaussian densities:

\[ f(x) = a_1 \varphi(x; \mu_1, \sigma_1) + \cdots + a_K \varphi(x; \mu_K, \sigma_K) \]

Above, the density \(f\) is a mixture of \(K\) components, where the coefficients \(a_1, \dots, a_K\) are the mixing parameters (nonnegative and summing to one) and each component is a Gaussian density with its own mean and variance. Fitting a mixture model involves estimating the mixing parameters and the component means and variances.
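For intuition, here is a minimal sketch that evaluates a two-component univariate mixture density directly from the formula, with made-up parameters:

from scipy.stats import norm

# hypothetical mixture: f(x) = 0.4*phi(x; 0, 1) + 0.6*phi(x; 3, 0.5)
a = np.array([0.4, 0.6])      # mixing parameters (nonnegative, sum to one)
mu = np.array([0.0, 3.0])     # component means
sigma = np.array([1.0, 0.5])  # component standard deviations

# density at a single point x
x = 1.0
np.sum(a*norm.pdf(x, loc = mu, scale = sigma))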

The cell below estimates a bivariate mixture model with two components for the joint distribution of environmental sustainability and economic sustainability.

# configure and fit mixture model
gmm = mixture.GaussianMixture(n_components = 2, covariance_type = 'full')
gmm.fit(fit_ary)
GaussianMixture(n_components=2)

We can inspect the centers of the component densities:

centers = pd.DataFrame(gmm.means_).rename(columns = {0: 'env', 1: 'econ'})
centers
env econ
0 0.398172 0.486619
1 0.440855 0.452837

As well as the variances, which in this case are covariance matrices since we are mixing bivariate densities:

gmm.covariances_
array([[[0.00534737, 0.00067683],
        [0.00067683, 0.00384107]],

       [[0.00058321, 0.00021243],
        [0.00021243, 0.0093566 ]]])

Perhaps more informative are the mixing parameters:

gmm.weights_
array([0.40065273, 0.59934727])

These tell us that the joint distribution is about a 60-40 mixture of the two components.

Now, plotting the result is similar to what we had before:

# evaluate the estimated density (exponentiate the log-density from score_samples)
grid_df['gmm'] = np.exp(gmm.score_samples(grid_ary))

# gmm
gmm_smooth = alt.Chart(
    grid_df, title = 'GMM'
).mark_rect(opacity = 0.8).encode(
    x = alt.X('env', bin = alt.Bin(maxbins = grid_res), title = 'Environmental sustainability'),
    y = alt.Y('econ', bin = alt.Bin(maxbins = grid_res), title = 'Economic sustainability'),
    color = alt.Color('mean(gmm)', # a bit hacky, but works
                      scale = alt.Scale(scheme = 'bluepurple', type = 'sqrt'),
                      title = 'Density')
)

# centers of mixture components
ctr = alt.Chart(centers).mark_point(color = 'black', shape = 'triangle').encode(x = 'env', y = 'econ')

# layer
bubble + gmm_smooth + ctr

A side-by-side comparison shows not much difference between the two, except for a bit more smoothness for the GMM:

(bubble + gmm_smooth + ctr) | (bubble + kde_smooth)

The results look similar, but the estimates are very different. The GMM is a parametric model, whereas the KDE is nonparametric: the KDE uses all of the data (933 values of two variables) to estimate the value of the density at any given location, whereas the GMM uses just 14 parameters.
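We can verify that count directly from the fitted object by tallying the raw numbers it stores:

# raw parameter count: 2 weights + 4 means + 8 covariance entries = 14
gmm.weights_.size + gmm.means_.size + gmm.covariances_.size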

The GMM is especially useful if there are latent subpopulations in the data. To consider an example, recall the simulated hawk body-length data for males and females from lab 2.

# for reproducibility
np.random.seed(40721)

# simulate hypothetical population
female_hawks = pd.DataFrame(
    data = {'length': np.random.normal(loc = 57.5, scale = 3, size = 3000),
            'sex': np.repeat('female', 3000)}
)

male_hawks = pd.DataFrame(
    data = {'length': np.random.normal(loc = 50.5, scale = 3, size = 2000),
            'sex': np.repeat('male', 2000)}
)

population_hawks = pd.concat([female_hawks, male_hawks], axis = 0)

Let’s imagine we have a sample but without recorded sexes.

np.random.seed(50223)
samp = population_hawks.sample(n = 500).drop(columns = 'sex')
samp.head(3)
length
840 62.513960
2747 52.490258
2698 56.879861

The histogram is generated in the following cell:

hist_hawks = alt.Chart(samp).transform_bin(
    'length',
    field = 'length',
    bin = alt.Bin(step = 2)
).transform_aggregate(
    count = 'count()',
    groupby = ['length']
).transform_calculate(
    density = 'datum.count/1000',
    length = 'datum.length + 1'
).mark_bar(size = 20).encode(
    x = 'length:Q', 
    y = 'density:Q'
)

hist_hawks

Now let’s fit a mixture model with two components:

# configure and fit mixture model
gmm_hawks = mixture.GaussianMixture(n_components = 2)
gmm_hawks.fit(samp.length.values.reshape(-1, 1))
GaussianMixture(n_components=2)

Now, amazingly, the mixture components accurately recover the distributions for each sex, even though sex was a latent (unobserved) variable (note that the fitted components come in no particular order):

print('gmm component means: ', gmm_hawks.means_.ravel())
print('population means by sex: ', population_hawks.groupby('sex').mean().values.ravel())
gmm component means:  [50.70543698 57.33071086]
population means by sex:  [57.5749014  50.48194081]

Ditto variances:

print('gmm component variances: ', gmm_hawks.covariances_.ravel())
print('population variances by sex: ', population_hawks.groupby('sex').var().values.ravel())
gmm component variances:  [ 8.60692088 10.12549367]
population variances by sex:  [9.39645762 8.522606  ]
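The mixing weights, too, should roughly recover the sex proportions in the simulated population (3000 female and 2000 male, so about 0.6/0.4); a quick check:

# compare estimated mixing weights with the true sex proportions
print('gmm mixing weights: ', gmm_hawks.weights_)
print('population sex proportions: ', population_hawks.sex.value_counts(normalize = True).values)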

And the density estimate can be constructed as before, by computing the value of the GMM along a grid of lengths.

# compute a grid of lengths
grid_hawks = np.linspace(population_hawks.length.min(), population_hawks.length.max(), num = 500)
dens = np.exp(gmm_hawks.score_samples(grid_hawks.reshape(-1, 1)))

Here’s a graphic:

gmm_smooth_hawks = alt.Chart(
    pd.DataFrame({'length': grid_hawks, 'density': dens})
).mark_line(color = 'black').encode(
    x = 'length',
    y = 'density'
)

hist_hawks + gmm_smooth_hawks

Explore

Use the cell below to explore the following questions; a sketch for varying the number of components follows the cell.

  • What happens if you fit the GMM with different numbers of components?
  • Does the solution change if the GMM is re-fitted?
# configure and fit mixture model
gmm_hawks = mixture.GaussianMixture(n_components = 2)
gmm_hawks.fit(samp.length.values.reshape(-1, 1))

# print centers
print('component center(s): ', gmm_hawks.means_)

# compute a grid of lengths
grid_hawks = np.linspace(population_hawks.length.min(), population_hawks.length.max(), num = 500)
dens = np.exp(gmm_hawks.score_samples(grid_hawks.reshape(-1, 1)))

# plot
gmm_smooth_hawks = alt.Chart(
    pd.DataFrame({'length': grid_hawks, 'density': dens})
).mark_line(color = 'black').encode(
    x = 'length',
    y = 'density'
)

hist_hawks + gmm_smooth_hawks
component center(s):  [[50.70543698]
 [57.33071086]]
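For the first question, a minimal sketch that refits with a hypothetical range of component counts and compares the fitted means:

# refit with varying numbers of components and inspect the fitted means
for k in [1, 2, 3, 4]:
    gmm_k = mixture.GaussianMixture(n_components = k)
    gmm_k.fit(samp.length.values.reshape(-1, 1))
    print(k, 'component mean(s):', np.sort(gmm_k.means_.ravel()))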

Kernel smoothing and LOESS/LOWESS

Kernel smoothing refers to local averaging via a kernel function. It is conceptually very similar to kernel density estimation, but used to visualize and estimate relationships rather than distributions. To explore further, we’ll use the GDP and life expectancy data from lab 3.

life = pd.read_csv('life-gdp.csv')
life = life[life.Year == 2000].loc[:, ['Country Name', 'All', 'GDP per capita']]
life['GDP per capita'] = np.log(life['GDP per capita'])
life.rename(columns = {'GDP per capita': 'log_gdp', 'All': 'life_exp', 'Country Name': 'country'}, inplace=True)

Note that the data has been filtered to a single year (2000). A scatter plot of life expectancy against log GDP per capita shows a nonlinear trend.

scatter = alt.Chart(
    life
).mark_point().encode(
    x = alt.X('log_gdp', scale = alt.Scale(zero = False), title = 'log(GDP/capita)'),
    y = alt.Y('life_exp', scale = alt.Scale(zero = False), title = 'Life expectancy')
)

scatter

Kernel smoothing is a technique for estimating the conditional mean of one variable given one or more others – in this case, we’ll estimate the mean life expectancy given GDP per capita. The technique consists in fixing GDP per capita and then averaging life expectancy locally among observations having a similar GDP per capita; a kernel function is used to determine the exact weights.

\[ \hat{y}(x) = \frac{\sum_i K_b (x_i - x) y_i}{\sum_i K_b(x_i - x)} \]
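For concreteness, here is a minimal sketch of this estimator written directly from the formula, using a Gaussian kernel and a hypothetical bandwidth b = 0.5:

# Nadaraya-Watson kernel smoother, written directly from the formula above
def kernel_smooth(x, x_obs, y_obs, b = 0.5):
    w = np.exp(-0.5*((x_obs - x)/b)**2)  # gaussian kernel weights K_b(x_i - x)
    return np.sum(w*y_obs)/np.sum(w)     # locally weighted average of y

# estimated mean life expectancy at log GDP per capita of 4.7
kernel_smooth(4.7, life.log_gdp.values, life.life_exp.values)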

Statsmodels has a basic implementation of Gaussian kernel smoothing with automated bandwidth selection.

# calculate kernel smoother
kernreg = sm.nonparametric.KernelReg(endog = life.life_exp.values,
                                     exog = life.log_gdp.values,
                                     reg_type = 'lc', 
                                     var_type = 'c',
                                     ckertype = 'gaussian')

The resulting object has a .fit() method that can be used to predict new values, given a GDP per capita:

kernreg.fit(np.array([4.7]))[0]
array([52.08267076])

Computing fits across a grid of GDP values provides a means of visualizing the smoothed trend:

# grid of gdp values
gdp_grid = np.linspace(life.log_gdp.min(), life.log_gdp.max(), num = 100)

# calculate kernel smooth at each value
fitted_values = kernreg.fit(gdp_grid)[0]

# arrange as data frame
pred_df = pd.DataFrame({'log_gdp': gdp_grid, 'life_exp': fitted_values})

# plot
kernel_smooth = alt.Chart(
    pred_df
).mark_line(
    color = 'black'
).encode(
    x = 'log_gdp', 
    y = alt.Y('life_exp', scale = alt.Scale(zero = False))
)

scatter + kernel_smooth

Notice that the kernel smooth trails off a bit near the boundary – this is because fewer points are being averaged once the smoothing window begins to extend beyond the range of the data.

Locally weighted scatterplot smoothing (LOESS or LOWESS) largely corrects this issue by stitching together lines (or low-order polynomials) fitted to local subsets of data; this way, near the boundary, the estimate still takes account of any trend present in the data.

LOESS still has a smoothing bandwidth; however, in statsmodels.nonparametric.lowess, this is specified by the fraction of the data – and thus the local neighborhood size – used to estimate the mean at any given point.

# fit loess smooth
loess = sm.nonparametric.lowess(endog = life.life_exp.values,
                                exog = life.log_gdp.values, 
                                frac = 0.3,
                                xvals = gdp_grid)

This is implemented a bit differently from the kernel smoother, in that the values at which to estimate the mean are supplied at the fitting stage rather than to a subsequent method. The output is simply an array.

loess
array([46.61471755, 47.1126188 , 47.614013  , 48.11841151, 48.6256075 ,
       49.13570439, 49.64937837, 50.16704135, 50.68898288, 51.21485478,
       51.74389177, 52.27504134, 52.8077399 , 53.34760242, 53.91704886,
       54.54786337, 55.19478926, 55.83994579, 56.52728332, 57.2554603 ,
       57.97007332, 58.7000861 , 59.42998619, 60.15512367, 60.85059141,
       61.56033184, 62.25308342, 62.96834892, 63.67934744, 64.34892233,
       65.03093911, 65.72642563, 66.43953074, 67.1028921 , 67.7933037 ,
       68.32817368, 68.83000516, 69.35904659, 69.70340676, 70.00754158,
       70.27036721, 70.51100124, 70.72755901, 70.95969638, 71.204998  ,
       71.44013117, 71.63889295, 71.80527979, 71.93492097, 72.05389361,
       72.1768891 , 72.3256187 , 72.48270357, 72.64484375, 72.81117218,
       72.97897521, 73.08579142, 73.14812934, 73.17795181, 73.21458167,
       73.23503856, 73.25993723, 73.33334876, 73.44490788, 73.61920275,
       73.79274059, 73.96542348, 74.13251518, 74.27773777, 74.42454684,
       74.56444178, 74.68695253, 74.87464526, 75.04629779, 75.25591509,
       75.46689586, 75.6837427 , 75.8977382 , 76.1079128 , 76.29705072,
       76.48355537, 76.66492653, 76.84188514, 77.01494515, 77.18474753,
       77.35183048, 77.51645496, 77.67862598, 77.83819129, 77.99492115,
       78.14861117, 78.29914752, 78.44647616, 78.59070707, 78.73204886,
       78.87083115, 79.00734738, 79.14174607, 79.27412126, 79.40455257])

The curve can be plotted as before:


# store as data frame
loess_df = pd.DataFrame({'log_gdp': gdp_grid, 'life_exp': loess})

# plot
loess_smooth = alt.Chart(
    loess_df
).mark_line(
    color = 'blue',
    strokeDash = [8, 8]
).encode(
    x = 'log_gdp', 
    y = alt.Y('life_exp', scale = alt.Scale(zero = False))
)

scatter + loess_smooth

If we compare the LOESS curve with the kernel smoother, the different behavior on the boundary is evident:

scatter + loess_smooth + kernel_smooth

Explore

Use the cell below to answer the following questions; a sketch sweeping several bandwidth fractions follows the cell.

  • Are there any bandwidths that give you a straight-line fit?
  • What seems to be the minimum bandwidth?
  • Which bandwidth best captures the pattern of scatter?
# fit loess smooth
loess = sm.nonparametric.lowess(endog = life.life_exp.values,
                                exog = life.log_gdp.values, 
                                frac = 0.3,
                                xvals = gdp_grid)

# store as data frame
loess_df = pd.DataFrame({'log_gdp': gdp_grid, 'life_exp': loess})

# plot
loess_smooth = alt.Chart(
    loess_df
).mark_line(
    color = 'black'
).encode(
    x = 'log_gdp', 
    y = alt.Y('life_exp', scale = alt.Scale(zero = False))
)

scatter + loess_smooth
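To explore the bandwidth questions systematically, here is a sketch that sweeps a few hypothetical values of frac and overlays the resulting curves:

# sweep a few smoothing fractions and overlay the fitted curves
loess_layers = []
for f in [0.1, 0.3, 0.6, 1.0]:
    fit_f = sm.nonparametric.lowess(endog = life.life_exp.values,
                                    exog = life.log_gdp.values,
                                    frac = f,
                                    xvals = gdp_grid)
    loess_layers.append(pd.DataFrame({'log_gdp': gdp_grid, 'life_exp': fit_f, 'frac': f}))

loess_sweep = alt.Chart(
    pd.concat(loess_layers)
).mark_line().encode(
    x = 'log_gdp',
    y = alt.Y('life_exp', scale = alt.Scale(zero = False)),
    color = alt.Color('frac:N', title = 'frac')
)

scatter + loess_sweep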