Smoothing activity

import os
os.environ["OMP_NUM_THREADS"] = "1"
import numpy as np
import pandas as pd
import altair as alt
import statsmodels.api as sm
from statsmodels.nonparametric.kde import kernel_switch
from sklearn import mixture
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')

We’ll work with sustainability index data for US cities to explore density estimation further. These data are imported below

sust = pd.read_csv('sustainability.csv')
sust.head()
GEOID_MSA Name Econ_Domain Social_Domain Env_Domain Sustain_Index
0 310M300US10100 Aberdeen, SD Micro Area 0.565264 0.591259 0.444472 1.600995
1 310M300US10140 Aberdeen, WA Micro Area 0.427671 0.520744 0.429274 1.377689
2 310M300US10180 Abilene, TX Metro Area 0.481092 0.496874 0.454192 1.432157
3 310M300US10220 Ada, OK Micro Area 0.466566 0.526206 0.425964 1.418737
4 310M300US10300 Adrian, MI Micro Area 0.497908 0.602440 0.282499 1.382847

There are 933 distinct cities; each row in the dataset is an observation for one city.

n, p = sust.shape
n
933
sust.Name.unique().shape
(933,)

Univariate density estimation

Let’s use the environmental sustainability index (Env_Domain) initially. The distribution of environmental sustainability across cities is shown below.

hist = alt.Chart(
    sust
).transform_bin(
    'esi',
    field = 'Env_Domain',
    bin = alt.Bin(step = 0.02)
).transform_aggregate(
    count = 'count()',
    groupby = ['esi']
).transform_calculate(
    Density = 'datum.count/(0.02*933)',
    ESI = 'datum.esi + 0.01'
).mark_bar(size = 8).encode(
    x = 'ESI:Q',
    y = 'Density:Q'
)

hist

In lab, you used Altair’s built-in density estimation method. This computes and plots a Gaussian KDE. A common choice for bandwidth is Scott’s rule: \(b = 1.06 \times s \times n^{-1/5}\), where \(s\) is the sample standard deviation. (Altair’s default behavior if no bandwidth is specified is close, but not exactly the same.)

# default bandwitdth parameter
sigma_hat = sust.Env_Domain.std()
bw_scott = 1.06*sigma_hat*n**(-1/5)
bw_scott
0.0146178133632948
smooth = alt.Chart(
    sust
).transform_density(
  'Env_Domain',
  as_ = ['Environmental sustainability index', 'Density'],
  extent = [0.1, 0.8],
  bandwidth = bw_scott
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = 'Density:Q'
)

hist + smooth

Let’s explore alternative kernels. To do this, we’ll need to compute the density estimate separately before plotting. The statsmodels package has KDE implementations with multiple kernels.

To start, let’s recompute the Gaussian KDE with the same bandwidth just to confirm we obtain the same result.

# compute the gaussian KDE using statsmodels
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(bw = bw_scott)
<statsmodels.nonparametric.kde.KDEUnivariate at 0x23251a9eb50>

Executing this cell produces an object kde with the density estimate and other quantities of interest stored as attributes. We’ll arrange these into a dataframe:

kde_df = pd.DataFrame({'Environmental sustainability index': kde.support, 
                       'Density': kde.density})
kde_df.head()
Environmental sustainability index Density
0 0.088614 0.000609
1 0.089334 0.000621
2 0.090054 0.000645
3 0.090774 0.000682
4 0.091494 0.000731

Plotting the KDE we just computed on top of the Altair version reveals they are identical.

smooth2 = alt.Chart(
    kde_df
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = alt.Y('Density:Q', scale = alt.Scale(domain = [0, 12]))
)

smooth + smooth2

Now let’s try varying the kernel. Here are our options:

kernel_switch.keys()
dict_keys(['gau', 'epa', 'uni', 'tri', 'biw', 'triw', 'cos', 'cos2', 'tric'])

The ‘local histogram’ introduced last lecture is a KDE with a uniform kernel. We can compute and plot it as follows:

# compute density estimate
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(kernel = 'uni', fft = False, bw = 0.02)

# arrange as dataframe
kde_df = pd.DataFrame({'Environmental sustainability index': kde.support, 'Density': kde.density})

# plot
smooth2 = alt.Chart(
    kde_df
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = alt.Y('Density:Q', scale = alt.Scale(domain = [0, 12]))
)

hist + smooth2 + smooth2.mark_area(opacity = 0.3)

Exploration

Use the cell below to experiment and answer the following.

  • How does the KDE differ if a parabolic (epa) kernel is used in place of a Gaussian (gau) kernel while the bandwidth is held constant?
  • What effect does a triangular kernel (tri) have on how local peaks appear?
  • Pick two kernels. What will happen to the KDE for large bandwidths?
  • Which kernel seems to do the best job at capturing the shape closely without under-smoothing?
# compute density estimate
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(kernel = 'uni', fft = False, bw = 0.02)

# arrange as dataframe
kde_df = pd.DataFrame({'Environmental sustainability index': kde.support, 'Density': kde.density})

# plot
smooth2 = alt.Chart(
    kde_df
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = alt.Y('Density:Q', scale = alt.Scale(domain = [0, 12]))
)

hist + smooth2 + smooth2.mark_area(opacity = 0.3)

Record your observations here.

Multivariate KDE

We’ll use the same data to illustrate multivariate density estimation. For this we’ll consider the joint distribution of environmental and economic sustainability indices. The next cell generates a plot of the values.

alt.Chart(
    sust
).mark_point().encode(
    x = alt.X('Env_Domain', title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', title = 'Economic sustainability')
)

There are a few options for displaying a 2-D histogram. One is to bin and plot a heatmap:

alt.Chart(
    sust
).mark_rect().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    color = alt.Color('count()', scale = alt.Scale(scheme = 'bluepurple'), title = 'Number of U.S. cities')
)

Another option is to make a bubble chart with the size of the bubble proportional to the count of observations in the corresponding bin.

alt.Chart(
    sust
).mark_circle().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    size = alt.Size('count()', scale = alt.Scale(scheme = 'bluepurple'))
)

The cell below computes a multivariate Gaussian KDE. There is minimal control over bandwidth specification – just a few estimation methods rather than a numerical input. We’ll use the defaults for the purposes of illustration.

# retrieve data as 2-d array (num observations, num variables)
fit_ary = sust.loc[:, ['Env_Domain', 'Econ_Domain']].values

# compute KDE
kde = sm.nonparametric.KDEMultivariate(fit_ary, 'cc')
kde
KDE instance
Number of variables: k_vars = 2
Number of samples:   nobs = 933
Variable types:      cc
BW selection method: normal_reference

To visualize this estimate, a bit of work is required. kde.pdf([x1, x2, ...]) will compute the estimated density at the point x1, x2, ...; so we will make a fine grid spanning the range of the data, compute the estimated density for each grid point, and then construct a graphic.

# resolution of grid (number of points to use along each axis)
grid_res = 100

# find grid point coordinates along each axis
x1 = np.linspace(start = sust.Env_Domain.min(), stop = sust.Env_Domain.max(), num = grid_res)
x2 = np.linspace(start = sust.Econ_Domain.min(), stop = sust.Econ_Domain.max(), num = grid_res)

# generate a mesh from the coordinates
grid1, grid2 = np.meshgrid(x1, x2, indexing = 'ij')
grid_ary = np.array([grid1.ravel(), grid2.ravel()]).T

# compute the density at each grid point
f_hat = kde.pdf(grid_ary)

# rearrange as a dataframe
grid_df = pd.DataFrame({'env': grid1.reshape(grid_res**2), 
                        'econ': grid2.reshape(grid_res**2),
                        'density': f_hat})

# preview, for understanding
grid_df.head()
env econ density
0 0.132467 0.143811 1.325548e-17
1 0.132467 0.150278 6.367762e-17
2 0.132467 0.156745 2.929104e-16
3 0.132467 0.163212 1.290785e-15
4 0.132467 0.169679 5.451826e-15

Since Altair doesn’t have the option to make contour plots, we’ll make a heatmap. To compare this with the data, let’s overlay the heatmap on the bubble chart.

# kde
kde_smooth = alt.Chart(
    grid_df, title = 'Gaussian KDE'
).mark_rect(opacity = 0.8).encode(
    x = alt.X('env', bin = alt.Bin(maxbins = grid_res), title = 'Environmental sustainability'),
    y = alt.Y('econ', bin = alt.Bin(maxbins = grid_res), title = 'Economic sustainability'),
    color = alt.Color('mean(density)', # a bit hacky, but works
                      scale = alt.Scale(scheme = 'bluepurple', type = 'sqrt'),
                      title = 'Density')
)

# histogram
bubble = alt.Chart(
    sust
).mark_circle().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    size = alt.Size('count()', scale = alt.Scale(scheme = 'bluepurple'))
)

# layer
bubble + kde_smooth

Mixture models

Mixture models provide one alternative to kernel density estimation. These parametrize a distribution as a mix of (usually) Gaussian densities:

\[ f(x) = a_1 \varphi(x; \mu_1, \sigma_1) + \cdots + a_n \varphi(x; \mu_n, \sigma_n) \]

Above the density \(f\) is a mixture of \(n\) components, where the coefficients \(a_1, \dots, a_n\) are the mixing parameters and each component is a Gaussian density with its own mean and variance. Fitting a mixture model involves estimating the mixing parameters and component means and variances.

The cell below estimates a bivariate mixture model with two components for the joint distribution of environmental sustainability and economic sustainability.

# configure and fit mixture model
gmm = mixture.GaussianMixture(n_components = 2, covariance_type = 'full')
gmm.fit(fit_ary)
GaussianMixture(n_components=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

We can inspect the centers of the component densities:

centers = pd.DataFrame(gmm.means_).rename(columns = {0: 'env', 1: 'econ'})
centers
env econ
0 0.398172 0.486619
1 0.440855 0.452837

As well as the variances, which in this case are matrices since we are mixing bivariate densities:

gmm.covariances_
array([[[0.00534737, 0.00067683],
        [0.00067683, 0.00384107]],

       [[0.00058321, 0.00021243],
        [0.00021243, 0.0093566 ]]])

Perhaps more informative are the mixing parameters:

gmm.weights_
array([0.40065273, 0.59934727])

These tell us that the joint distribution is about a 60-40 mixture of the two components.

Now, plotting the result is similar to what we had before:

# evaluate log-likelihood
grid_df['gmm'] = np.exp(gmm.score_samples(grid_ary))

# gmm
gmm_smooth = alt.Chart(
    grid_df, title = 'GMM'
).mark_rect(opacity = 0.8).encode(
    x = alt.X('env', bin = alt.Bin(maxbins = grid_res), title = 'Environmental sustainability'),
    y = alt.Y('econ', bin = alt.Bin(maxbins = grid_res), title = 'Economic sustainability'),
    color = alt.Color('mean(gmm)', # a bit hacky, but works
                      scale = alt.Scale(scheme = 'bluepurple', type = 'sqrt'),
                      title = 'Density')
)

# centers of mixture components
ctr = alt.Chart(centers).mark_point(color = 'black', shape = 'triangle').encode(x = 'env', y = 'econ')

# layer
bubble + gmm_smooth + ctr

A side-by-side comparison shows not much difference between the two, except for a bit more smoothness for the GMM:

(bubble + gmm_smooth + ctr) | (bubble + kde_smooth)