Density estimation, mixture models, and smoothing

PSTAT100 Spring 2023

Invalid Date

Outline for today

  • More on density estimation
    • non-Gaussian kernels
    • multivariate KDE
    • mixture models
  • Scatterplot smoothing
    • Kernel smoothing
    • LOESS

Sustainability data

We’ll work with sustainability index data for US cities to explore density estimation further.

Code
sust = pd.read_csv('data/sustainability.csv')
sust.iloc[:, 1:5].head(2)
Name Econ_Domain Social_Domain Env_Domain
0 Aberdeen, SD Micro Area 0.565264 0.591259 0.444472
1 Aberdeen, WA Micro Area 0.427671 0.520744 0.429274
  • 933 distinct cities
  • indices for sustainability in enviornmental, social, and eonomic domains

Environmental sustainability index

Let’s use the environmental sustainability index (Env_Domain) initially. The distribution of environmental sustainability across cities is shown below.

Code
hist = alt.Chart(
    sust
).transform_bin(
    'esi',
    field = 'Env_Domain',
    bin = alt.Bin(step = 0.02)
).transform_aggregate(
    count = 'count()',
    groupby = ['esi']
).transform_calculate(
    Density = 'datum.count/(0.02*933)',
    ESI = 'datum.esi + 0.01'
).mark_bar(size = 8).encode(
    x = 'ESI:Q',
    y = 'Density:Q'
)

hist.configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)

KDE bandwidth selection

A common choice for Gaussian KDE bandwidth is Scott’s rule:

\[ b = 1.06 \times s \times n^{-1/5} \]

Code
# bandwitdth parameter
n, p = sust.shape
sigma_hat = sust.Env_Domain.std()
bw_scott = 1.06*sigma_hat*n**(-1/5)

# plot
smooth = alt.Chart(
    sust
).transform_density(
  'Env_Domain',
  as_ = ['Environmental sustainability index', 'Density'],
  extent = [0.1, 0.8],
  bandwidth = bw_scott
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = 'Density:Q'
)

(hist + smooth).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)

KDEs with statsmodels

The statsmodels package has KDE implementations with finer control. This will allow us to experiment with other kernels.

# compute the gaussian KDE using statsmodels
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(bw = bw_scott)
<statsmodels.nonparametric.kde.KDEUnivariate at 0x15a506f50>

The object kde has .support (\(x\)) and .density (\(\hat{f}(x)\)) attributes:

Code
kde_df = pd.DataFrame({
    'Environmental sustainability index': kde.support, 
    'Density': kde.density
    })
kde_df.head(3)
Environmental sustainability index Density
0 0.088614 0.000609
1 0.089334 0.000621
2 0.090054 0.000645

Other kernels

Now let’s try varying the kernel. The ‘local histogram’ introduced last lecture is a KDE with a uniform kernel.

Code
# compute density estimate
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(kernel = 'uni', fft = False, bw = 0.02)

# arrange as dataframe
kde_df = pd.DataFrame({'Environmental sustainability index': kde.support, 'Density': kde.density})

# plot
smooth2 = alt.Chart(
    kde_df
).mark_line(color = 'black').encode(
    x = 'Environmental sustainability index:Q',
    y = alt.Y('Density:Q', scale = alt.Scale(domain = [0, 12]))
)

(hist + smooth2 + smooth2.mark_area(opacity = 0.3)).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)

Other kernels

  • Titles indicate statsmodels.nonparametric.KDEUnivariate abbreviations
  • Note different axis scales – not as similar as they look!

Explore

Take 5-6 minutes with your neighbor and use the activity notebook to experiment and answer the following.

  • How does the KDE differ if a parabolic (epa) kernel is used in place of a Gaussian (gau) kernel while the bandwidth is held constant?
  • What effect does a triangular kernel (tri) have on how local peaks appear?
  • Pick two kernels. What will happen to the KDE for large bandwidths?
  • Which kernel seems to do the best job at capturing the shape closely without under-smoothing?

Multivariate KDE

Now let’s estimate the joint distribution of environmental and economic sustainability indices. Here are the values:

Code
alt.Chart(
    sust
).mark_point().encode(
    x = alt.X('Env_Domain', title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', title = 'Economic sustainability')
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
)

Multivariate histograms

There are a few options for displaying a 2-D histogram. One is to bin and plot a heatmap, as we saw before:

Code
alt.Chart(
    sust
).mark_rect().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    color = alt.Color('count()', scale = alt.Scale(scheme = 'bluepurple'), title = 'Number of U.S. cities')
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14, 
    titleFontSize = 16
)

Multivariate histograms

Another option is to make a bubble chart with the size of the bubble proportional to the count of observations in the corresponding bin:

Code
alt.Chart(
    sust
).mark_circle().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    size = alt.Size('count()', scale = alt.Scale(scheme = 'bluepurple'))
).configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_legend(
    labelFontSize = 14, 
    titleFontSize = 16
)