There are 933 distinct cities; each row in the dataset is an observation for one city.
n, p = sust.shapen
933
sust.Name.unique().shape
(933,)
Univariate density estimation
Let’s use the environmental sustainability index (Env_Domain) initially. The distribution of environmental sustainability across cities is shown below.
hist = alt.Chart( sust).transform_bin('esi', field ='Env_Domain',bin= alt.Bin(step =0.02)).transform_aggregate( count ='count()', groupby = ['esi']).transform_calculate( Density ='datum.count/(0.02*933)', ESI ='datum.esi + 0.01').mark_bar(size =8).encode( x ='ESI:Q', y ='Density:Q')hist
In lab, you used Altair’s built-in density estimation method. This computes and plots a Gaussian KDE. A common choice for bandwidth is Scott’s rule: \(b = 1.06 \times s \times n^{-1/5}\), where \(s\) is the sample standard deviation. (Altair’s default behavior if no bandwidth is specified is close, but not exactly the same.)
Let’s explore alternative kernels. To do this, we’ll need to compute the density estimate separately before plotting. The statsmodels package has KDE implementations with multiple kernels.
To start, let’s recompute the Gaussian KDE with the same bandwidth just to confirm we obtain the same result.
# compute the gaussian KDE using statsmodelskde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)kde.fit(bw = bw_scott)
<statsmodels.nonparametric.kde.KDEUnivariate at 0x23251a9eb50>
Executing this cell produces an object kde with the density estimate and other quantities of interest stored as attributes. We’ll arrange these into a dataframe:
We’ll use the same data to illustrate multivariate density estimation. For this we’ll consider the joint distribution of environmental and economic sustainability indices. The next cell generates a plot of the values.
alt.Chart( sust).mark_point().encode( x = alt.X('Env_Domain', title ='Environmental sustainability'), y = alt.Y('Econ_Domain', title ='Economic sustainability'))
There are a few options for displaying a 2-D histogram. One is to bin and plot a heatmap:
alt.Chart( sust).mark_rect().encode( x = alt.X('Env_Domain', bin= alt.Bin(maxbins =40), title ='Environmental sustainability'), y = alt.Y('Econ_Domain', bin= alt.Bin(maxbins =40), title ='Economic sustainability'), color = alt.Color('count()', scale = alt.Scale(scheme ='bluepurple'), title ='Number of U.S. cities'))
Another option is to make a bubble chart with the size of the bubble proportional to the count of observations in the corresponding bin.
alt.Chart( sust).mark_circle().encode( x = alt.X('Env_Domain', bin= alt.Bin(maxbins =40), title ='Environmental sustainability'), y = alt.Y('Econ_Domain', bin= alt.Bin(maxbins =40), title ='Economic sustainability'), size = alt.Size('count()', scale = alt.Scale(scheme ='bluepurple')))
The cell below computes a multivariate Gaussian KDE. There is minimal control over bandwidth specification – just a few estimation methods rather than a numerical input. We’ll use the defaults for the purposes of illustration.
# retrieve data as 2-d array (num observations, num variables)fit_ary = sust.loc[:, ['Env_Domain', 'Econ_Domain']].values# compute KDEkde = sm.nonparametric.KDEMultivariate(fit_ary, 'cc')kde
KDE instance
Number of variables: k_vars = 2
Number of samples: nobs = 933
Variable types: cc
BW selection method: normal_reference
To visualize this estimate, a bit of work is required. kde.pdf([x1, x2, ...]) will compute the estimated density at the point x1, x2, ...; so we will make a fine grid spanning the range of the data, compute the estimated density for each grid point, and then construct a graphic.
# resolution of grid (number of points to use along each axis)grid_res =100# find grid point coordinates along each axisx1 = np.linspace(start = sust.Env_Domain.min(), stop = sust.Env_Domain.max(), num = grid_res)x2 = np.linspace(start = sust.Econ_Domain.min(), stop = sust.Econ_Domain.max(), num = grid_res)# generate a mesh from the coordinatesgrid1, grid2 = np.meshgrid(x1, x2, indexing ='ij')grid_ary = np.array([grid1.ravel(), grid2.ravel()]).T# compute the density at each grid pointf_hat = kde.pdf(grid_ary)# rearrange as a dataframegrid_df = pd.DataFrame({'env': grid1.reshape(grid_res**2), 'econ': grid2.reshape(grid_res**2),'density': f_hat})# preview, for understandinggrid_df.head()
env
econ
density
0
0.132467
0.143811
1.325548e-17
1
0.132467
0.150278
6.367762e-17
2
0.132467
0.156745
2.929104e-16
3
0.132467
0.163212
1.290785e-15
4
0.132467
0.169679
5.451826e-15
Since Altair doesn’t have the option to make contour plots, we’ll make a heatmap. To compare this with the data, let’s overlay the heatmap on the bubble chart.
# kdekde_smooth = alt.Chart( grid_df, title ='Gaussian KDE').mark_rect(opacity =0.8).encode( x = alt.X('env', bin= alt.Bin(maxbins = grid_res), title ='Environmental sustainability'), y = alt.Y('econ', bin= alt.Bin(maxbins = grid_res), title ='Economic sustainability'), color = alt.Color('mean(density)', # a bit hacky, but works scale = alt.Scale(scheme ='bluepurple', type='sqrt'), title ='Density'))# histogrambubble = alt.Chart( sust).mark_circle().encode( x = alt.X('Env_Domain', bin= alt.Bin(maxbins =40), title ='Environmental sustainability'), y = alt.Y('Econ_Domain', bin= alt.Bin(maxbins =40), title ='Economic sustainability'), size = alt.Size('count()', scale = alt.Scale(scheme ='bluepurple')))# layerbubble + kde_smooth
Mixture models
Mixture models provide one alternative to kernel density estimation. These parametrize a distribution as a mix of (usually) Gaussian densities:
Above the density \(f\) is a mixture of \(n\) components, where the coefficients \(a_1, \dots, a_n\) are the mixing parameters and each component is a Gaussian density with its own mean and variance. Fitting a mixture model involves estimating the mixing parameters and component means and variances.
The cell below estimates a bivariate mixture model with two components for the joint distribution of environmental sustainability and economic sustainability.
# configure and fit mixture modelgmm = mixture.GaussianMixture(n_components =2, covariance_type ='full')gmm.fit(fit_ary)
GaussianMixture(n_components=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GaussianMixture(n_components=2)
We can inspect the centers of the component densities:
Perhaps more informative are the mixing parameters:
gmm.weights_
array([0.40065273, 0.59934727])
These tell us that the joint distribution is about a 60-40 mixture of the two components.
Now, plotting the result is similar to what we had before:
# evaluate log-likelihoodgrid_df['gmm'] = np.exp(gmm.score_samples(grid_ary))# gmmgmm_smooth = alt.Chart( grid_df, title ='GMM').mark_rect(opacity =0.8).encode( x = alt.X('env', bin= alt.Bin(maxbins = grid_res), title ='Environmental sustainability'), y = alt.Y('econ', bin= alt.Bin(maxbins = grid_res), title ='Economic sustainability'), color = alt.Color('mean(gmm)', # a bit hacky, but works scale = alt.Scale(scheme ='bluepurple', type='sqrt'), title ='Density'))# centers of mixture componentsctr = alt.Chart(centers).mark_point(color ='black', shape ='triangle').encode(x ='env', y ='econ')# layerbubble + gmm_smooth + ctr
A side-by-side comparison shows not much difference between the two, except for a bit more smoothness for the GMM: