There are 933 distinct cities; each row in the dataset is an observation for one city.
n, p = sust.shape
n
933
sust.Name.unique().shape
(933,)
Univariate density estimation
Let’s use the environmental sustainability index (Env_Domain) initially. The distribution of environmental sustainability across cities is shown below.
hist = alt.Chart(sust).transform_bin(
    'esi', field = 'Env_Domain', bin = alt.Bin(step = 0.02)
).transform_aggregate(
    count = 'count()', groupby = ['esi']
).transform_calculate(
    Density = 'datum.count/(0.02*933)', ESI = 'datum.esi + 0.01'
).mark_bar(size = 8).encode(
    x = 'ESI:Q', y = 'Density:Q'
)

hist
In lab, you used Altair’s built-in density estimation method. This computes and plots a Gaussian KDE. A common choice for bandwidth is Scott’s rule: \(b = 1.06 \times s \times n^{-1/5}\), where \(s\) is the sample standard deviation. (Altair’s default behavior if no bandwidth is specified is close, but not exactly the same.)
Let’s explore alternative kernels. To do this, we’ll need to compute the density estimate separately before plotting. The statsmodels package has KDE implementations with multiple kernels.
To start, let’s recompute the Gaussian KDE with the same bandwidth just to confirm we obtain the same result.
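First we need the bandwidth itself; a minimal sketch following Scott's rule as just stated (the name bw_scott is what the next cell expects):

# Scott's rule: 1.06 * sample standard deviation * n^(-1/5)
bw_scott = 1.06 * sust.Env_Domain.std() * n**(-1/5)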
# compute the gaussian KDE using statsmodels
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(bw = bw_scott)
<statsmodels.nonparametric.kde.KDEUnivariate at 0x23251a9eb50>
Executing this cell produces an object kde with the density estimate and other quantities of interest stored as attributes. We’ll arrange these into a dataframe:
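A minimal sketch, using the support and density attributes of the fitted object and column names matching the histogram so the two can be layered (the names kde_df and kde_curve are illustrative):

kde_df = pd.DataFrame({'ESI': kde.support, 'Density': kde.density})

# overlay the KDE curve on the histogram
kde_curve = alt.Chart(kde_df).mark_line(color = 'black').encode(
    x = 'ESI:Q', y = 'Density:Q'
)

hist + kde_curve

From here, trying an alternative kernel is just a matter of the kernel argument when refitting. Note that statsmodels only uses its FFT shortcut for the Gaussian kernel, so fft = False is required otherwise; for example, with the Epanechnikov kernel:

kde.fit(kernel = 'epa', fft = False, bw = bw_scott)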
We’ll use the same data to illustrate multivariate density estimation. For this we’ll consider the joint distribution of environmental and economic sustainability indices. The next cell generates a plot of the values.
alt.Chart(sust).mark_point().encode(
    x = alt.X('Env_Domain', title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', title = 'Economic sustainability')
)
There are a few options for displaying a 2-D histogram. One is to bin and plot a heatmap:
alt.Chart(sust).mark_rect().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    color = alt.Color('count()', scale = alt.Scale(scheme = 'bluepurple'), title = 'Number of U.S. cities')
)
Another option is to make a bubble chart with the size of the bubble proportional to the count of observations in the corresponding bin.
alt.Chart(sust).mark_circle().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    size = alt.Size('count()', scale = alt.Scale(scheme = 'bluepurple'))
)
The cell below computes a multivariate Gaussian KDE. There is minimal control over bandwidth specification – just a few estimation methods rather than a numerical input. We’ll use the defaults for the purposes of illustration.
# retrieve data as 2-d array (num observations, num variables)
fit_ary = sust.loc[:, ['Env_Domain', 'Econ_Domain']].values

# compute KDE
kde = sm.nonparametric.KDEMultivariate(fit_ary, 'cc')
kde
KDE instance
Number of variables: k_vars = 2
Number of samples: nobs = 933
Variable types: cc
BW selection method: normal_reference
To visualize this estimate, a bit of work is required. kde.pdf([x1, x2, ...]) will compute the estimated density at the point x1, x2, ...; so we will make a fine grid spanning the range of the data, compute the estimated density for each grid point, and then construct a graphic.
# resolution of grid (number of points to use along each axis)
grid_res = 100

# find grid point coordinates along each axis
x1 = np.linspace(start = sust.Env_Domain.min(), stop = sust.Env_Domain.max(), num = grid_res)
x2 = np.linspace(start = sust.Econ_Domain.min(), stop = sust.Econ_Domain.max(), num = grid_res)

# generate a mesh from the coordinates
grid1, grid2 = np.meshgrid(x1, x2, indexing = 'ij')
grid_ary = np.array([grid1.ravel(), grid2.ravel()]).T

# compute the density at each grid point
f_hat = kde.pdf(grid_ary)

# rearrange as a dataframe
grid_df = pd.DataFrame({'env': grid1.reshape(grid_res**2),
                        'econ': grid2.reshape(grid_res**2),
                        'density': f_hat})

# preview, for understanding
grid_df.head()
        env      econ       density
0  0.132467  0.143811  1.325548e-17
1  0.132467  0.150278  6.367762e-17
2  0.132467  0.156745  2.929104e-16
3  0.132467  0.163212  1.290785e-15
4  0.132467  0.169679  5.451826e-15
Since Altair doesn’t have the option to make contour plots, we’ll make a heatmap. To compare this with the data, let’s overlay the heatmap on the bubble chart.
# kde heatmap
kde_smooth = alt.Chart(grid_df, title = 'Gaussian KDE').mark_rect(opacity = 0.8).encode(
    x = alt.X('env', bin = alt.Bin(maxbins = grid_res), title = 'Environmental sustainability'),
    y = alt.Y('econ', bin = alt.Bin(maxbins = grid_res), title = 'Economic sustainability'),
    color = alt.Color('mean(density)',  # a bit hacky, but works
                      scale = alt.Scale(scheme = 'bluepurple', type = 'sqrt'),
                      title = 'Density')
)

# histogram (bubble chart)
bubble = alt.Chart(sust).mark_circle().encode(
    x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
    y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
    size = alt.Size('count()', scale = alt.Scale(scheme = 'bluepurple'))
)

# layer
bubble + kde_smooth
Mixture models
Mixture models provide one alternative to kernel density estimation. These parametrize a distribution as a mix of (usually) Gaussian densities:
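\[
f(x) = \sum_{i=1}^{n} a_i \, \varphi(x; \mu_i, \sigma_i^2), \qquad a_i \geq 0, \quad \sum_{i=1}^{n} a_i = 1,
\]

where \(\varphi(\,\cdot\,; \mu, \sigma^2)\) denotes the Gaussian density with mean \(\mu\) and variance \(\sigma^2\).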
Above the density \(f\) is a mixture of \(n\) components, where the coefficients \(a_1, \dots, a_n\) are the mixing parameters and each component is a Gaussian density with its own mean and variance. Fitting a mixture model involves estimating the mixing parameters and component means and variances.
The cell below estimates a bivariate mixture model with two components for the joint distribution of environmental sustainability and economic sustainability.
# configure and fit mixture model
gmm = mixture.GaussianMixture(n_components = 2, covariance_type = 'full')
gmm.fit(fit_ary)
GaussianMixture(n_components=2)
We can inspect the centers of the component densities:
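A sketch that stores the fitted means (the means_ attribute) in a dataframe named centers, with columns env and econ as the plotting code below expects:

centers = pd.DataFrame(gmm.means_, columns = ['env', 'econ'])
centers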
Perhaps more informative are the mixing parameters:
gmm.weights_
array([0.40065273, 0.59934727])
These tell us that the joint distribution is about a 60-40 mixture of the two components.
Now, plotting the result is similar to what we had before:
# evaluate the estimated density on the grid (score_samples returns the log-density)
grid_df['gmm'] = np.exp(gmm.score_samples(grid_ary))

# gmm heatmap
gmm_smooth = alt.Chart(grid_df, title = 'GMM').mark_rect(opacity = 0.8).encode(
    x = alt.X('env', bin = alt.Bin(maxbins = grid_res), title = 'Environmental sustainability'),
    y = alt.Y('econ', bin = alt.Bin(maxbins = grid_res), title = 'Economic sustainability'),
    color = alt.Color('mean(gmm)',  # a bit hacky, but works
                      scale = alt.Scale(scheme = 'bluepurple', type = 'sqrt'),
                      title = 'Density')
)

# centers of mixture components
ctr = alt.Chart(centers).mark_point(color = 'black', shape = 'triangle').encode(
    x = 'env', y = 'econ'
)

# layer
bubble + gmm_smooth + ctr
A side-by-side comparison shows little difference between the two, apart from somewhat greater smoothness in the GMM:
The results look similar, but the estimates are very different in kind. The GMM is a parametric model, whereas the KDE is nonparametric: the KDE uses all of the data (933 values of two variables) to estimate the density at any given location, whereas the GMM uses just 14 parameters.
The GMM is especially useful when there are latent subpopulations in the data. As an example, recall the simulated hawks data from lab 2: body lengths for male and female hawks.
hist_hawks = alt.Chart(samp).transform_bin(
    'length', field = 'length', bin = alt.Bin(step = 2)
).transform_aggregate(
    count = 'count()', groupby = ['length']
).transform_calculate(
    density = 'datum.count/1000', length = 'datum.length + 1'
).mark_bar(size = 20).encode(
    x = 'length:Q', y = 'density:Q'
)

hist_hawks
Now let’s fit a mixture model with two components:
# configure and fit mixture model
gmm_hawks = mixture.GaussianMixture(n_components = 2)
gmm_hawks.fit(samp.length.values.reshape(-1, 1))
GaussianMixture(n_components=2)
Now, amazingly, the mixture components accurately recover the distributions for each sex, even though this was a latent (unobserved) variable:
print('gmm component means: ', gmm_hawks.means_.ravel())
print('population means by sex: ', population_hawks.groupby('sex').mean().values.ravel())
gmm component means: [50.70543698 57.33071086]
population means by sex: [57.5749014 50.48194081]
The same goes for the variances:
print('gmm component variances: ', gmm_hawks.covariances_.ravel())
print('population variances by sex: ', population_hawks.groupby('sex').var().values.ravel())
gmm component variances: [ 8.60692088 10.12549367]
population variances by sex: [9.39645762 8.522606 ]
And the density estimate can be constructed as before, by computing the value of the GMM along a grid of lengths.
# compute a grid of lengths
grid_hawks = np.linspace(population_hawks.length.min(), population_hawks.length.max(), num = 500)

# evaluate the mixture density on the grid
dens = np.exp(gmm_hawks.score_samples(grid_hawks.reshape(-1, 1)))
Here’s a graphic:
gmm_smooth_hawks = alt.Chart(
    pd.DataFrame({'length': grid_hawks, 'density': dens})
).mark_line(color = 'black').encode(
    x = 'length', y = 'density'
)

hist_hawks + gmm_smooth_hawks
Explore
Use the cell below to explore the following questions.
What happens if you fit the GMM with different numbers of components?
Does the solution change if the GMM is re-fitted?
# configure and fit mixture model
gmm_hawks = mixture.GaussianMixture(n_components = 2)
gmm_hawks.fit(samp.length.values.reshape(-1, 1))

# print centers
print('component center(s): ', gmm_hawks.means_)

# compute a grid of lengths
grid_hawks = np.linspace(population_hawks.length.min(), population_hawks.length.max(), num = 500)
dens = np.exp(gmm_hawks.score_samples(grid_hawks.reshape(-1, 1)))

# plot
gmm_smooth_hawks = alt.Chart(
    pd.DataFrame({'length': grid_hawks, 'density': dens})
).mark_line(color = 'black').encode(
    x = 'length', y = 'density'
)

hist_hawks + gmm_smooth_hawks
Kernel smoothing
Kernel smoothing refers to local averaging via a kernel function. It is conceptually very similar to kernel density estimation, but it is used to visualize and estimate relationships rather than distributions. To explore further, we’ll use the GDP and life expectancy data from lab 3.
life = pd.read_csv('life-gdp.csv')
life = life[life.Year == 2000].loc[:, ['Country Name', 'All', 'GDP per capita']]
life['GDP per capita'] = np.log(life['GDP per capita'])
life.rename(columns = {'GDP per capita': 'log_gdp', 'All': 'life_exp', 'Country Name': 'country'}, inplace = True)
Note that the data has been filtered to a single year (2000). A scatter plot of life expectancy against log GDP per capita shows a nonlinear trend.
scatter = alt.Chart(life).mark_point().encode(
    x = alt.X('log_gdp', scale = alt.Scale(zero = False), title = 'log(GDP/capita)'),
    y = alt.Y('life_exp', scale = alt.Scale(zero = False), title = 'Life expectancy')
)

scatter
Kernel smoothing is a technique for estimating the conditional mean of one variable given one or more others – in this case, we’ll estimate the mean life expectancy given GDP per capita. The technique consists in fixing GDP per capita and then averaging life expectancy locally among observations having a similar GDP per capita; a kernel function is used to determine the exact weights.
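The fit can be set up with something like the following sketch using statsmodels’ KernelReg: reg_type = 'lc' requests the locally constant (kernel-weighted average) smoother described above, var_type = 'c' declares a single continuous predictor, and the name kernreg is what the subsequent calls expect.

# kernel smoothing (local constant regression) of life expectancy on log GDP
kernreg = sm.nonparametric.KernelReg(
    endog = life.life_exp.values,  # response: life expectancy
    exog = life.log_gdp.values,    # predictor: log GDP per capita
    var_type = 'c',                # one continuous predictor
    reg_type = 'lc'                # locally constant (kernel average)
)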
The resulting object has a .fit() method that can be used to predict new values, given a GDP per capita:
kernreg.fit(np.array([4.7]))[0]
array([52.08267076])
Computing fits across a grid of GDP values provides a means of visualizing the smoothed trend:
# grid of gdp values
gdp_grid = np.linspace(life.log_gdp.min(), life.log_gdp.max(), num = 100)

# calculate kernel smooth at each value
fitted_values = kernreg.fit(gdp_grid)[0]

# arrange as data frame
pred_df = pd.DataFrame({'log_gdp': gdp_grid, 'life_exp': fitted_values})

# plot
kernel_smooth = alt.Chart(pred_df).mark_line(color = 'black').encode(
    x = 'log_gdp',
    y = alt.Y('life_exp', scale = alt.Scale(zero = False))
)

scatter + kernel_smooth
Notice that the kernel smooth trails off a bit near the boundary – this is because fewer points are being averaged once the smoothing window begins to extend beyond the range of the data.
Locally weighted scatterplot smoothing (LOESS or LOWESS) largely corrects this issue by stitching together lines (or low-order polynomials) fitted to local subsets of data; this way, near the boundary, the estimate still takes account of any trend present in the data.
LOESS still has a smoothing bandwidth; however, in statsmodels.nonparametric.lowess, this is specified by the fraction of the data – and thus the local neighborhood size – used to estimate the mean at any given point.
This is implemented a bit differently than the kernel smoothing, in that the values at which to estimate the mean are supplied at the fitting stage rather than to a subsequent method. The output is simply an array.
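As a sketch (the smoothing fraction frac = 0.3 is an illustrative choice, and the xvals argument, available in recent statsmodels versions, is what supplies the evaluation points at the fitting stage):

# LOWESS estimate of mean life expectancy across the grid of GDP values
loess_fit = sm.nonparametric.lowess(
    endog = life.life_exp.values,  # response
    exog = life.log_gdp.values,    # predictor
    frac = 0.3,                    # fraction of data per local neighborhood
    xvals = gdp_grid               # points at which to estimate the mean
)
loess_fit  # plain array of estimated means, one per grid point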