Name | Econ_Domain | Social_Domain | Env_Domain | |
---|---|---|---|---|
0 | Aberdeen, SD Micro Area | 0.565264 | 0.591259 | 0.444472 |
1 | Aberdeen, WA Micro Area | 0.427671 | 0.520744 | 0.429274 |
Invalid Date
We’ll work with sustainability index data for US cities to explore density estimation further.
Name | Econ_Domain | Social_Domain | Env_Domain | |
---|---|---|---|---|
0 | Aberdeen, SD Micro Area | 0.565264 | 0.591259 | 0.444472 |
1 | Aberdeen, WA Micro Area | 0.427671 | 0.520744 | 0.429274 |
Let’s use the environmental sustainability index (Env_Domain
) initially. The distribution of environmental sustainability across cities is shown below.
hist = alt.Chart(
sust
).transform_bin(
'esi',
field = 'Env_Domain',
bin = alt.Bin(step = 0.02)
).transform_aggregate(
count = 'count()',
groupby = ['esi']
).transform_calculate(
Density = 'datum.count/(0.02*933)',
ESI = 'datum.esi + 0.01'
).mark_bar(size = 8).encode(
x = 'ESI:Q',
y = 'Density:Q'
)
hist.configure_axis(
labelFontSize = 14,
titleFontSize = 16
)
A common choice for Gaussian KDE bandwidth is Scott’s rule:
\[ b = 1.06 \times s \times n^{-1/5} \]
# bandwitdth parameter
n, p = sust.shape
sigma_hat = sust.Env_Domain.std()
bw_scott = 1.06*sigma_hat*n**(-1/5)
# plot
smooth = alt.Chart(
sust
).transform_density(
'Env_Domain',
as_ = ['Environmental sustainability index', 'Density'],
extent = [0.1, 0.8],
bandwidth = bw_scott
).mark_line(color = 'black').encode(
x = 'Environmental sustainability index:Q',
y = 'Density:Q'
)
(hist + smooth).configure_axis(
labelFontSize = 14,
titleFontSize = 16
)
The statsmodels
package has KDE implementations with finer control. This will allow us to experiment with other kernels.
# compute the gaussian KDE using statsmodels
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(bw = bw_scott)
<statsmodels.nonparametric.kde.KDEUnivariate at 0x15a506f50>
The object kde
has .support
(\(x\)) and .density
(\(\hat{f}(x)\)) attributes:
Now let’s try varying the kernel. The ‘local histogram’ introduced last lecture is a KDE with a uniform kernel.
# compute density estimate
kde = sm.nonparametric.KDEUnivariate(sust.Env_Domain)
kde.fit(kernel = 'uni', fft = False, bw = 0.02)
# arrange as dataframe
kde_df = pd.DataFrame({'Environmental sustainability index': kde.support, 'Density': kde.density})
# plot
smooth2 = alt.Chart(
kde_df
).mark_line(color = 'black').encode(
x = 'Environmental sustainability index:Q',
y = alt.Y('Density:Q', scale = alt.Scale(domain = [0, 12]))
)
(hist + smooth2 + smooth2.mark_area(opacity = 0.3)).configure_axis(
labelFontSize = 14,
titleFontSize = 16
)
statsmodels.nonparametric.KDEUnivariate
abbreviationsTake 5-6 minutes with your neighbor and use the activity notebook to experiment and answer the following.
epa
) kernel is used in place of a Gaussian (gau
) kernel while the bandwidth is held constant?tri
) have on how local peaks appear?Now let’s estimate the joint distribution of environmental and economic sustainability indices. Here are the values:
There are a few options for displaying a 2-D histogram. One is to bin and plot a heatmap, as we saw before:
alt.Chart(
sust
).mark_rect().encode(
x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
color = alt.Color('count()', scale = alt.Scale(scheme = 'bluepurple'), title = 'Number of U.S. cities')
).configure_axis(
labelFontSize = 14,
titleFontSize = 16
).configure_legend(
labelFontSize = 14,
titleFontSize = 16
)
Another option is to make a bubble chart with the size of the bubble proportional to the count of observations in the corresponding bin:
alt.Chart(
sust
).mark_circle().encode(
x = alt.X('Env_Domain', bin = alt.Bin(maxbins = 40), title = 'Environmental sustainability'),
y = alt.Y('Econ_Domain', bin = alt.Bin(maxbins = 40), title = 'Economic sustainability'),
size = alt.Size('count()', scale = alt.Scale(scheme = 'bluepurple'))
).configure_axis(
labelFontSize = 14,
titleFontSize = 16
).configure_legend(
labelFontSize = 14,
titleFontSize = 16
)