The term and spirit of exploratory data analysis (EDA) are attributed to John Tukey, whose philosophically leaning work in statistics in the 1960s and 1970s stressed the need for more data-driven methods.
For a long time I have thought I was a statistician, interested in inferences from the particular to the general … All in all, I have come to feel that my central interest is in data analysis [which] is a larger and more varied field than inference. (Tukey, 1962)
EDA is an initial stage of non-inferential and largely model-free analysis aiming at understanding the structure, patterns, and particularities present in a dataset.
Experimental data usually serve the role of evidence for or against prespecified hypotheses.
\[ \text{hypothesis} \longrightarrow \text{data} \longrightarrow \text{inference} \]
For example, in vaccine efficacy trials, trial data are collected precisely to affirm or refute the hypothesis of no effect:
Vaccine group | Placebo group |
---|---|
11 cases | 185 cases |
\(\hat{P}(\text{case is in the vaccine group}) = 0.056 \quad\Longrightarrow\quad \text{evidence of effect}\)
By contrast, observational data more often serve the role of information about some phenomenon.
For example, a secondary trial objective might be to assess safety by gathering observational data on side effects; for this there is no prespecified hypothesis.
The picture that most practitioners have of modern data science is that EDA precedes confirmatory data analysis (CDA).
Aside: Historically, statistics has focused on CDA – and therefore a lot of your PSTAT coursework does, too.
When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
There are two basic kinds of questions that are always useful:
(Variation) What type of variation occurs within my variables?
(Covariation) What type of covariation occurs between my variables?
Variation in data is the tendency of values to change from measurement to measurement.
For example, the following observations from your mini project data are around 8 \(\mu g/m^3\), but each one is different.
| | PM25_mean | City | State | Year |
|---|---|---|---|---|
| 0 | 8.6 | Aberdeen | SD | 2000 |
| 1 | 8.6 | Aberdeen | SD | 2001 |
| 2 | 7.9 | Aberdeen | SD | 2002 |
| 3 | 8.4 | Aberdeen | SD | 2003 |
What does it mean to ask what ‘type’ of variation there is in this data?
There aren’t exact types of variation, but here are some useful questions:
(Common) Which values are most common?
(Rare) Which values are rare?
(Spread) How spread out are the values, and over what range?
(Shape) Are values spread out evenly or irregularly?
These questions often lead the way to more focused ones.
The following histogram shows the distribution of PM 2.5 concentrations across all 200 cities and 20 years.
It shows several statistical properties of the data related to variation:
The common values have the highest bars – values between roughly 6 and 14.
Values under 4 and over 18 are rare, accounting for under 5% of the data.
Values are concentrated between 4 and 18, but are spread from 2 to 52.
The shape is pretty even but a little more spread out to the right (“right skew”).
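Here's a sketch of how a histogram like this might be drawn, assuming the mini project data sit in a pandas DataFrame called `df` with a `PM25_mean` column (both names are assumptions):

```python
import matplotlib.pyplot as plt

# Histogram of the PM 2.5 city-year means; `df` is assumed to be loaded already
fig, ax = plt.subplots()
ax.hist(df['PM25_mean'], bins=30, edgecolor='white')
ax.set_xlabel(r'PM 2.5 ($\mu g/m^3$)')
ax.set_ylabel('Number of city-years')
plt.show()
```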
New question: The national standard is 12 micrograms per cubic meter. Over 1,000 measurements exceeded this. Was it just a few cities, or more widespread?
Many cities exceeded the standard at some point in time: over 70% of the cities in the dataset. So it was more widespread, but these were the worst:
| City | State | Years exceeding standard |
|---|---|---|
| Hanford-Corcoran | CA | 20 |
| Visalia-Porterville | CA | 20 |
| Fresno | CA | 19 |
| Bakersfield | CA | 18 |
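A tally like this might be computed along the following lines (a sketch, assuming the same `df` with `City`, `State`, `Year`, and `PM25_mean` columns):

```python
# Number of years each city's mean PM 2.5 exceeded the 12 microgram standard
exceed = (
    df[df['PM25_mean'] > 12]
    .groupby(['City', 'State'])
    .size()
    .rename('Years exceeding standard')
    .sort_values(ascending=False)
)
print(exceed.head(4))

# Share of cities that exceeded the standard in at least one year
print(len(exceed) / df.groupby(['City', 'State']).ngroups)
```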
How many cities exceed the benchmark each year? Does this change from year to year?
There are a lot of years and it’s hard to see anything with all the overlapping bars.
Visually, it’s a lot easier to distinguish overlapping lines than overlapping bars. Smoothing out the histogram produces this:
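One way to produce such overlapping smoothed curves is seaborn's `kdeplot` with one curve per year (a sketch, same assumed `df`):

```python
import seaborn as sns

# One smoothed density curve of PM 2.5 per year, overlaid on shared axes
sns.kdeplot(data=df, x='PM25_mean', hue='Year', palette='viridis')
```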
This shows that both the variation in PM 2.5 and the typical values are diminishing over time.
This gets the message across better than the overlapping bars did.
And here are those outlying values:
| | PM25_mean | City | State | Year |
|---|---|---|---|---|
| 1984 | 51.2 | Fairbanks | AK | 2004 |
| 1985 | 31.3 | Fairbanks | AK | 2005 |
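Rows like these can be pulled out with a simple query, for instance (a sketch):

```python
# The two largest PM 2.5 city-year means in the data
print(df.nlargest(2, 'PM25_mean'))
```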
All of the above has amounted to exploration of the distribution of PM 2.5 values across years and cities.
Density estimates provide smooth approximations of distributions:
These are useful tools for answering questions about variation: relative to the histogram, they are smooth and much easier to compare when several are overlaid.
From 120A, a probability density/mass function has two properties: (1) it is nonnegative everywhere, and (2) it integrates (or sums) to one.
Histograms are almost proper density functions: they satisfy (1) but not (2).
In what follows we’re going to express the histogram mathematically as a function of data values.
This will require the use of indicator functions, which are simply functions that are 1 or 0 depending on a condition. They are denoted like this:
\[ \mathbf{1}\{\text{condition}\} = \begin{cases} 1 &\text{ if condition is true} \\ 0 &\text{ if condition is false} \end{cases} \]
The sum of an indicator gives a count of how many times the condition is met:
\[ \sum_i \mathbf{1}\{x_i > 0\} = \#\text{ of values that are positive} \]
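In code, an indicator sum is just a comparison followed by a sum; a small numpy sketch:

```python
import numpy as np

x = np.array([-1.2, 0.5, 3.0, -0.1, 2.2])

# The comparison gives a boolean array (the indicator); summing it counts
# how many times the condition is met
print(np.sum(x > 0))  # 3
```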
When the bar height is a count of the number of observations in each bin, the histogram is on the count scale.
More precisely, if the values are \(x_1, \dots, x_n\), then the height of the bar for the \(j\)th bin \(B_j = (a_j, b_j]\) is: \[ \text{height}_j = \sum_{i = 1}^n \mathbf{1}\{x_i \in B_j\} \]
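Here's a direct translation of this formula into numpy, checked against `np.histogram` on the count scale (a sketch with simulated values standing in for the PM 2.5 data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=200)    # simulated stand-in for the data values
edges = np.arange(0, 21, 2)                  # bin endpoints a_j, b_j

# height_j = sum_i 1{x_i in B_j}, with half-open bins (a_j, b_j]
heights = np.array([np.sum((x > a) & (x <= b))
                    for a, b in zip(edges[:-1], edges[1:])])

# np.histogram closes bins on the other side, but for continuous data the counts agree
counts, _ = np.histogram(x, bins=edges)
print(heights, counts, sep='\n')
```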
Bad for comparisons: the bar heights are not comparable in scale whenever the bin widths and/or sample sizes differ.
A fix that ensures comparability of scale for any two histograms is to normalize heights by bin width \(b\) and sample size \(n\).
\[ \text{height}_j = \color{red}{\frac{1}{nb}} \sum_{i = 1}^n \mathbf{1}\{x_i \in B_j\} \quad\text{where}\quad b = b_j - a_j \]
Now the area under the histogram is \(\sum_j b \times \text{height}_j = 1\), so we call this a density scale histogram, because it is a valid probability density.
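Continuing the sketch above, dividing by \(nb\) reproduces `np.histogram(..., density=True)` and makes the total area one (provided no values fall outside the bins):

```python
# Density scale: divide the counts by n * b
n, b = len(x), edges[1] - edges[0]
density_heights = heights / (n * b)

density_np, _ = np.histogram(x, bins=edges, density=True)
print(np.allclose(density_heights, density_np))  # True if no values fall outside the bins

# Area under the density-scale histogram
print(np.sum(b * density_heights))               # 1.0 (again, if no values fall outside)
```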
Kernel density estimates are local smoothings of the density scale histogram.
This can be seen by comparing the type of smooth curve we saw earlier with the density scale histogram.
Technically, KDE is a form of convolution filtering. We can try to understand it in more intuitive terms by developing the idea constructively from the density scale histogram in two steps.
The value (height) of the density scale histogram at an arbitrary point \(\color{red}{x}\) is \[ \text{hist}(\color{red}{x}) = \frac{1}{nb} \sum_{i = 1}^n \sum_{j} \mathbf{1}\{\color{red}{x} \in B_j\} \mathbf{1}\{x_i \in B_j\} \]
Here’s what those indicators do:
\[ \mathbf{1}\{\color{red}{x} \in B_j\} \quad \text{finds the right bin}\;,\quad \mathbf{1}\{x_i \in B_j\} \quad \text{picks out the data points in the bin} \]
One could do a ‘moving window’ binning by allowing the height at \(\color{red}{x}\) to be a normalization of the count in a neighborhood of \(x\) of width \(b\) rather than in one of a fixed set of bins:
\[ \text{hist}_\text{local}(\color{red}{x}) = \frac{1}{nb} \sum_{i = 1}^n \mathbf{1}\left\{|x_i - \color{red}{x}| < \frac{b}{2}\right\} \]
Let’s call this a local histogram, because the height at any point \(\color{red}{x}\) is determined relative to the exact location of \(\color{red}{x}\).
Here’s what that would look like with \(b = 1\) for the air quality data:
Zooming in reveals that this is still a step function:
The local histogram is in fact a density estimate with a uniform ‘kernel’: \[ \text{hist}_\text{local}(\color{red}{x}) = \frac{1}{n} \sum_{i = 1}^n \underbrace{\frac{1}{b}\mathbf{1}\left\{|x_i - \color{red}{x}| < \frac{b}{2}\right\}}_\text{kernel function} \]
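A direct implementation of the local (uniform-kernel) estimate might look like this (a sketch; `x` is any array of data values):

```python
import numpy as np

def hist_local(grid, x, b):
    """Uniform-kernel density estimate with bandwidth b, evaluated at each point of grid."""
    x = np.asarray(x)
    n = len(x)
    # At each evaluation point, count the data within b/2 on either side and scale by 1/(n*b)
    return np.array([np.sum(np.abs(x - g) < b / 2) / (n * b) for g in grid])
```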
Replacing the uniform kernel with a Gaussian kernel yields a smooth density estimate: \[ \hat{f}(\color{red}{x}) = \frac{1}{n} \sum_{i = 1}^n \frac{1}{b}\varphi\left(\frac{x_i - \color{red}{x}}{b}\right) \]
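In code, this amounts to swapping the indicator for the standard normal density \(\varphi\) (a sketch using `scipy.stats.norm`):

```python
import numpy as np
from scipy.stats import norm

def kde_gaussian(grid, x, b):
    """Gaussian-kernel density estimate with bandwidth b, evaluated at each point of grid."""
    x = np.asarray(x)
    grid = np.asarray(grid)
    # (1/n) * sum_i (1/b) * phi((x_i - grid_point) / b), vectorized over the grid
    return norm.pdf((x[:, None] - grid[None, :]) / b).sum(axis=0) / (len(x) * b)
```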
In effect, the KDE curve at any point \(\color{red}{x}\) is a weighted aggregation of all the data, with weights that decrease with each point's distance from \(\color{red}{x}\).
The bandwidth parameter \(b\) controls how wiggly the KDE curve is.
The choice of smoothing bandwidth can change the visual impression.
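For example, seaborn lets you scale its default bandwidth up or down to see the effect (a sketch, same assumed `df`):

```python
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# bw_adjust multiplies the default bandwidth: < 1 is wigglier, > 1 is smoother
sns.kdeplot(data=df, x='PM25_mean', bw_adjust=0.25, ax=ax, label='small bandwidth')
sns.kdeplot(data=df, x='PM25_mean', bw_adjust=2.0, ax=ax, label='large bandwidth')
ax.legend()
```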
In general, a KDE can be computed with any appropriately normalized nonnegative kernel function \(K_b(\cdot)\).
\[ \hat{f}_{K_b}(\color{red}{x}) = \frac{1}{n} \sum_{i = 1}^n K_b\left(x_i - \color{red}{x}\right) \]
Other common kernel functions include the Epanechnikov (parabolic), triangular, and biweight (quartic) kernels.
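Two of these, written as kernels normalized so that \(K_b(u) = K(u/b)/b\) integrates to one (a sketch):

```python
import numpy as np

def epanechnikov(u, b=1.0):
    """Epanechnikov kernel: K_b(u) = (3/4)(1 - (u/b)^2)/b for |u/b| <= 1, else 0."""
    z = np.asarray(u) / b
    return np.where(np.abs(z) <= 1, 0.75 * (1 - z**2), 0.0) / b

def triangular(u, b=1.0):
    """Triangular kernel: K_b(u) = (1 - |u/b|)/b for |u/b| <= 1, else 0."""
    z = np.asarray(u) / b
    return np.where(np.abs(z) <= 1, 1 - np.abs(z), 0.0) / b
```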
Usually for multivariate data it’s easier to work with conditional distributions, but KDE can be generalized to estimating joint densities in \(p\) dimensions:
\[ \hat{f}_K (\mathbf{x}) = \frac{1}{n} \sum_i |\mathbf{B}|^{-1/2} K \left(\mathbf{B}^{-1/2}(\mathbf{x} - \mathbf{x}_i)\right) \]
The usual approach is to decorrelate the variates and apply a product kernel \(K(\mathbf{z}) = K_1(z_1)K_2(z_2)\cdots K_p(z_p)\) with separate bandwidths for each dimension.
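In practice, a bivariate estimate of this kind is often computed with an off-the-shelf routine such as `scipy.stats.gaussian_kde`, which chooses a bandwidth matrix from the data (a sketch with simulated variates):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = 0.5 * x1 + rng.normal(size=300)       # correlated second variate

# gaussian_kde expects a (dimensions, observations) array
kde2d = gaussian_kde(np.vstack([x1, x2]))
print(kde2d([[0.0], [0.0]]))               # estimated joint density at the point (0, 0)
```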
Does race time seem correlated with runner’s age?