PSTAT100 - Data science lifecycle

What’s data science?

Data science is a term of art encompassing a wide range of activities that involve uncovering insights from quantitative information.

People who refer to themselves as data scientists typically combine specific interests (“domain knowledge”, e.g., biology) with computation, mathematics, statistics, and probability to contribute to knowledge in their communities.

Intersectional in nature
No singular disciplinary background among practitioners

Data science lifecycle

Data science lifecycle: an end-to-end process resulting in a data analysis product

Question formulation
Data collection and cleaning
Exploration
Analysis

These form a cycle in the sense that the steps are iterated for question refinement and further discovery.

Data science lifecycle

The point isn’t really the exact steps, but rather the notion of an iterative process.

Starting with a question

The scaling of brains with bodies is thought to contain clues about evolutionary patterns pertaining to intelligence.

There are lots of datasets out there with brain and body weight measurements, so let’s consider the question:

What is the relationship between an animal’s brain and body weight?

Data acquisition

From Allison et al. 1976, average body and brain weights for 62 mammals.

	species	body_wt	brain_wt
0	Africanelephant	6654.000	5712.0
1	Africangiantpouchedrat	1.000	6.6
2	ArcticFox	3.385	44.5

Units of measurement

body weight in kilograms
brain weight in grams

Data assessment

How well-matched is the data to our question?

Mammals only (no birds, fish, reptiles, etc.)
Species are those for which convenient specimens were available
Averages across specimens are reported (‘aggregated’ data)

What do you think? Take a moment to discuss with your neighbor.

Data assessment

Based on the great points you just made, we really only stand to learn something about this particular sample of animals.

In other words, no inference is possible.

Do you think the data are still useful?

Inspection

This dataset is already impeccably neat: each row is an observation for one species of mammal, and the columns hold the species name and the two variables (average body weight and average brain weight).

So no tidying needed – we’ll just check the dimensions and see if any values are missing.

# dimensions?
bb_weights.shape

(62, 3)

# missing values?
bb_weights.isna().sum(axis = 0)

species     0
body_wt     0
brain_wt    0
dtype: int64

Exploration

Visualization is usually a good starting point for exploring data.

Notice the apparent density of points near the origin: that suggests we shouldn’t look for a relationship on the scale of kg/g.

Exploration

A simple transformation of the axes reveals a clearer pattern.
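
One way to carry out such a transformation is to take logarithms of both variables before plotting. The sketch below is an assumption about how this might be done, not the code behind the figure: it rebuilds only the three rows displayed earlier, uses the natural log, and the `bb_preview`, `log_body`, and `log_brain` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Only the three rows displayed earlier; the full dataset has 62 species.
bb_preview = pd.DataFrame({
    'species': ['Africanelephant', 'Africangiantpouchedrat', 'ArcticFox'],
    'body_wt': [6654.000, 1.000, 3.385],   # kilograms
    'brain_wt': [5712.0, 6.6, 44.5],       # grams
})

# Natural log of each variable (hypothetical column names).
bb_preview['log_body'] = np.log(bb_preview['body_wt'])
bb_preview['log_brain'] = np.log(bb_preview['brain_wt'])

# Spread of body weight before and after the transformation.
raw_range = bb_preview['body_wt'].max() - bb_preview['body_wt'].min()
log_range = bb_preview['log_body'].max() - bb_preview['log_body'].min()
print(bb_preview)
print(raw_range, log_range)
```

On the raw scale the elephant sits thousands of kilograms away from the other two species; on the log scale the same three animals span fewer than ten units, so small and large animals can be read off the same plot.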

Analysis

The plot shows us that there’s a roughly linear relationship on the log scale:

$$\log(\text{brain}) = \alpha \log(\text{body}) + c$$

So what does that mean in terms of brain and body weights? A little algebra and we have a “power law”:

$$\text{brain} \propto \text{body}^{\alpha}$$

Check your understanding: what’s the proportionality constant?

Interpretation

So it appears that the brain-body scaling is well-described by a power law:

among selected specimens of these 62 species of mammal, species average brain weight is approximately proportional to a power of species average body weight

Notice that I did not say:

animals’ brains are proportional to a power of their bodies
among these 62 mammals, average brain weight is approximately proportional to a power of average body weight

Question refinement

We can now ask further, more specific questions:

Do other types of animals exhibit the same power law relationship?

To investigate, we need richer data.

(More) data acquisition

A number of authors have compiled and published ‘meta-analysis’ datasets by combining the results of multiple studies.

Below we’ll import a few of these for three different animal classes.

import pandas as pd

# import meta-analysis datasets
reptiles = pd.read_csv('data/reptile_meta.csv')
birds = pd.read_csv('data/bird_meta.csv', encoding = 'latin1')
mammals = pd.read_csv('data/mammal_meta.csv', encoding = 'latin1')

Data assessment

Where does this data come from? It’s kind of a convenience sample of scientific data:

Multiple studies: possibly different sampling and measurement protocols
Criteria for inclusion unknown: probably neither comprehensive nor representative of all such measurements taken

So these data, while richer, are still relatively narrow in terms of generalizability.

A comment on scope of inference

These data don’t support general inferences (e.g., to all animals, all mammals, etc.) because they weren’t collected for the purpose to which we’re putting them.

Usually, if data are not collected for the explicit purpose of the question you’re trying to answer, they won’t constitute a representative sample.

Tidying

Back to the task at hand: in order to combine the datasets, one must:

Select columns of interest;
Put in consistent order;
Give consistent names;
Concatenate row-wise.

We’ll skip the details for now.
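
For readers who want a preview anyway, the four steps might look roughly like the sketch below. Everything in it is a stand-in: the raw column names, the placeholder values, and the `tidy` helper are hypothetical and are not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for two of the raw files. Column names and values are
# placeholders, not taken from the real datasets.
birds_raw = pd.DataFrame({
    'Species': ['bird_a', 'bird_b'],
    'brain_mass_g': [1.2, None],
    'body_mass_g': [30.0, 45.0],
})
reptiles_raw = pd.DataFrame({
    'species_name': ['reptile_a'],
    'body_g': [120.0],
    'brain_g': [0.4],
})

def tidy(raw, columns, class_label):
    """Select columns, order them, rename them, and label the class.

    `columns` maps each raw column name to its common name; the order
    of its keys sets the column order of the result.
    """
    out = raw[list(columns)].rename(columns=columns)
    return out.assign(**{'class': class_label})

data_sketch = pd.concat(
    [
        tidy(birds_raw,
             {'Species': 'Species', 'body_mass_g': 'body', 'brain_mass_g': 'brain'},
             'Bird'),
        tidy(reptiles_raw,
             {'species_name': 'Species', 'body_g': 'body', 'brain_g': 'brain'},
             'Reptile'),
    ],
    ignore_index=True,  # row-wise concatenation with a fresh index
)
print(data_sketch)
```

Keeping the rename mapping next to each dataset makes it easy to see which raw column ends up under which common name.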

Inspection

This dataset has quite a lot of missing brain weight measurements: many of the studies combined to form these datasets did not include that particular measurement.

# missing values?
data.isna().mean(axis = 0)

Order      0.00000
Family     0.00000
Genus      0.00000
Species    0.00000
Sex        0.00000
body       0.00000
brain      0.57404
class      0.00000
dtype: float64
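
A natural follow-up is to ask whether the missing brain weights are spread evenly across classes, and then to set aside the incomplete rows. The sketch below shows one way this might be done on a toy frame; `data_toy` and its values are placeholders, so the proportions it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined data; values are placeholders.
data_toy = pd.DataFrame({
    'class': ['Bird', 'Bird', 'Mammal', 'Mammal', 'Reptile'],
    'body': [30.0, 45.0, 2000.0, 150.0, 120.0],
    'brain': [1.2, np.nan, 20.0, np.nan, np.nan],
})

# Proportion of missing brain weights within each class.
missing_by_class = data_toy['brain'].isna().groupby(data_toy['class']).mean()

# Keep only the rows where brain weight was recorded.
complete = data_toy.dropna(subset=['brain'])

print(missing_by_class)
print(complete.shape)
```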

Exploration

Focusing on the nonmissing values, we see the same power law relationship but with different proportionality constants and exponents for the three classes of animals.

Analysis

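So we might hypothesize that each class follows its own power law:

$$\text{brain} = \beta_j \, \text{body}^{\alpha_j}, \qquad j \in \{\text{bird}, \text{mammal}, \text{reptile}\}$$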

Interpretation

It seems that the average brain and body weights of the birds, mammals, and reptiles measured in these studies exhibit distinct power law relationships.
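
To make “distinct power law relationships” concrete, the sketch below converts a line on the log scale into a power law for each class. It assumes that the coefficient table at the end of these notes reports intercepts (Bird, Mammal, Reptile) and slopes (log.body.*) from a fit of log brain weight on log body weight using natural logs. The notes do not state this, so treat the printed constants as illustrative; `to_power_law` is a hypothetical helper.

```python
import math

# (intercept, slope) per class, copied from the coefficient table at the
# end of these notes. Assumed to describe log(brain) against log(body)
# with natural logs; that is an assumption, not something the notes state.
log_scale_fits = {
    'Bird': (-1.9574, 0.5653),
    'Mammal': (-2.9391, 0.7651),
    'Reptile': (-4.0335, 0.5293),
}

def to_power_law(intercept, slope):
    """Rewrite log(brain) = intercept + slope * log(body) as
    brain = constant * body ** exponent."""
    return math.exp(intercept), slope

power_laws = {cls: to_power_law(*fit) for cls, fit in log_scale_fits.items()}

for cls, (constant, exponent) in power_laws.items():
    print(f'{cls}: brain = {constant:.4f} * body^{exponent}')

# Sanity check: both forms agree at an arbitrary body weight.
body = 10.0
b0, b1 = log_scale_fits['Mammal']
k, a = power_laws['Mammal']
assert math.isclose(math.exp(b0 + b1 * math.log(body)), k * body ** a)
```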

What would you investigate next?

Correlates of body weight?
Adjust for lifespan, habitat, predation, etc.?
Estimate the exponents and proportionality constants?
Predict brain weights for unobserved species?
Something else?
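
As an illustration of the estimation option in the list above, an exponent and a constant can be obtained from a least squares line on the log scale. The sketch below uses `numpy.polyfit` on only the three mammals displayed from the first dataset, so the numbers it prints mean very little; it shows the mechanics under that assumption and is not a fit to any of the full datasets.

```python
import numpy as np

# The three rows displayed from the first dataset (kilograms, grams).
body_wt = np.array([6654.000, 1.000, 3.385])
brain_wt = np.array([5712.0, 6.6, 44.5])

# Degree-1 least squares fit on the log scale; polyfit returns the
# highest-degree coefficient first, i.e. (slope, intercept).
exponent, log_constant = np.polyfit(np.log(body_wt), np.log(brain_wt), deg=1)
constant = np.exp(log_constant)

# Fitted brain weights back on the original scale.
fitted = constant * body_wt ** exponent
print(exponent, constant)
```

With a complete dataset, the same call could be repeated within each class after dropping rows with missing brain weights.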

A comment

Notice that I did not mention the word ‘model’ anywhere!

This was intentional – it is a common misconception that analyzing data always involves fitting models.

Models are not always necessary or appropriate
You can learn a lot from exploratory techniques
Models approximate specific kinds of relationships in data
Exploratory analysis can reveal unexpected structure

Model limitations

Back to the issue of representativeness:

shouldn’t use this model for inferences
might not be reliable for prediction either
but does capture/convey some suggestive comparisons

So, just be careful with interpretation of results:

“For this particular collection of specimens, we estimated…”

Zooming out

This example illustrates the aspects of the lifecycle we’ll cover in this class:

data retrieval and import
tidying and transformation
visualization
exploratory analysis
modeling

We’ll address these topics in sequence.

Next week

Tabular data structure
Data semantics
Tidy data
Transformations of tabular data
Aggregation and grouping

	coef	std err	t	P>|t|	[0.025	0.975]
Bird	-1.9574	0.040	-49.118	0.000	-2.036	-1.879
Mammal	-2.9391	0.029	-100.061	0.000	-2.997	-2.882
Reptile	-4.0335	0.083	-48.577	0.000	-4.196	-3.871
log.body.bird	0.5653	0.008	66.566	0.000	0.549	0.582
log.body.mammal	0.7651	0.004	191.544	0.000	0.757	0.773
log.body.reptile	0.5293	0.017	31.375	0.000	0.496	0.562