species | body_wt | brain_wt | |
---|---|---|---|
0 | Africanelephant | 6654.000 | 5712.0 |
1 | Africangiantpouchedrat | 1.000 | 6.6 |
2 | ArcticFox | 3.385 | 44.5 |
Invalid Date
Data science is a term of art encompassing a wide range of activities that involve uncovering insights from quantitative information.
People that refer to themselves as data scientists typically combine specific interests (“domain knowledge”, e.g., biology) with computation, mathematics, and statistics and probability to contribute to knowledge in their communities.
Data science lifecycle: an end-to-end process resulting in a data analysis product
These form a cycle in the sense that the steps are iterated for question refinement and futher discovery.
The point isn’t really the exact steps, but rather the notion of an iterative process.
The scaling of brains with bodies is thought to contain clues about evolutionary patterns pertaining to intelligence.
There are lots of datasets out there with brain and body weight measurements, so let’s consider the question:
What is the relationship between an animal’s brain and body weight?
From Allison et al. 1976, average body and brain weights for 62 mammals.
species | body_wt | brain_wt | |
---|---|---|---|
0 | Africanelephant | 6654.000 | 5712.0 |
1 | Africangiantpouchedrat | 1.000 | 6.6 |
2 | ArcticFox | 3.385 | 44.5 |
Units of measurement
How well-matched is the data to our question?
What do you think? Take a moment to discuss with your neighbor.
Based on the great points you just made, we really only stand to learn something about this particular sample of animals.
In other words, no inference is possible.
Do you think the data are still useful?
This dataset is already impeccably neat: each row is an observation for some species of mammal, and the columns are the two variables (average weight).
So no tidying needed – we’ll just check the dimensions and see if any values are missing.
Visualization is usually a good starting point for exploring data.
Notice the apparent density of points near \((0, 0)\) – that suggests we shouldn’t look for a relationship on the scale of kg/g.
A simple transformation of the axes reveals a clearer pattern.
The plot shows us that there’s a roughly linear relationship on the log scale:
\[\log(\text{brain}) = \alpha \log(\text{body}) + c\]
So what does that mean in terms of brain and body weights? A little algebra and we have a “power law”:
\[(\text{brain}) \propto (\text{body})^\alpha\]
Check your understanding: what’s the proportionality constant?
So it appears that the brain-body scaling is well-described by a power law:
among selected specimens of these 62 species of mammal, species average brain weight is approximately proportional to a power of species average body weight
Notice that I did not say:
We can now ask further, more specific questions:
Do other types of animals exhibit the same power law relationship?
To investigate, we need richer data.
A number of authors have compiled and published ‘meta-analysis’ datasets by combining the results of multiple studies.
Below we’ll import a few of these for three different animal classes.
Where does this data come from? It’s kind of a convenience sample of scientific data:
So these data, while richer, are still relatively narrow in terms of generalizability.
These data don’t support general inferences (e.g., to all animals, all mammals, etc.) because they weren’t collected for the purpose to which we’re putting them.
Usually, if data are not collected for the explicit purpose of the question you’re trying to answer, they won’t constitute a representative sample.
Back to the task at hand, in order to comine the datasets one must:
We’ll skip the details for now.
This dataset has quite a lot of missing brain weight measurements: many of the studies combined to form these datasets did not include that particular measurement.
Focusing on the nonmissing values, we see the same power law relationship but with different proportionality constants and exponents for the three classes of animals.
So we might hypothesize that:
\[ (\text{brain}) = \beta_1(\text{body})^{\alpha_1} \qquad \text{(mammal)} \\ (\text{brain}) = \beta_2(\text{body})^{\alpha_2} \qquad \text{(reptile)} \\ (\text{brain}) = \beta_3(\text{body})^{\alpha_3} \qquad \text{(bird)} \\ \beta_i \neq \beta_j, \alpha_i \neq \alpha_j \quad \text{for } i \neq j \]
It seems that the average brain and body weights of the birds, mammals, and reptiles measured in these studies exhibit distinct power law relationships.
What would you investigate next?
Notice that I did not mention the word ‘model’ anywhere!
This was intentional – it is a common misconception that analyzing data always involves fitting models.
\((\text{brain}) = \beta_j(\text{body})^{\alpha_j} \quad \text{animal class } j = 1, 2, 3\)
coef | std err | t | P>|t| | [0.025 | 0.975] | |
Bird | -1.9574 | 0.040 | -49.118 | 0.000 | -2.036 | -1.879 |
Mammal | -2.9391 | 0.029 | -100.061 | 0.000 | -2.997 | -2.882 |
Reptile | -4.0335 | 0.083 | -48.577 | 0.000 | -4.196 | -3.871 |
log.body.bird | 0.5653 | 0.008 | 66.566 | 0.000 | 0.549 | 0.582 |
log.body.mammal | 0.7651 | 0.004 | 191.544 | 0.000 | 0.757 | 0.773 |
log.body.reptile | 0.5293 | 0.017 | 31.375 | 0.000 | 0.496 | 0.562 |
Back to the issue of representativeness:
So, just be careful with interpretation of results:
“For this particular collection of specimens, we estimated…”
This example illustrates the aspects of the lifecylce we’ll cover in this class:
We’ll address these topics in sequence.