body_wt | brain_wt | |
---|---|---|
species | ||
Africanelephant | 6654.0 | 5712.0 |
Africangiantpouchedrat | 1.0 | 6.6 |
Invalid Date
fruit_info
section) of Lab 1 before sectionMost data are stored in tables, but there are always multiple possible tabular layouts for the same underlying data.
Let’s look at some examples.
Below is the Allison 1976 mammal brain-body weight dataset from last time shown in two ‘long’ layouts:
body_wt | brain_wt | |
---|---|---|
species | ||
Africanelephant | 6654.0 | 5712.0 |
Africangiantpouchedrat | 1.0 | 6.6 |
measurement | weight | |
---|---|---|
species | ||
Africanelephant | brain_wt | 5712.0 |
Africanelephant | body_wt | 6654.0 |
Africangiantpouchedrat | brain_wt | 6.6 |
Africangiantpouchedrat | body_wt | 1.0 |
Here’s a third possible layout for the mammal brain-body weight data:
species | Africanelephant | Africangiantpouchedrat | ArcticFox | Arcticgroundsquirrel |
---|---|---|---|---|
measurement | ||||
body_wt | 6654.0 | 1.0 | 3.385 | 0.92 |
brain_wt | 5712.0 | 6.6 | 44.500 | 5.70 |
Here’s another example: World Bank data on annual GDP growth for 264 countries from 1961 – 2019.
Country Name | Country Code | 2009 | 2010 | 2011 | |
---|---|---|---|---|---|
0 | Aruba | ABW | -10.519749 | -3.685029 | 3.446055 |
1 | Afghanistan | AFG | 21.390528 | 14.362441 | 0.426355 |
2 | Angola | AGO | 0.858713 | 4.403933 | 3.471976 |
3 | Albania | ALB | 3.350067 | 3.706892 | 2.545322 |
4 | Andorra | AND | -5.302847 | -1.974958 | -0.008070 |
Here’s an alternative layout for the annual GDP growth data:
year | growth_pct | |
---|---|---|
Country Name | ||
Afghanistan | 2009 | 21.390528 |
Aruba | 2009 | -10.519749 |
Afghanistan | 2010 | 14.362441 |
Aruba | 2010 | -3.685029 |
Afghanistan | 2011 | 0.426355 |
Aruba | 2011 | 3.446055 |
A third example: daily minimum and maximum temperatures recorded at Santa Barbara Municipal Airport from January 2021 through March 2021.
STATION | TMAX | TMIN | MONTH | DAY | YEAR | |
---|---|---|---|---|---|---|
0 | USW00023190 | 65 | 37 | 1 | 1 | 2021 |
1 | USW00023190 | 62 | 38 | 1 | 2 | 2021 |
2 | USW00023190 | 60 | 42 | 1 | 3 | 2021 |
Here’s a wide layout for the SB weather data:
DAY | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
MONTH | type | ||||
1 | TMAX | 65.0 | 62.0 | 60.0 | 72.0 |
TMIN | 37.0 | 38.0 | 42.0 | 43.0 | |
2 | TMAX | 66.0 | 67.0 | 69.0 | 63.0 |
TMIN | 45.0 | 40.0 | 44.0 | 37.0 | |
3 | TMAX | 68.0 | 66.0 | 59.0 | 62.0 |
A final example: United Nations country development data organized into different tables according to variable type.
Here is a table of population measurements:
total_pop | urban_pct_pop | pop_under5 | pop_15to64 | pop_over65 | |
---|---|---|---|---|---|
country | |||||
Afghanistan | 38.0 | 25.8 | 5.6 | 20.9 | 1.0 |
Albania | 2.9 | 61.2 | 0.2 | 2.0 | 0.4 |
And here is a table of a few gender-related variables:
gender_inequality | parliament_pct_women | labor_participation_women | labor_participation_men | |
---|---|---|---|---|
country | ||||
Norway | 0.045 | 40.8 | 60.4 | 67.2 |
Ireland | 0.093 | 24.3 | 56.0 | 68.4 |
Here are both tables merged by country:
total_pop | urban_pct_pop | pop_under5 | pop_15to64 | pop_over65 | gender_inequality | parliament_pct_women | labor_participation_women | labor_participation_men | |
---|---|---|---|---|---|---|---|---|---|
country | |||||||||
Afghanistan | 38.0 | 25.8 | 5.6 | 20.9 | 1.0 | 0.655 | 27.2 | 21.6 | 74.7 |
Albania | 2.9 | 61.2 | 0.2 | 2.0 | 0.4 | 0.181 | 29.5 | 46.7 | 64.6 |
Algeria | 43.1 | 73.2 | 5.0 | 27.1 | 2.8 | 0.429 | 21.5 | 14.6 | 67.4 |
And here is another arrangement of the merged table:
gender_variable | gender_value | population_variable | population_value | |
---|---|---|---|---|
country | ||||
Afghanistan | gender_inequality | 0.655 | total_pop | 38.0 |
Albania | gender_inequality | 0.181 | total_pop | 2.9 |
Algeria | gender_inequality | 0.429 | total_pop | 43.1 |
Andorra | gender_inequality | NaN | total_pop | 0.1 |
Angola | gender_inequality | 0.536 | total_pop | 31.8 |
In short, the alternate layouts differ in three respects:
Return to one of the examples and review the different layouts with your neighbor.
It’s surprisingly difficult to articulate reasons why one layout might be preferable to another.
As a result:
Because of the wide range of possible layouts for a dataset, and the variety of choices that are made about how to store data, data scientists are constantly faced with determining how best to reorganize datasets in a way that facilitates exploration and analysis.
Broadly, this involves two interdependent choices:
The tidy data standard is a principled way of organizing tabular data. It has two main advantages:
“Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored.” Wickham and Grolemund, R for Data Science, 2017.
A dataset is a collection of values.
the semantics of a dataset are the meanings of the values
the structure of a dataset is the arrangement of the values
To introduce some general vocabulary, each value in a dataset is
total_pop | urban_pct_pop | |
---|---|---|
country | ||
Afghanistan | 38.0 | 25.8 |
Albania | 2.9 | 61.2 |
Let’s do an example. Here’s one record from the GDP growth data:
year | growth_pct | |
---|---|---|
Country Name | ||
Afghanistan | 2010 | 14.362441 |
Above, the values -13.605441 and 1961 are observations of the variables GDP growth and year recorded for the observational unit Algeria.
What are the units, variables and observations?
DAY | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
MONTH | type | ||||
1 | TMAX | 65.0 | 62.0 | 60.0 | 72.0 |
TMIN | 37.0 | 38.0 | 42.0 | 43.0 | |
2 | TMAX | 66.0 | 67.0 | 69.0 | 63.0 |
TMIN | 45.0 | 40.0 | 44.0 | 37.0 |
Think about it, then confer with your neighbor.
Data structure refers to the form in which it is stored.
Tabular data is arranged in rows and columns.
As we saw, there are multiple structures – arrangements of rows and columns – available to represent any dataset.
The tidy standard consists in matching semantics and structure. A dataset is tidy if:
Let’s revisit some of our examples of multiple layouts.
Country Name | Country Code | 2009 | 2010 | 2011 | |
---|---|---|---|---|---|
0 | Aruba | ABW | -10.519749 | -3.685029 | 3.446055 |
1 | Afghanistan | AFG | 21.390528 | 14.362441 | 0.426355 |
2 | Angola | AGO | 0.858713 | 4.403933 | 3.471976 |
We can compare the semantics and structure for alignment:
Semantics | Structure | ||
---|---|---|---|
Observations | Annual records | Rows | Countries |
Variables | GDP growth and year | Columns | Value of year |
Observational units | Countries | Tables | Just one |
Rules 1 and 2 are violated, since column names are values (of year), not variables. Not tidy.
year | growth_pct | |
---|---|---|
Country Name | ||
Afghanistan | 2009 | 21.390528 |
Aruba | 2009 | -10.519749 |
Afghanistan | 2010 | 14.362441 |
Aruba | 2010 | -3.685029 |
Comparison of semantics and structure:
Semantics | Structure | ||
---|---|---|---|
Observations | Annual records | Rows | Annual records |
Variables | GDP growth and year | Columns | GDP growth and year |
Observational units | Countries | Tables | Just one |
All three rules are met: rows are observations, columns are variables, and there’s one unit type and one table. Tidy.
STATION | TMAX | TMIN | MONTH | DAY | YEAR | |
---|---|---|---|---|---|---|
0 | USW00023190 | 65 | 37 | 1 | 1 | 2021 |
1 | USW00023190 | 62 | 38 | 1 | 2 | 2021 |
2 | USW00023190 | 60 | 42 | 1 | 3 | 2021 |
Try this one on your own. Then compare with your neighbor.
In undev1
and undev2
:
total_pop | urban_pct_pop | pop_under5 | pop_15to64 | pop_over65 | |
---|---|---|---|---|---|
country | |||||
Afghanistan | 38.0 | 25.8 | 5.6 | 20.9 | 1.0 |
Albania | 2.9 | 61.2 | 0.2 | 2.0 | 0.4 |
gender_inequality | parliament_pct_women | labor_participation_women | labor_participation_men | |
---|---|---|---|---|
country | ||||
Norway | 0.045 | 40.8 | 60.4 | 67.2 |
Ireland | 0.093 | 24.3 | 56.0 | 68.4 |
Here there are multiple tables. To discuss:
“Well, here’s another nice mess you’ve gotten me into” – Oliver Hardy
These examples illustrate some common messes:
These common messes can be cleaned up by some simple operations:
Melting resolves the problem of having values stored as columns (common mess 1).
Country Name | Country Code | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -3.685029 | 3.446055 | -1.369863 | 4.198232 | 0.300000 | 5.700001 | 2.100000 | 1.999999 | NaN | NaN |
1 | Afghanistan | AFG | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 14.362441 | 0.426355 | 12.752287 | 5.600745 | 2.724543 | 1.451315 | 2.260314 | 2.647003 | 1.189228 | 3.911603 |
2 rows × 61 columns
# in pandas
gdp1.melt(
id_vars = ['Country Name', 'Country Code'], # which variables do you want to retain for each row? .
var_name = 'Year', # what do you want to name the variable that will contain the column names?
value_name = 'GDP Growth', # what do you want to name the variable that will contain the values?
).head(2)
Country Name | Country Code | Year | GDP Growth | |
---|---|---|---|---|
0 | Aruba | ABW | 1961 | NaN |
1 | Afghanistan | AFG | 1961 | NaN |
Pivoting resolves the issue of having multiple variables stored in one column (common mess 2). It’s the inverse operation of melting.
measurement | weight | |
---|---|---|
species | ||
Africanelephant | brain_wt | 5712.0 |
Africanelephant | body_wt | 6654.0 |
Africangiantpouchedrat | brain_wt | 6.6 |
Africangiantpouchedrat | body_wt | 1.0 |
# in pandas
mammal2.pivot(
columns = 'measurement', # which variable(s) do you want to send to new column names?
values = 'weight' # which variable(s) do you want to use to populate the new columns?
).head(2)
measurement | body_wt | brain_wt |
---|---|---|
species | ||
Africanelephant | 6654.0 | 5712.0 |
Africangiantpouchedrat | 1.0 | 6.6 |
Common mess 3 is a combination of messes 1 and 2: values or variables are stored in both rows and columns. Pivoting and melting in sequence can usually fix this.
DAY | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MONTH | type | |||||||||||||||||||||
1 | TMAX | 65.0 | 62.0 | 60.0 | 72.0 | 61.0 | 71.0 | 73.0 | 79.0 | 71.0 | 67.0 | ... | 61.0 | 59.0 | 65.0 | 55.0 | 57.0 | 54.0 | 55.0 | 55.0 | 58.0 | 63.0 |
TMIN | 37.0 | 38.0 | 42.0 | 43.0 | 40.0 | 39.0 | 38.0 | 36.0 | 39.0 | 37.0 | ... | 41.0 | 40.0 | 38.0 | 44.0 | 40.0 | 48.0 | 49.0 | 42.0 | 37.0 | 37.0 | |
2 | TMAX | 66.0 | 67.0 | 69.0 | 63.0 | 66.0 | 68.0 | 60.0 | 57.0 | 59.0 | 61.0 | ... | 75.0 | 75.0 | 70.0 | 66.0 | 69.0 | 76.0 | 68.0 | NaN | NaN | NaN |
TMIN | 45.0 | 40.0 | 44.0 | 37.0 | 38.0 | 38.0 | 38.0 | 49.0 | 49.0 | 41.0 | ... | 37.0 | 39.0 | 41.0 | 39.0 | 36.0 | 43.0 | 38.0 | NaN | NaN | NaN | |
3 | TMAX | 68.0 | 66.0 | 59.0 | 62.0 | 67.0 | 69.0 | 60.0 | 69.0 | 65.0 | 58.0 | ... | 71.0 | 72.0 | 67.0 | 65.0 | 63.0 | 72.0 | 73.0 | 77.0 | NaN | NaN |
TMIN | 37.0 | 36.0 | 36.0 | 37.0 | 39.0 | 43.0 | 47.0 | 47.0 | 47.0 | 43.0 | ... | 50.0 | 49.0 | 41.0 | 44.0 | 40.0 | 41.0 | 41.0 | 42.0 | NaN | NaN |
6 rows × 31 columns
weather3.melt(
ignore_index = False,
var_name = 'day',
value_name = 'temp'
).reset_index().pivot(
index = ['MONTH', 'day'],
columns = 'type',
values = 'temp'
).reset_index().rename_axis(columns = {'type': ''}).head()
MONTH | day | TMAX | TMIN | |
---|---|---|---|---|
0 | 1 | 1 | 65.0 | 37.0 |
1 | 1 | 2 | 62.0 | 38.0 |
2 | 1 | 3 | 60.0 | 42.0 |
3 | 1 | 4 | 72.0 | 43.0 |
4 | 1 | 5 | 61.0 | 40.0 |
Merging resolves the issue of storing observations or variables on one unit type in multiple tables (mess 4). The basic idea is to combine by matching rows.
The code below combines columns in each table by matching rows based on country.
total_pop | urban_pct_pop | pop_under5 | pop_15to64 | pop_over65 | gender_inequality | parliament_pct_women | labor_participation_women | labor_participation_men | |
---|---|---|---|---|---|---|---|---|---|
country | |||||||||
Afghanistan | 38.0 | 25.8 | 5.6 | 20.9 | 1.0 | 0.655 | 27.2 | 21.6 | 74.7 |
Albania | 2.9 | 61.2 | 0.2 | 2.0 | 0.4 | 0.181 | 29.5 | 46.7 | 64.6 |
Algeria | 43.1 | 73.2 | 5.0 | 27.1 | 2.8 | 0.429 | 21.5 | 14.6 | 67.4 |
Andorra | 0.1 | 88.0 | NaN | NaN | NaN | NaN | 46.4 | NaN | NaN |
There are various rules for exactly how to merge, but the general syntactical procedure to merge dataframes df1
and df2
is this.
Specify an order: merge(df1, df2)
or merge(df2, df1)
.
Specify keys: the shared columns to use for matching rows of df1
with rows of df2
.
date
will align rows in df2
with rows of df1
that have the same value for date
Specify a rule for which rows to return after merging
df1
, drop non-matching rows in df2
(‘left’ join)df2
drop non-matching rows in df1
(‘right’ join)df1
or df2
, inducing missing values (‘outer’ join)df1
and df2
(‘inner’ join)Transformations of tabular data