Dataframe Transformations

PSTAT100 Spring 2023

Invalid Date

Recap: tidy data

The tidy standard consists in matching semantics and structure.

A dataset is tidy if:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each table contains measurements on only one type of observational unit.

Why tidy?

Why use the tidy standard? Wouldn’t any system of organization do just as well?

The tidy standard has three main advantages:

  1. Having a consistent system of organization makes it easier to focus on analysis and exploration. (True of any system)
  2. Many software tools are designed to work with tidy data inputs. (Tidy only)
  3. Transformation of tidy data is especially natural in most computing environments due to vectorized operations. (Tidy only)

Transformations

Transformations of data frames are operations that modify the shape or values of a data frame. These include:

  • Slicing rows and columns by index
  • Filtering rows by logical conditions
  • Defining new variables from scratch or by operations on existing variables
  • Aggregations (min, mean, max, etc.)

Slicing

Slicing refers to retrieving a (usually contiguous) subset (a ‘slice’) of rows/columns from a data frame.

Uses:

  • data inspection/retrieval
  • subsetting for further analysis/manipulation
  • data display

Data display

Recall the UN Development data:

# preview UN data -- note indexed by country
undev.head(3)
total_pop urban_pct_pop pop_under5 pop_15to64 pop_over65 gender_inequality parliament_pct_women labor_participation_women labor_participation_men
country
Afghanistan 38.0 25.8 5.6 20.9 1.0 0.655 27.2 21.6 74.7
Albania 2.9 61.2 0.2 2.0 0.4 0.181 29.5 46.7 64.6
Algeria 43.1 73.2 5.0 27.1 2.8 0.429 21.5 14.6 67.4

Aside: .head() is a slicing operation – it returns the ‘top’ slice of rows.

Data inspection/retrieval

To inspect the percentage of women in parliament in Mexico, slice accordingly:

undev.loc[['Mexico'], ['parliament_pct_women']]
parliament_pct_women
country
Mexico 48.4

Review: .loc and .iloc

The primary slicing functions in pandas are

  • .loc (location) to slice by index
  • .iloc (integer location) to slice by position
# .iloc equivalent of previous slice
undev.iloc[[111], [6]]
parliament_pct_women
country
Mexico 48.4

Check your understanding: which row in the dataframe is the observation for Mexico?

If a single index rather than a list is provided – e.g., Mexico rather than [Mexico], – these functions will return the raw value as a float rather than a dataframe.

undev.loc['Mexico', 'parliament_pct_women']
48.4

Larger slices

More typically, a slice will be a contiguous chunk of rows and columns.

Slicing operations can interpret start:end as shorthand for a range of indices.

undev.loc['Mexico':'Mongolia', ['parliament_pct_women']]
parliament_pct_women
country
Mexico 48.4
Micronesia (Federated States of) 0.0
Moldova (Republic of) 25.7
Mongolia 17.3

Note: start:end is inclusive of both endpoints with .loc, but not inclusive of the right endpoint with .iloc. Get in the habit of double-checking results.

Defining new variables

Vectorization of operations in pandas and numpy make tidy data especially nice to manipulate mathematically. For example:

weather2['TRANGE'] = weather2.TMAX - weather2.TMIN
weather2.loc[0:3, ['TMAX', 'TMIN', 'TRANGE']]
TMAX TMIN TRANGE
0 65 37 28
1 62 38 24
2 60 42 18
3 72 43 29

This computes \(t_{min, i} - t_{max, i}\) for all observations \(i = 1, \dots, n\).

Check your understanding: express this calculation as a linear algebra arithmetic operation.

Your turn

Let’s take another example – consider this slice of the undev data:

total_pop urban_pct_pop
country
Afghanistan 38.0 25.8
Albania 2.9 61.2
Algeria 43.1 73.2

With your neighbor, write a line of code that calculates the percentage of the population living in rural areas.

Filtering

Filtering refers to removing a subset of rows based on one or more conditions. (Think of “filtering out” certain rows.)

For example, suppose we wanted to retrieve only the countries with populations exceeding 1Bn people:

undev[undev.total_pop > 1000]
total_pop urban_pct_pop pop_under5 pop_15to64 pop_over65 gender_inequality parliament_pct_women labor_participation_women labor_participation_men
country
China 1433.8 60.3 85.0 1014.0 164.5 0.168 24.9 60.5 75.3
India 1366.4 34.5 116.8 915.6 87.1 0.488 13.5 20.5 76.1

Filtering

Technically, filtering works by slicing according to a long logical vector with one entry per row specifying whether to retain (True) or drop (False).

undev.total_pop > 1000
country
Afghanistan                           False
Albania                               False
Algeria                               False
Andorra                               False
Angola                                False
                                      ...  
Venezuela (Bolivarian Republic of)    False
Viet Nam                              False
Yemen                                 False
Zambia                                False
Zimbabwe                              False
Name: total_pop, Length: 189, dtype: bool

A small puzzle

Consider a random filter:

random_filter = np.random.binomial(n = 1, p = 0.03, size = undev.shape[0]).astype('bool')

random_filter
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False])
  1. How many rows will undev[random_filter] have?
  2. How many rows should this random filtering produce on average?

Logical comparisons

Any of the following relations can be used to define filtering conditions

Symbol Usage Meaning
== a == b Does a equal b?
<= a <= b Is a less than or equal to b?
>= a >= b Is a greater than or equal to b?
< a < b Is a less than b?
> a > b Is a greater than b?
~ ~p Returns negation of p
| p | q p OR q
& p & q p AND q
^ p ^ q p XOR q (exclusive or)

Aggregation

Aggregation refers to any operation that combines many values into fewer values.

Common aggregation operations include:

  • summation \(\sum_{i} x_i\)
  • averaging \(n^{-1} \sum_i x_i\)
  • extrema \(\text{min}_i x_i\) and \(\text{max}_i x_i\)
  • statistics: median, variance, standard deviation, mean absolute deviation, order statistics, quantiles

Aggregation vs. other transformations

Aggregations reduce the number of values, whereas other transformations do not.

A bit more formally:

  • aggregations map larger sets of values to smaller sets of values
  • transformations map sets of values to sets of the same size

Check your understanding:

  • is \((f*g)(x_i) = \int f(h)g(x_i - h)dh\) an aggregation?
  • is \(f(x_1, x_2, \dots, x_n) = \left(\prod_i x_i\right)^{\frac{1}{n}}\) an aggregation?

Aggregation?

Gaussian blur.

Example aggregations

In numpy, the most common aggregations are implemented as functions:

numpy function
np.sum() \(\sum_i x_i\)
np.max() \(\text{max}(x_1, \dots, x_n)\)
np.min() \(\text{min}(x_1, \dots, x_n)\)
np.median() \(\text{median}(x_1, \dots, x_n)\)
np.mean() \(n^{-1}\sum_{i = 1}^n x_i\)
np.var() \((n - 1)^{-1}\sum_{i = 1}^n (x_i - \bar{x})^2\)
np.std() \(\sqrt{(n - 1)^{-1}\sum_{i = 1}^n (x_i - \bar{x})^2}\)
np.prod() \(\prod_i x_i\)
np.percentile() \(\hat{F}^{-1}(q)\)

Argmin and argmax

\(\text{argmax}_D f(x)\) refers to the value or values in the domain \(D\) of \(f\) at which the function attains its maximum – the argument in \(D\) maximizing \(f\).

Similarly, \(\text{argmax}_i x_i\) refers to the index (or indices, if ties) of the largest value in the set \(\{x_i\}\).

Check your understanding: what does the following return?

np.array([1, 5, 10, 2]).argmin()

Argmin and argmax

These index retrieval functions can be handy for slicing rows of interest.

For example, which country had the largest percentage of women in parliament in the year the UN development data was collected?

undev.index[undev.parliament_pct_women.argmax()]
'Rwanda'

And what were the observations?

undev.iloc[undev.parliament_pct_women.argmax(), :]
total_pop                    12.600
urban_pct_pop                17.300
pop_under5                    1.800
pop_15to64                    7.200
pop_over65                    0.400
gender_inequality             0.402
parliament_pct_women         55.700
labor_participation_women    83.900
labor_participation_men      83.400
Name: Rwanda, dtype: float64

Dataframe aggregations

In pandas, the numpy aggregation operations are available as dataframe methods that apply the corresponding operation over each column:

# mean of every column
undev.mean()
total_pop                    40.423810
urban_pct_pop                58.660847
pop_under5                    3.666120
pop_15to64                   27.250820
pop_over65                    3.797814
gender_inequality             0.344154
parliament_pct_women         23.093048
labor_participation_women    52.139888
labor_participation_men      72.470787
dtype: float64

Row-wise aggregation

In general, supplying the argument axis = 1 will compute rowwise aggregations. For example:

# sum `pop_under5`, `pop_15to64`, and `pop_over65`
undev.iloc[:, 2:5].sum(axis = 1).head(3)
country
Afghanistan    27.5
Albania         2.6
Algeria        34.9
dtype: float64

This facilitates, for example:

undev['pop_5to14'] = undev.total_pop - undev.iloc[:, 2:5].sum(axis = 1)

Argmin/idxmin and argmax/idxmax

In pandas, np.argmin() and np.argmax() are implemented as pd.df.idxmin() and pd.df.idxmax().

undev.idxmax()
total_pop                                     China
urban_pct_pop                Hong Kong, China (SAR)
pop_under5                                    India
pop_15to64                                    China
pop_over65                                    China
gender_inequality                             Yemen
parliament_pct_women                         Rwanda
labor_participation_women                    Rwanda
labor_participation_men                       Qatar
pop_5to14                                     India
dtype: object

Other functions

Pandas has a wide array of other aggregation and transformation functions. To show just one example:

## slice weather data
weather4 = weather1.set_index('DATE').iloc[:, 2:4]
weather4.head(2)
TMAX TMIN
DATE
1/1/2021 65 37
1/2/2021 62 38
# rolling average
weather4.rolling(window = 7).mean().head(10)
TMAX TMIN
DATE
1/1/2021 NaN NaN
1/2/2021 NaN NaN
1/3/2021 NaN NaN
1/4/2021 NaN NaN
1/5/2021 NaN NaN
1/6/2021 NaN NaN
1/7/2021 66.285714 39.571429
1/8/2021 68.285714 39.428571
1/9/2021 69.571429 39.571429
1/10/2021 70.571429 38.857143

Check your understanding

Interpret this result:

weather4.rolling(window = 7).mean().idxmax()
TMAX    1/20/2021
TMIN    3/24/2021
dtype: object

(The weather data is January through March.)

Custom functions

See the documentation for a comprehensive list of transformations and aggregations.

If pandas doesn’t have a method for an operation you’re wanting to perform, you can implement custom transformations/aggregations with:

  • pd.df.apply() or pd.df.transform() apply a function row-wise or column-wise
  • pd.df.agg() or pd.df.aggregate()

Custom functions

Here’s an example:

1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
Country Name
Argentina 5.427843 -0.852022 -5.308197 10.130298 10.569433 -0.659726 3.191997 4.822501 9.679526 3.045643 ... 10.125398 6.003952 -1.026420 2.405324 -2.512615 2.731160 -2.080328 2.818503 -2.565352 -2.088015
Australia 2.485769 1.296087 6.214630 6.978522 5.983506 2.382458 6.302620 5.095814 7.044329 7.172187 ... 2.067417 2.462756 3.918163 2.584898 2.533115 2.192647 2.770652 2.300611 2.949286 2.160956

2 rows × 59 columns

# convert percentages to proportions
gdp_prop = gdp.transform(lambda x: x/100 + 1)

# compute geometric mean
gdp_prop.aggregate(
    lambda x: np.prod(x)**(1/len(x)), 
    axis = 1).head(4)
Country Name
Argentina    1.022831
Australia    1.034228
Austria      1.027254
Burundi      1.023854
dtype: float64

Your turn

Here’s the country with the highest annualized GDP growth for the period 1961-2019:

Country Name
Botswana    1.079442
dtype: float64

How did I find this? Suppose that the result on the previous slide were stored as gdp_annualized. Write a line of code that generates the result shown above.

Grouped aggregations

Suppose we wanted to compute annualized growth by decade for each country.

To do so, we’d compute the same aggregation (geometric mean) repeatedly for subsets of data values. This is called a grouped aggregation.

Usually, one defines a grouping of dataframe rows using columns in the dataset. For example:

gdp_decades.head(4)
Country Name growth decade
0 Argentina 1.054278 1960
1 Australia 1.024858 1960
2 Austria 1.055380 1960
3 Burundi 0.862539 1960

How should the rows be grouped?

.groupby

In pandas, df.groupby('COLUMN') defines a grouping of dataframe rows in which each group is a set of rows with the same value of 'COLUMN'.

  • There will be exactly as many groups as the number of unique values in 'COLUMN'.
  • Multiple columns may be specified to define a grouping, e.g., df.groupby(['COL1', 'COL2'])
  • Subsequent operations will be performed group-wise

Annualized GDP growth by decade

Returning to our example:

gdp_anngrowth = gdp_decades.groupby(
    ['Country Name', 'decade']
    ).aggregate(
    lambda x: np.prod(x)**(1/len(x))
    )

gdp_anngrowth
growth
Country Name decade
Algeria 1960 1.030579
1970 1.068009
1980 1.027661
1990 1.015431
2000 1.038750
... ... ...
Zimbabwe 1970 1.038505
1980 1.051066
1990 1.027630
2000 0.944765
2010 1.055855

714 rows × 1 columns

Your turn

How do you find the country with the highest annualized GDP growth for each decade?

Write a line of code that would perform this calculation.

gdp_anngrowth...
growth
decade
1960 (Iran, Islamic Rep., 1960)
1970 (Botswana, 1970)
1980 (Botswana, 1980)
1990 (China, 1990)
2000 (Myanmar, 2000)
2010 (China, 2010)

Recap

  • In tidy data, rows and columns correspond to observations and variables.
    • This provides a standard dataset structure that facilitates exploration and analysis.
    • Many datasets are not stored in this format.
    • Transformation operations are a lot easier with tidy data, due in part to the way tools in pandas are designed.
  • Transformations are operations that modify the shape or values of dataframes. We discussed
    • slicing
    • filtering
    • creating new variables
    • aggregations (mean, min, max, argmin, etc.)
    • grouped aggregations
  • Dataframe manipulations will be used throughout the course to tidy up data and perform various inspections and summaries.

Up next

We started en media res at this stage of the lifecyle (tidy) so that you could start developing skills that would enable you to jump right into playing with datasets.

Next week, we’ll backtrack to the data collection and assessment stages of a project and discuss:

  • sampling
  • scope of inference
  • data assessment
  • missing data