Dataframe Transformations

PSTAT100 Spring 2023


Recap: tidy data

The tidy standard consists of matching a dataset's structure to its semantics.

A dataset is tidy if:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each table contains measurements on only one type of observational unit.

Why tidy?

Why use the tidy standard? Wouldn’t any system of organization do just as well?

The tidy standard has three main advantages:

  1. Having a consistent system of organization makes it easier to focus on analysis and exploration. (True of any system)
  2. Many software tools are designed to work with tidy data inputs. (Tidy only)
  3. Transformation of tidy data is especially natural in most computing environments due to vectorized operations. (Tidy only)

Transformations

Transformations of data frames are operations that modify the shape or values of a data frame. These include:

  • Slicing rows and columns by index
  • Filtering rows by logical conditions
  • Defining new variables from scratch or by operations on existing variables
  • Aggregations (min, mean, max, etc.)

Slicing

Slicing refers to retrieving a (usually contiguous) subset (a ‘slice’) of rows/columns from a data frame.

Uses:

  • data inspection/retrieval
  • subsetting for further analysis/manipulation
  • data display

Data display

Recall the UN Development data:

# preview UN data -- note indexed by country
undev.head(3)
total_pop urban_pct_pop pop_under5 pop_15to64 pop_over65 gender_inequality parliament_pct_women labor_participation_women labor_participation_men
country
Afghanistan 38.0 25.8 5.6 20.9 1.0 0.655 27.2 21.6 74.7
Albania 2.9 61.2 0.2 2.0 0.4 0.181 29.5 46.7 64.6
Algeria 43.1 73.2 5.0 27.1 2.8 0.429 21.5 14.6 67.4

Aside: .head() is a slicing operation – it returns the ‘top’ slice of rows.

Data inspection/retrieval

To inspect the percentage of women in parliament in Mexico, slice accordingly:

undev.loc[['Mexico'], ['parliament_pct_women']]
parliament_pct_women
country
Mexico 48.4

Review: .loc and .iloc

The primary slicing functions in pandas are

  • .loc (location) to slice by index
  • .iloc (integer location) to slice by position
# .iloc equivalent of previous slice
undev.iloc[[111], [6]]
parliament_pct_women
country
Mexico 48.4

Check your understanding: which row in the dataframe is the observation for Mexico?

If a single index rather than a list is provided – e.g., 'Mexico' rather than ['Mexico'] – these functions will return the raw value as a float rather than a dataframe.

undev.loc['Mexico', 'parliament_pct_women']
48.4
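A quick illustration of how the return type depends on the input form, using the same slice:

undev.loc[['Mexico'], ['parliament_pct_women']]   # DataFrame (lists for both)
undev.loc['Mexico', ['parliament_pct_women']]     # Series (scalar row label)
undev.loc['Mexico', 'parliament_pct_women']       # float (scalars for both)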

Larger slices

More typically, a slice will be a contiguous chunk of rows and columns.

Slicing operations can interpret start:end as shorthand for a range of indices.

undev.loc['Mexico':'Mongolia', ['parliament_pct_women']]
parliament_pct_women
country
Mexico 48.4
Micronesia (Federated States of) 0.0
Moldova (Republic of) 25.7
Mongolia 17.3

Note: start:end is inclusive of both endpoints with .loc, but not inclusive of the right endpoint with .iloc. Get in the habit of double-checking results.
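To see the difference concretely, here is a minimal sketch with a hypothetical four-row frame:

import pandas as pd

# toy frame (hypothetical) to contrast endpoint behavior
df = pd.DataFrame({'x': [10, 20, 30, 40]}, index = ['a', 'b', 'c', 'd'])

df.loc['a':'c']    # rows a, b, c -- right endpoint included
df.iloc[0:2]       # rows a, b    -- right endpoint excluded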

Defining new variables

Vectorization of operations in pandas and numpy makes tidy data especially nice to manipulate mathematically. For example:

weather2['TRANGE'] = weather2.TMAX - weather2.TMIN
weather2.loc[0:3, ['TMAX', 'TMIN', 'TRANGE']]
TMAX TMIN TRANGE
0 65 37 28
1 62 38 24
2 60 42 18
3 72 43 29

This computes \(t_{max, i} - t_{min, i}\) for all observations \(i = 1, \dots, n\).

Check your understanding: express this calculation as a linear algebra (vector) operation.

Your turn

Let’s take another example – consider this slice of the undev data:

total_pop urban_pct_pop
country
Afghanistan 38.0 25.8
Albania 2.9 61.2
Algeria 43.1 73.2

With your neighbor, write a line of code that calculates the percentage of the population living in rural areas.

Filtering

Filtering refers to removing a subset of rows based on one or more conditions. (Think of “filtering out” certain rows.)

For example, suppose we wanted to retrieve only the countries with populations exceeding 1 billion people. Since total_pop is recorded in millions, the condition is total_pop > 1000:

undev[undev.total_pop > 1000]
total_pop urban_pct_pop pop_under5 pop_15to64 pop_over65 gender_inequality parliament_pct_women labor_participation_women labor_participation_men
country
China 1433.8 60.3 85.0 1014.0 164.5 0.168 24.9 60.5 75.3
India 1366.4 34.5 116.8 915.6 87.1 0.488 13.5 20.5 76.1

Filtering

Technically, filtering works by slicing according to a logical vector with one entry per row specifying whether to retain (True) or drop (False) that row.

undev.total_pop > 1000
country
Afghanistan                           False
Albania                               False
Algeria                               False
Andorra                               False
Angola                                False
                                      ...  
Venezuela (Bolivarian Republic of)    False
Viet Nam                              False
Yemen                                 False
Zambia                                False
Zimbabwe                              False
Name: total_pop, Length: 189, dtype: bool
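As a result, a stored mask can be passed either to plain bracket indexing or to .loc; a quick check that the two agree:

# boolean masks work with [] and with .loc alike
mask = undev.total_pop > 1000
undev[mask].equals(undev.loc[mask])   # True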

A small puzzle

Consider a random filter:

random_filter = np.random.binomial(n = 1, p = 0.03, size = undev.shape[0]).astype('bool')

random_filter
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True,  True, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False,  True])
  1. How many rows will undev[random_filter] have?
  2. How many rows should this random filtering produce on average?

Logical comparisons

Any of the following comparison and logical operators can be used to define filtering conditions:

Symbol Usage Meaning
== a == b Does a equal b?
<= a <= b Is a less than or equal to b?
>= a >= b Is a greater than or equal to b?
< a < b Is a less than b?
> a > b Is a greater than b?
~ ~p Returns negation of p
| p | q p OR q
& p & q p AND q
^ p ^ q p XOR q (exclusive or)
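When combining conditions, each comparison must be parenthesized, since & and | bind more tightly than comparison operators in Python. A sketch with illustrative thresholds:

# countries with populations over 1Bn OR urbanization above 90%
undev[(undev.total_pop > 1000) | (undev.urban_pct_pop > 90)]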

Aggregation

Aggregation refers to any operation that combines many values into fewer values.

Common aggregation operations include:

  • summation \(\sum_{i} x_i\)
  • averaging \(n^{-1} \sum_i x_i\)
  • extrema \(\text{min}_i x_i\) and \(\text{max}_i x_i\)
  • statistics: median, variance, standard deviation, mean absolute deviation, order statistics, quantiles

Aggregation vs. other transformations

Aggregations reduce the number of values, whereas other transformations do not.

A bit more formally:

  • aggregations map larger sets of values to smaller sets of values
  • transformations map sets of values to sets of the same size
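A minimal numpy illustration of the distinction:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

np.mean(x)   # aggregation: four values in, one value out
np.log(x)    # transformation: four values in, four values out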

Check your understanding:

  • is \((f*g)(x_i) = \int f(h)g(x_i - h)dh\) an aggregation?
  • is \(f(x_1, x_2, \dots, x_n) = \left(\prod_i x_i\right)^{\frac{1}{n}}\) an aggregation?

Aggregation?

Gaussian blur.

Example aggregations

In numpy, the most common aggregations are implemented as functions:

numpy function operation
np.sum() \(\sum_i x_i\)
np.max() \(\text{max}(x_1, \dots, x_n)\)
np.min() \(\text{min}(x_1, \dots, x_n)\)
np.median() \(\text{median}(x_1, \dots, x_n)\)
np.mean() \(n^{-1}\sum_{i = 1}^n x_i\)
np.var() \(n^{-1}\sum_{i = 1}^n (x_i - \bar{x})^2\) (default ddof = 0)
np.std() \(\sqrt{n^{-1}\sum_{i = 1}^n (x_i - \bar{x})^2}\)
np.prod() \(\prod_i x_i\)
np.percentile() \(\hat{F}^{-1}(q)\)
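Note that np.percentile() takes the quantile level as a percentage. For example:

import numpy as np

x = np.array([2, 4, 6, 8, 10])

np.percentile(x, 50)   # 6.0 -- the median
np.percentile(x, 25)   # 4.0 -- the first quartile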

Argmin and argmax

\(\text{argmax}_{x \in D} f(x)\) refers to the value or values in the domain \(D\) at which \(f\) attains its maximum – the argument in \(D\) maximizing \(f\).

Similarly, \(\text{argmax}_i x_i\) refers to the index (or indices, if ties) of the largest value in the set \(\{x_i\}\).

Check your understanding: what does the following return?

np.array([1, 5, 10, 2]).argmin()

Argmin and argmax

These index retrieval functions can be handy for slicing rows of interest.

For example, which country had the largest percentage of women in parliament in the year the UN development data was collected?

undev.index[undev.parliament_pct_women.argmax()]
'Rwanda'

And what were the observations?

undev.iloc[undev.parliament_pct_women.argmax(), :]
total_pop                    12.600
urban_pct_pop                17.300
pop_under5                    1.800
pop_15to64                    7.200
pop_over65                    0.400
gender_inequality             0.402
parliament_pct_women         55.700
labor_participation_women    83.900
labor_participation_men      83.400
Name: Rwanda, dtype: float64

Dataframe aggregations

In pandas, the numpy aggregation operations are available as dataframe methods that apply the corresponding operation over each column:

# mean of every column
undev.mean()
total_pop                    40.423810
urban_pct_pop                58.660847
pop_under5                    3.666120
pop_15to64                   27.250820
pop_over65                    3.797814
gender_inequality             0.344154
parliament_pct_women         23.093048
labor_participation_women    52.139888
labor_participation_men      72.470787
dtype: float64

Row-wise aggregation

In general, supplying the argument axis = 1 will compute row-wise aggregations. For example:

# sum `pop_under5`, `pop_15to64`, and `pop_over65`
undev.iloc[:, 2:5].sum(axis = 1).head(3)
country
Afghanistan    27.5
Albania         2.6
Algeria        34.9
dtype: float64

This facilitates, for example:

undev['pop_5to14'] = undev.total_pop - undev.iloc[:, 2:5].sum(axis = 1)
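An equivalent sketch that names the columns explicitly, which is safer than positional slicing if the column order ever changes:

# same computation, robust to column reordering
age_cols = ['pop_under5', 'pop_15to64', 'pop_over65']
undev['pop_5to14'] = undev.total_pop - undev[age_cols].sum(axis = 1)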

Argmin/idxmin and argmax/idxmax

In pandas, the analogues of np.argmin() and np.argmax() are pd.df.idxmin() and pd.df.idxmax(), which return the index label (rather than the integer position) of the minimum or maximum.

undev.idxmax()
total_pop                                     China
urban_pct_pop                Hong Kong, China (SAR)
pop_under5                                    India
pop_15to64                                    China
pop_over65                                    China
gender_inequality                             Yemen
parliament_pct_women                         Rwanda
labor_participation_women                    Rwanda
labor_participation_men                       Qatar
pop_5to14                                     India
dtype: object
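The distinction to remember: argmin/argmax return an integer position, while idxmin/idxmax return the index label. For a single column:

undev.parliament_pct_women.argmax()   # integer row position of the maximum
undev.parliament_pct_women.idxmax()   # 'Rwanda'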

Other functions

Pandas has a wide array of other aggregation and transformation functions. To show just one example:

## slice weather data
weather4 = weather1.set_index('DATE').iloc[:, 2:4]
weather4.head(2)
TMAX TMIN
DATE
1/1/2021 65 37
1/2/2021 62 38
# rolling average
weather4.rolling(window = 7).mean().head(10)
TMAX TMIN
DATE
1/1/2021 NaN NaN
1/2/2021 NaN NaN
1/3/2021 NaN NaN
1/4/2021 NaN NaN
1/5/2021 NaN NaN
1/6/2021 NaN NaN
1/7/2021 66.285714 39.571429
1/8/2021 68.285714 39.428571
1/9/2021 69.571429 39.571429
1/10/2021 70.571429 38.857143
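The first six rows are NaN because a full 7-day window isn't available until the seventh observation. If partial windows at the start are acceptable, the min_periods argument relaxes this:

# allow averages over partial windows at the start
weather4.rolling(window = 7, min_periods = 1).mean().head(3)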

Check your understanding

Interpret this result:

weather4.rolling(window = 7).mean().idxmax()
TMAX    1/20/2021
TMIN    3/24/2021
dtype: object

(The weather data is January through March.)

Custom functions

See the documentation for a comprehensive list of transformations and aggregations.

If pandas doesn’t have a method for an operation you’re wanting to perform, you can implement custom transformations/aggregations with:

  • pd.df.apply() or pd.df.transform(), which apply a function row-wise or column-wise
  • pd.df.agg() (an alias for pd.df.aggregate()), which computes one or more custom aggregations

Custom functions

Here’s an example, using GDP data with annual percent growth for each country (years as columns):

1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
Country Name
Argentina 5.427843 -0.852022 -5.308197 10.130298 10.569433 -0.659726 3.191997 4.822501 9.679526 3.045643 ... 10.125398 6.003952 -1.026420 2.405324 -2.512615 2.731160 -2.080328 2.818503 -2.565352 -2.088015
Australia 2.485769 1.296087 6.214630 6.978522 5.983506 2.382458 6.302620 5.095814 7.044329 7.172187 ... 2.067417 2.462756 3.918163 2.584898 2.533115 2.192647 2.770652 2.300611 2.949286 2.160956

2 rows × 59 columns

# convert percent growth rates to growth factors
gdp_prop = gdp.transform(lambda x: x/100 + 1)

# compute geometric mean
gdp_prop.aggregate(
    lambda x: np.prod(x)**(1/len(x)), 
    axis = 1).head(4)
Country Name
Argentina    1.022831
Australia    1.034228
Austria      1.027254
Burundi      1.023854
dtype: float64
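Why the geometric mean? Annualized growth is the constant factor that, compounded over \(n\) years, reproduces the total growth over the period. A quick check with hypothetical growth factors:

import numpy as np

factors = np.array([1.02, 1.05, 0.98])   # hypothetical annual growth factors
annualized = np.prod(factors)**(1/len(factors))

# compounding the annualized factor recovers the total growth
np.isclose(annualized**len(factors), np.prod(factors))   # True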

Your turn

Here’s the country with the highest annualized GDP growth for the period 1961-2019:

Country Name
Botswana    1.079442
dtype: float64

How did I find this? Suppose that the result on the previous slide were stored as gdp_annualized. Write a line of code that generates the result shown above.

Grouped aggregations

Suppose we wanted to compute annualized growth by decade for each country.

To do so, we’d compute the same aggregation (geometric mean) repeatedly for subsets of data values. This is called a grouped aggregation.

Usually, one defines a grouping of dataframe rows using columns in the dataset. For example:

gdp_decades.head(4)
Country Name growth decade
0 Argentina 1.054278 1960
1 Australia 1.024858 1960
2 Austria 1.055380 1960
3 Burundi 0.862539 1960

How should the rows be grouped?

.groupby

In pandas, df.groupby('COLUMN') defines a grouping of dataframe rows in which each group is a set of rows with the same value of 'COLUMN'.

  • There will be exactly as many groups as the number of unique values in 'COLUMN'.
  • Multiple columns may be specified to define a grouping, e.g., df.groupby(['COL1', 'COL2'])
  • Subsequent operations will be performed group-wise
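Before aggregating, it can help to inspect the grouping itself; for example, the number of rows in each group:

# rows (country observations) per decade group
gdp_decades.groupby('decade').size()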

Annualized GDP growth by decade

Returning to our example:

gdp_anngrowth = gdp_decades.groupby(
    ['Country Name', 'decade']
    ).aggregate(
    lambda x: np.prod(x)**(1/len(x))
    )

gdp_anngrowth
growth
Country Name decade
Algeria 1960 1.030579
1970 1.068009
1980 1.027661
1990 1.015431
2000 1.038750
... ... ...
Zimbabwe 1970 1.038505
1980 1.051066
1990 1.027630
2000 0.944765
2010 1.055855

714 rows × 1 columns

Your turn

How do you find the country with the highest annualized GDP growth for each decade?

Write a line of code that would perform this calculation.

gdp_anngrowth...
growth
decade
1960 (Iran, Islamic Rep., 1960)
1970 (Botswana, 1970)
1980 (Botswana, 1980)
1990 (China, 1990)
2000 (Myanmar, 2000)
2010 (China, 2010)

Recap

  • In tidy data, rows and columns correspond to observations and variables.
    • This provides a standard dataset structure that facilitates exploration and analysis.
    • Many datasets are not stored in this format.
    • Transformation operations are a lot easier with tidy data, due in part to the way tools in pandas are designed.
  • Transformations are operations that modify the shape or values of dataframes. We discussed
    • slicing
    • filtering
    • creating new variables
    • aggregations (mean, min, max, argmin, etc.)
    • grouped aggregations
  • Dataframe manipulations will be used throughout the course to tidy up data and perform various inspections and summaries.

Up next

We started in medias res at this stage of the lifecycle (tidy data) so that you could start developing skills that would enable you to jump right into playing with datasets.

Next week, we’ll backtrack to the data collection and assessment stages of a project and discuss:

  • sampling
  • scope of inference
  • data assessment
  • missing data