```
# Initialize Otter
import otter
grader = otter.Notebook("hw3-diatom.ipynb")
```

# Background: diatoms and paleoclimatology

```
import numpy as np
import pandas as pd
import altair as alt
from sklearn.decomposition import PCA
import statsmodels.api as sm
# disable row limit for plotting
alt.data_transformers.disable_max_rows()

# uncomment to ensure graphics display with pdf export
# alt.renderers.enable('mimetype')
```

Diatoms are a type of phytoplankton – they are photosynthetic algae that function as primary producers in aquatic ecosystems. Diatoms are at the bottom of the food web: they are consumed by filter feeders, like clams, mussels, and many fish, which are in turn consumed by larger organisms like scavengers and predators and, well, us. As a result, changes in the composition of diatom species have ripple effects that can dramatically alter the overall community structure of marine ecosystems.

Diatoms have glass bodies. As a group of organisms, they display a great diversity of body shapes, and many are quite elaborate. The image below, taken from a Scientific American article, shows a small sample of their shapes and structures.

Because they are made of glass, diatoms preserve extraordinarily well over time. When they die, their bodies sink and form part of the sediment. Due to their abundance, there is a sort of steady rain of diatoms forming part of the sedimentation process, which produces sediment layers that are dense with diatoms.

Sedimentation is a long-term process spanning great stretches of time, and the deeper one looks in sediment, the older the material. Since diatoms are present in high density throughout sedimentation layers, and they preserve so well, it is possible to study their presence over longer time spans – potentially hundreds of thousands of years.

A branch of paleoclimatology is dedicated to studying changes in biological productivity on geologic time scales, and much research in this area has involved studying the relative abundances of diatoms. In this assignment, you’ll do just that on a small scale, working with data from sediment cores taken in the Gulf of California at the location indicated on the map:

The data are publicly available:

> Barron, J.A., *et al.* 2005. High Resolution Guaymas Basin Geochemical, Diatom, and Silicoflagellate Data. IGBP PAGES/World Data Center for Paleoclimatology Data Contribution Series # 2005-022. NOAA/NGDC Paleoclimatology Program, Boulder CO, USA.

In this assignment, you’ll use the exploratory techniques we’ve been discussing in class to analyze the relative abundances of diatom taxa over a time span of 15,000 years. This will involve practicing the following:

- data import and preprocessing
- graphical techniques for visualizing distributions
- multivariate analysis with PCA

# Diatom data

The data are diatom counts sampled from evenly-spaced depths in a sediment core from the Gulf of California. In sediment cores, depth correlates with time before the present – deeper layers are older – and depths are typically chosen to obtain a desired temporal resolution. The counts were recorded by sampling material from the sediment core at each depth and examining the sampled material for phytoplankton cells. For each sample, phytoplankton were identified at the taxon level, and counts of diatom taxa were recorded along with the total number of phytoplankton cells identified. Thus:

- The **observational units** are **sediment samples**.
- The **variables** are **depth (age), diatom abundance counts, and the total number of identified phytoplankton**. Age is inferred from radiocarbon.
- One **observation** is made at **each depth**, from 0 cm (surface) to 13.71 cm.

The table below provides variable descriptions and units for each column in the dataframe.

| Variable | Description | Units |
|---|---|---|
| Depth | Depth interval location of sampled material in sediment core | Centimeters (cm) |
| Age | Radiocarbon age | Thousands of years before present (KyrBP) |
| A_curv | Abundance of *Actinocyclus curvatulus* | Count (n) |
| A_octon | Abundance of *Actinocyclus octonarius* | Count (n) |
| ActinSpp | Abundance of *Actinoptychus* species | Count (n) |
| A_nodul | Abundance of *Azpeitia nodulifer* | Count (n) |
| CocsinSpp | Abundance of *Coscinodiscus* species | Count (n) |
| CyclotSpp | Abundance of *Cyclotella* species | Count (n) |
| Rop_tess | Abundance of *Roperia tesselata* | Count (n) |
| StephanSpp | Abundance of *Stephanopyxis* species | Count (n) |
| Num.counted | Number of diatoms counted in sample | Count (n) |

The cell below imports the data.

```
# import diatom data
diatoms_raw = pd.read_csv('data/barron-diatoms.csv')
diatoms_raw.head(5)
```

The data are already in tidy format, because each row is an observation (a set of measurements on one sample of sediment) and each column is a variable (one of age, depth, or counts). However, examine rows 3 and 4, and note:

- NaNs are present
- The number of individuals counted varies considerably from sample to sample.

Let’s address those before conducting initial explorations.

The NaNs are an artefact of the data recording: if *no* diatoms in a particular taxon are observed, a `-` is entered in the table (you can verify this by checking the .csv file). In these cases the value isn’t missing, but rather zero. These entries are parsed by pandas as NaNs, but they correspond to a value of 0 (no diatoms observed).
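As a minimal illustration of this cleanup step (a toy dataframe with hypothetical column values, not the assignment's data), pandas' `.fillna(0)` replaces the parsed NaNs with zero counts:

```python
import numpy as np
import pandas as pd

# toy counts table: NaNs stand in for the '-' entries in the raw file
toy = pd.DataFrame({'A_curv': [3, np.nan], 'A_octon': [np.nan, 7]})

# replace every NaN with a zero count
filled = toy.fillna(0)
```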

### Question 1: Filling NaNs

Use `.fillna()` to replace all NaNs by zeros, and store the result as `diatoms_mod1`. Store rows 4 and 5 (index, not integer location) of the resulting dataframe as `diatoms_mod1_examplerows` and display these rows.

```
diatoms_mod1 = ...

# print rows 4 and 5
diatoms_mod1_examplerows = ...
print(diatoms_mod1_examplerows)
```

`grader.check("q1")`

Since the total number of phytoplankton counted in each sample varies, the raw counts are not directly comparable – *e.g.*, a count of 18 is actually a *different* abundance in a sample with 200 individuals counted than in a sample with 300 individuals counted.

For exploratory analysis, you’ll want the values to be comparable across rows. This can be achieved by a simple transformation so that the values are *relative* abundances: *proportions* of phytoplankton observed from each taxon.
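As a sketch of this transformation on toy data (hypothetical column names, not the assignment's dataframe), dividing each count column by the row totals with `.div(..., axis=0)` yields proportions:

```python
import pandas as pd

# toy counts: two taxa plus a column of per-sample totals
counts = pd.DataFrame({
    'taxon1': [10, 30],
    'taxon2': [50, 60],
    'total': [200, 300]
})

# divide each count column by the row's total; axis=0 aligns the divisor by row
props = counts[['taxon1', 'taxon2']].div(counts['total'], axis=0)
```

Here a raw count of 30 becomes 0.10 in a sample of 300, making rows comparable.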

### Question 2: Counts to proportions

Convert the counts to proportions by dividing by the relevant entry in the `Num.counted` column. There are a few ways to do this, but here’s one approach:

- Set Depth and Age to row indices using `.set_index(...)` and store the result as `diatoms_mod2`.
- Store the `Num.counted` column from `diatoms_mod2` as `sampsize`.
- Use `.div(...)` to divide entrywise every column in `diatoms_mod2` by `sampsize` and store the result as `diatoms_mod3`.
- Drop the `Num.counted` column from `diatoms_mod3` and reset the index; store the result as `diatoms`.

Carry out these steps and print the first four rows of `diatoms`.

(*Hint*: careful with the `axis = ...` argument in `.div(...)`; you may want to look at the documentation and check one or two values manually to verify the calculation works as intended.)

```
# set depth, age to indices
diatoms_mod2 = ...

# store sample sizes
sampsize = ...

# divide
diatoms_mod3 = ...

# drop num.counted and reset index
diatoms = ...

# print
...
```

`grader.check("q2")`

Take a moment to think about what the data represent. They are relative abundances over time; essentially, snapshots of the community composition of diatoms over time, and thus information about how ecological community composition changes.

Before diving in, it will be helpful to resolve two matters:

- How far back in time do the data go?
- What is the time resolution of the data?

### Question 3: Time span

What is the geological time span covered by the data? Compute the minimum and maximum age using `.aggregate(...)` and store the result as `min_max_age`. (You will need to use `.aggregate()` to pass the automated tests.)

*Note*: This may be a new function for you, but it’s simple: it takes as an argument a list of functions that will be applied to the dataframe (columnwise by default). So for example, to get the mean and variance of each column in `df`, one would use `df.aggregate(['mean', 'var'])`. See the documentation for further examples.
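For instance, on a toy single-column dataframe (hypothetical values), passing `['min', 'max']` returns one row per function:

```python
import pandas as pd

# toy frame standing in for an Age variable (hypothetical values, in KyrBP)
df = pd.DataFrame({'Age': [0.5, 3.2, 7.8, 14.9]})

# .aggregate applies each named function columnwise, one row per function
min_max = df.aggregate(['min', 'max'])
```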

*Remember*: age is reported as thousands of years before present, so `Age = 2` means 2000 years ago.

```
min_max_age = ...
min_max_age
```

`grader.check("q3")`

### Question 4: Time resolution

**(i) How are the observations spaced in time?**

Follow these steps:

- Extract the `Age` column from `diatoms`, sort the values in ascending order, compute the differences between consecutive rows, and store the result as `diffs`. *Hint*: use `.sort_values()` and `.diff()`. *Notice*: the first difference is NaN, because there is no previous value to compare the first row with. Drop this entry when you store `diffs`.
- Make a simple count histogram (no need to manually bin or convert to density scale) with bins of width 0.02 (20 years). Store this as `fig1` and display the figure.
  - Label the x axis ‘Time step between consecutive samples’.
  - Put the y axis on a square root scale so that bins with one observation are discernible.
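The sort-and-difference step can be sketched on toy values (hypothetical ages, in KyrBP):

```python
import pandas as pd

# toy ages in KyrBP, deliberately unsorted
ages = pd.Series([0.10, 0.02, 0.30, 0.18])

# sort ascending, difference consecutive values, drop the leading NaN
diffs = ages.sort_values().diff().dropna()
```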

**(ii) What is the typical time step in years?**

Write your answer based on the count histogram.

*Type your answer here, replacing this text.*

```
# store differences
diffs = ...

# construct histogram
fig1 = alt.Chart(diffs).mark_bar().encode(
    x = ...,
    y = ...
)

# display
fig1
```

`grader.check("q4")`

# Univariate explorations

To begin, you’ll examine the variation in relative abundance over time for the eight individual taxa, one variable at a time.

Here are some initial questions in this spirit that will help you home in and develop more focused exploratory questions:

- Which taxa are most and least abundant on average over time?
- Which taxa vary the most over time?

These can be answered by computing simple summary statistics for each column in the diatom data.

### Question 5: Summary statistics

Use `.aggregate(...)` to find the mean and standard deviation of relative abundances for each taxon. Follow these steps:

- Drop the depth and age variables before performing the aggregation.
- Use `.transpose()` to ensure that the table is rendered in long form (8 rows by 2 columns rather than 2 rows by 8 columns).
- Store the resulting dataframe as `diatom_summary` and display it.

```
diatom_summary = ...

# print the dataframe
diatom_summary
```

`grader.check("q5")`

It will be easier to determine which taxa are most/least abundant and most variable by displaying this information visually.

### Question 6: Visualizing summary statistics

**(i) Create a plot of the average relative abundances and their variation over time.**

- Reset the index of `diatom_summary` so that the taxon names are stored as a column and not an index. Store the result as `plot_df`.
- Create an Altair chart based on `plot_df` with *no marks* – just `alt.Chart(...).encode(...)` – and pass the column of taxon names to the `Y` encoding channel with the title ‘Taxon’, sorted in descending order of mean relative abundance. Store the result as `base`. *Hint*: `alt.Y(..., sort = {'field': 'column', 'order': 'descending'})` will sort the Y channel by ‘column’ in descending order.
- Modify `base` to create a point plot of the average relative abundances for each taxon; store the result as `means`.
  - Average relative abundance (the mean computed above) should appear on the x axis, and taxon on the y axis.
  - Since the `Y` encoding was already specified in `base`, you do not need to add a `Y` encoding at this stage.
  - Give the x axis the title ‘Average relative abundance’.
- Modify `base` to create a plot with bars spanning two standard deviations in either direction from the mean. Store the result as `bars`.
  - First use `base.transform_calculate(...)` to compute `lwr` and `upr` for the positions of the bar endpoints:
    - \(\texttt{lwr} = \texttt{mean} - 2\times\texttt{std}\)
    - \(\texttt{upr} = \texttt{mean} + 2\times\texttt{std}\)
  - Then append `.mark_errorbar().encode(...)` to the chain:
    - pass `lwr:Q` to the `X` encoding channel with the title ‘Average relative abundance’ (to match the point plot)
    - pass `upr:Q` to the `X2` encoding channel (no specific title needed).
- Layer the plots: `means + bars`. Store the result as `fig2` and display the figure.

It may help to have a look at this example.

**(ii) Based on the figure, answer the following questions.**

- Which taxon is most abundant on average over time?
- Which taxon is most rare on average over time?
- Which taxon varies most in relative abundance over time?

*Type your answer here, replacing this text.*

```
# reset index
plot_df = ...

# create base chart
base = alt.Chart(plot_df).encode(
    y = ...
)

# create point plot
means = base.mark_point().encode(
    x = ...
)

# create bar plot
bars = base.transform_calculate(
    lwr = ...,
    upr = ...
).mark_errorbar().encode(
    x = ...,
    x2 = ...
)

# layer
fig2 = ...

# display
fig2
```

Now that you have a sense of the typical abundances for each taxon (measured by means) and the variations in abundance (measured by standard deviations), you’ll dig in a bit further and examine the variation in abundance of the most variable taxon.

### Question 7: Distribution of *Azpeitia nodulifer* abundance over time

**(i) Construct a density scale histogram of the relative abundances of *Azpeitia nodulifer* and overlay a kernel density estimate.**

Use the `diatoms` dataframe and a bin width of 0.03. Be sure to choose an appropriate kernel and smoothing bandwidth. Store the result as `fig3`.

**(ii) Answer the following questions in a few sentences.**

- Which values are common?
- Which values are rare?
- How spread out are the values?
- Are values spread evenly or irregularly?

*Type your answer here, replacing this text.*

```
# density scale histogram
hist = alt.Chart(diatoms).transform_bin(
    as_ = ...,
    field = ...,
    bin = ...
).transform_aggregate(
    Count = ...,
    groupby = ...
).transform_calculate(
    Density = ...,
    binshift = ...
).mark_bar(size = 25, opacity = 0.8).encode(
    x = ...,
    y = ...
)

# compute density estimate
...

# plot kde
smooth = ...

# layer
fig3 = ...

# display
fig3
```

**Comment**: There are a disproportionately large number of zeroes, because in many samples no *Azpeitia nodulifer* diatoms were observed. This is a common phenomenon in ecological data, and even has a name: it results in a ‘zero inflated’ distribution of values. The first statistician to identify and name the phenomenon was Diane Lambert, in 1992. Zero inflation can present a variety of challenges. You may have noticed, for example, that there was no bandwidth parameter for the KDE that *both* captured the shape of the histogram near zero *and* away from zero, regardless of the choice of kernel. The key difficulty with zero inflated data is that no common probability distribution fits the data well. This requires the use of mixtures as probability models, which are challenging to incorporate into common statistical models.
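A quick way to see what zero inflation looks like is to simulate it as a mixture (illustrative parameters, not fitted to the diatom data): with some probability the value is exactly zero, otherwise it is drawn from a continuous distribution.

```python
import numpy as np

rng = np.random.default_rng(6)

# zero-inflated sample: with probability 0.4 the value is exactly zero;
# otherwise draw from a Beta distribution (illustrative parameters)
n = 1000
is_zero = rng.random(n) < 0.4
values = np.where(is_zero, 0.0, rng.beta(2, 5, size=n))

# a point mass at zero sits alongside a continuous component,
# which no single common distribution captures well
prop_zero = np.mean(values == 0)
```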

There was a transition between geological epochs during the time span covered by the diatom data. The oldest data points in the diatom data correspond to the end of the Pleistocene epoch (ice age), at which time there was a pronounced warming (Late Glacial Interstadial, 14.7 - 12.9 KyrBP) followed by a return to glacial conditions (Younger Dryas, 12.9 - 11.7 KyrBP).

This fluctuation can be seen from temperature reconstructions. Below is a plot of sea surface temperature reconstructions off the coast of Northern California. Data come from the following source:

Barron

et al., 2003. Northern Coastal California High Resolution Holocene/Late Pleistocene Oceanographic Data. IGBP PAGES/World Data Center for Paleoclimatology. Data Contribution Series # 2003-014. NOAA/NGDC Paleoclimatology Program, Boulder CO, USA.

The shaded region indicates the time window with unusually large fluctuations in sea surface temperature; this window roughly corresponds to the dates of the climate event.

```
# import sea surface temp reconstruction
seatemps = pd.read_csv('data/barron-sst.csv')

# line plot of time series
line = alt.Chart(seatemps).mark_line().encode(
    x = alt.X('Age', title = 'Thousands of years before present'),
    y = 'SST'
)

# highlight region with large variations
highlight = alt.Chart(
    pd.DataFrame({'SST': np.linspace(0, 14, 100),
                  'upr': np.repeat(11, 100),
                  'lwr': np.repeat(15, 100)})
).mark_area(opacity = 0.2, color = 'orange').encode(
    y = 'SST',
    x = alt.X('upr', title = 'Thousands of years before present'),
    x2 = 'lwr'
)

# add smooth trend
smooth = line.transform_loess(
    on = 'Age',
    loess = 'SST',
    bandwidth = 0.2
).mark_line(color = 'black')

# layer
line + highlight + smooth
```

### Question 8: Conditional distributions of relative abundance

Does the distribution of relative abundance of *Azpeitia nodulifer* differ when variation in sea temperatures was higher (before 11KyrBP)?

**(i) Plot kernel density estimates to show the distribution of relative abundances before and after 11KyrBP.**

Use the Altair implementation of Gaussian KDE:

1. Use `.transform_calculate(...)` to calculate an indicator variable, `pleistocene`, that indicates whether `Age` exceeds 11.
2. Use `.transform_density(...)` to compute KDEs separately for observations of relative abundance before and after 11KyrBP. *Hint*: group by `pleistocene`.
3. Plot the KDEs distinguished by color; give the color legend the title ‘Before 11KyrBP’ and store the plot as `kdes`.
4. Add a shaded area beneath the KDE curves. Adjust the opacity of the area to your liking.

Store the result as `fig4` and display the figure.

**(ii) Does the distribution seem to change between epochs? If so, how?**

Answer based on the figure in a few sentences.

*Type your answer here, replacing this text.*

```
...
```

# Visualizing community composition with PCA

So far you’ve seen that the abundances of one taxon – *Azpeitia nodulifer* – change markedly before and after a shift in climate conditions. In this part you’ll use PCA to compare variation in community composition among *all* eight taxa during the late Pleistocene and Holocene epochs.

### Question 9: Pairwise correlations in relative abundances

**(i) Compute the pairwise correlations between relative abundances and make a heatmap of the correlation matrix.**

Be sure to remove or set to indices the Depth and Age variables before computing the correlation matrix. Save the matrix as `corr_mx`.

- Melt `corr_mx` to obtain a dataframe with three columns: `row`, which contains the values of the index of `corr_mx` (taxon names); `column`, which contains the names of the columns of `corr_mx` (also taxon names); and `Correlation`, which contains the values of `corr_mx`.
  - Store the result as `corr_mx_long`.
- Create an Altair chart based on `corr_mx_long` and construct the heatmap by following the examples indicated above.
  - Adjust the color scheme to `blueorange` over the extent (-1, 1) to obtain a diverging color gradient where a correlation of zero is blank (white).
  - Adjust the color legend to indicate the color values corresponding to correlations of 1, 0.5, 0, -0.5, and -1.
  - Sort the rows and columns in ascending order of correlation.

**(ii) How does *A. nodulifer* seem to vary with the other taxa, if at all?**

Answer in a few sentences based on the heatmap.

*Type your answer here, replacing this text.*

```
corr_mx = ...

# melt corr_mx
...

# construct heatmap
...

# display
fig5
```

`grader.check("q9")`

### Question 10: Computing and selecting principal components

Here you’ll perform all of the calculations involved in PCA and check the variance ratios to select an appropriate number of principal components. The parts of this question correspond to the individual steps in this process.

**(i) Center and scale the data columns.**

For PCA it is usually recommended to center and scale the data; set Depth and Age as indices and center and scale the relative abundances. Store the normalized result as `pcdata`.

**(ii) Compute the principal components.**

Compute *all 8* principal components. For this part you do not need to show any specific output.

**(iii) Examine the variance ratios.**

Create a dataframe called `pcvars` with the variance information by following these steps:

- Store the proportion of variance explained (called `.explained_variance_ratio_` in the PCA output) as a dataframe named `pcvars` with just one column named `Proportion of variance explained`.
- Add a column named `Component` to `pcvars` with the integers 1 through 8 as values (indicating the component number).
- Add a column named `Cumulative variance explained` to `pcvars` that is the cumulative sum of `Proportion of variance explained`. *Hint*: slice the `Proportion of variance explained` column and use `.cumsum(axis = ...)`.

For this part you do not need to show any specific output.
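These steps can be sketched with scikit-learn on simulated data (a toy array, not the diatom data, and four components rather than eight):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # toy array standing in for the scaled abundances

# fit PCA with as many components as columns
pca = PCA(n_components=4)
pca.fit(X)

# variance ratios as a one-column dataframe, plus component number and cumulative sum
pcvars = pd.DataFrame({'Proportion of variance explained': pca.explained_variance_ratio_})
pcvars['Component'] = np.arange(1, 5)
pcvars['Cumulative variance explained'] = pcvars['Proportion of variance explained'].cumsum()
```

With all components retained, the cumulative column reaches 1.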

**(iv) Plot the variance explained by each PC.**

Use `pcvars` to construct a dual-axis plot showing the proportion of variance explained (left y axis) and cumulative variance explained (right y axis) as a function of component number (x axis), with points indicating the variance ratios and lines connecting the points. Follow these steps:

- Construct a base chart that encodes only `Component` on the `X` channel. Store this as `base`.
- Make a base layer for the proportion of variance explained that modifies `base` by encoding `Proportion of variance explained` on the `Y` channel. Store the result as `prop_var_base`.
  - Give the `Y` axis title a distinct color of your choosing via `alt.Y(..., axis = alt.Axis(titleColor = ...))`.
- Make a base layer for the cumulative variance explained that modifies `base` by encoding `Cumulative variance explained` on the `Y` channel. Store the result as `cum_var_base`.
  - Give the `Y` axis title another distinct color of your choosing via `alt.Y(..., axis = alt.Axis(titleColor = ...))`.
- Create a plot layer for the proportion of variance explained by combining points (`prop_var_base.mark_point()`) with lines (`prop_var_base.mark_line()`). Store the result as `prop_var`.
  - Apply the color you chose for the axis title to the points and lines.
- Repeat the previous step for the cumulative variance explained and store the result as `cum_var`.
  - Apply the color you chose for the axis title to the points and lines.
- Layer the plots together using `alt.layer(l1, l2).resolve_scale(y = 'independent')`.

Store the result as `fig6` and display the figure.

**(v) How many PCs should be used?**

Propose an answer based on the variance explained plots and indicate how much total variation your proposed components capture jointly.

*Type your answer here, replacing this text.*

```
## (i) center and scale data
# helper variable pcdata_raw; set Depth and Age as indices
pcdata_raw = ...

# center and scale the relative abundances
pcdata = ...
```

```
## (ii) compute pcs
pca = ...
...
```

```
## (iii) retrieve variance info
# store proportion of variance explained as a dataframe
pcvars = ...

# add component number as a new column
pcvars['Component'] = ...

# add cumulative variance explained as a new column
pcvars['Cumulative variance explained'] = ...
```

```
## (iv) plot variance explained
# encode component axis only as base layer
base = ...

# make a base layer for the proportion of variance explained
...

# make a base layer for the cumulative variance explained
...

# add points and lines to each base layer
...

# layer the layers
fig6 = ...

# display
fig6
```

`grader.check("q10")`

Now that you’ve performed the calculations for PCA, you can move on to the fun/difficult part: figuring out what they say about the data.

The first step in this process is to examine the loadings. Each principal component is a linear combination of the relative abundances by taxon, and the loadings tell you *how* that combination is formed; the loadings are the linear combination coefficients, and thus correspond to the weight of each taxon in the corresponding principal component. Some useful points to keep in mind:

- a high loading value (negative or positive) indicates that a variable strongly influences the principal component;
- a negative loading value indicates that
  - increases in the value of a variable *decrease* the value of the principal component
  - and decreases in the value of a variable *increase* the value of the principal component;
- a positive loading value indicates that
  - increases in the value of a variable *increase* the value of the principal component
  - and decreases in the value of a variable *decrease* the value of the principal component;
- similar loadings between two or more variables indicate that the principal component reflects their *average*;
- divergent loadings between two sets of variables indicate that the principal component reflects their *difference*.
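In scikit-learn, the loadings live in the `.components_` attribute, one row per component. A small simulated example (toy data, not the diatom abundances) shows the first component weighting two correlated variables with the same sign, i.e. reflecting their average:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# two strongly correlated toy variables plus a little noise
z = rng.normal(size=100)
X = np.column_stack([z + 0.1 * rng.normal(size=100),
                     z + 0.1 * rng.normal(size=100)])

pca = PCA(n_components=2).fit(X)

# each row of components_ holds the loadings for one principal component
pc1_loadings = pca.components_[0]
```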

### Question 11: Interpreting component loadings

**(i) Extract the loadings from `pca`.**

Store the loadings for the first two principal components (called `.components_` in the PCA output) in a dataframe named `loading_df`. Name the columns `PC1` and `PC2`, append a column `Taxon` with the corresponding variable names, and print the resulting dataframe.

**(ii) Construct loading plots**

Construct a line-and-point plot connecting the loadings of the first two principal components. Display the value of the loading on the y axis and the taxa names on the x axis, and show points indicating the loading values. Distinguish the PCs by color, and add lines connecting the loading values for each principal component. Store the result as `fig7` and display the figure – you may need to resize for better readability.

*Hint*: you will need to first melt `loading_df` to long form with three columns – the taxon name, the principal component (1 or 2), and the value of the loading.
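The melt step can be sketched on a toy loadings table (hypothetical values and taxon names):

```python
import pandas as pd

# toy loadings table (hypothetical values)
loading_df = pd.DataFrame({
    'PC1': [0.7, -0.1],
    'PC2': [0.1, 0.8],
    'Taxon': ['taxon_a', 'taxon_b']
})

# wide to long: one row per (taxon, component) pair
loading_plot_df = loading_df.melt(
    id_vars='Taxon',
    var_name='Principal component',
    value_name='Loading'
)
```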

**(iii) Interpret the first principal component.**

In a few sentences, answer the following questions.

1. Which taxa are up-weighted and which are down-weighted in this component?
2. How would you describe the principal component in context (*e.g.*, average abundance among a group, differences in abundances, etc.)?
3. How would you interpret a larger value of the PC versus a smaller value of the PC in terms of diatom community composition?

**(iv) Interpret the second principal component.**

Answer the same questions for component 2.

*Type your answer here, replacing this text.*

```
## (i) retrieve loadings
# store the loadings as a data frame with appropriate names
loading_df = ...

# add a column with the taxon names
loading_df['Taxon'] = ...
```

```
## (ii) construct loading plots
# melt from wide to long
loading_plot_df = loading_df.melt(
    id_vars = ...,
    var_name = ...,
    value_name = ...
)

# create base layer with encoding
base = alt.Chart(loading_plot_df).encode(
    x = ...,
    y = ...,
    color = ...
)

# store horizontal line at zero
rule = ...

# layer points + lines + rule to construct loading plot
fig7 = ...

# show
...
```

`grader.check("q11")`

Recall that there was a shift in climate around 11,000 years ago, and *A. nodulifer* abundances seemed to differ before and after the shift.

You can now use PCA to investigate whether not just individual abundances but *community composition* may have shifted around that time. To that end, let’s think of the principal components as ‘community composition indices’:

- consider PC1 a nodulifer/non-nodulifer community composition index; and
- consider PC2 a complex community composition index.

A pattern of variation or covariation in the principal components can be thought of as reflecting a particular ecological community composition dynamic – a way that community composition varies throughout time. Here you’ll look for distinct patterns of variation/covariation before and after 11,000 years ago via an exploratory plot of the principal components.

### Question 12: Visualizing community composition shift

**(i) Project the centered and scaled data onto the first two component directions.**

This sounds more complicated than it is – all it means is: compute the values of the principal components for each data point. Create a dataframe called `projected_data` containing just the first two principal components as two columns named `PC1` and `PC2`, and two additional columns with the Age and Depth variables.
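With scikit-learn, the projection is computed by `pca.transform(...)`. A sketch on simulated data (a toy array standing in for the centered and scaled abundances):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))  # toy stand-in for the centered and scaled data

# fit two components and project the data onto them
pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

# scores become the PC1 and PC2 columns
projected = pd.DataFrame(scores, columns=['PC1', 'PC2'])
```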

**(ii) Construct a scatterplot of PC1 and PC2 by epoch.**

Construct a scatterplot of the principal components with observations colored according to whether they occurred in the Pleistocene or Holocene epoch. Store the result as `fig8` and display the figure.

**(iii) Comment on the plot: does there appear to be any change in community structure?**

Answer in a few sentences.

*Type your answer here, replacing this text.*

```
## (i) project pcdata onto first two components; store as data frame
# retrieve principal component scores for pc1 and pc2
projected_data = ...

# adjust index
projected_data.index = ...
projected_data = ...
```

```
## (ii) construct scatterplot
...
# display
...
```

`grader.check("q12")`

### (Optional) Question 13: Multi-panel visualization

Sometimes it’s helpful to see marginal distributions together with a scatterplot. Follow the steps below to create a multi-panel figure with marginal density estimates appended to the projected scatter from the previous question.

- Create an Altair chart based on `projected_data` and use `.transform_calculate(...)` to define a variable `holocene` that indicates whether `Age` is older than 11,000 years. Store the result as `base`.
- Modify `base` to add points with the following encodings; store the result as `scatter`.
  - Pass PC1 to the `X` encoding channel and title the axis ‘A. nodulifer/non-A. nodulifer composition’.
  - Pass PC2 to the `Y` encoding channel and title the axis ‘Complex community composition’.
  - Pass the variable you created in step 1 to the `color` encoding channel and title it ‘Holocene’.
- Construct plots of kernel density estimates for each principal component conditional on age being older than 11,000 years:
  - modify `base` to create a `top_panel` plot with the KDE curves for PC1, with color corresponding to the age indicator from the `.transform_calculate(...)` step in making the base layer;
  - modify `base` again to create a `side_panel` plot with the KDE curves for PC2, rotated 90 degrees relative to the usual orientation (flip the typical axes), and with color corresponding to the age indicator from the `.transform_calculate(...)` step in making the base layer.
- Then resize these panels appropriately (top should be thin, side should be narrow), and use Altair’s faceting operators `&` (vertical concatenation) and `|` (horizontal concatenation) to combine them with your scatterplot.

Store the result as `fig9` and display the figure.

```
# make base layer
...

# data scatter
...

# construct side panel (kdes for pc2)
...

# facet
fig9 = ...

# display
fig9
```

# Communicating results

Take a moment to review and reflect on the results of your analysis in the previous parts. Think about how you would describe succinctly what you’ve learned from the diatom data.

### Question 14: Summary

Write a brief paragraph (3-5 sentences) that addresses the following questions by referring to your results above.

- How would you characterize the typical ecological community composition of diatom taxa before and after 11,000 years ago?
*Hint*: focus on the side and top panels and the typical values of each index in the two time periods.

- Does the variation in ecological community composition over time seem to differ before and after 11,000 years ago?
*Hint*: focus on the shape of data scatter.

*Type your answer here, replacing this text.*

### Question 15: Further work

What more might you like to know, given what you’ve learned? Pose a question that your exploratory analysis raises for you.

#### Answer

*Type your answer here.*

# Submission

- Save the notebook.
- Restart the kernel and run all cells. (**CAUTION**: if your notebook is not saved, you will lose your work.)
- Carefully look through your notebook and verify that all computations execute correctly and all graphics are displayed clearly. You should see **no errors**; if there are any errors, make sure to correct them before you submit the notebook.
- Download the notebook as an `.ipynb` file. This is your backup copy.
- Export the notebook as PDF and upload to Gradescope.

To double-check your work, the cell below will rerun all of the autograder tests.

`grader.check_all()`