Lab Week 03

Overview

Welcome to the Tutorial! This lab builds on the concepts covered in yesterday’s lecture. You have up to 2 hours in the lab to review what you have learned and apply it through a series of exercises. Some code snippets are provided for you to execute, but do not just run them blindly – take the time to understand what each line does.

Today we focus on data visualisation using three Python libraries, each with different strengths. By the end of the lab you will know which package is best suited for a given task and be able to produce common chart types with each.

Duration: approximately 2 hours

What you need:

  • Python 3.10+ with pandas, matplotlib, seaborn, altair, and palmerpenguins installed
  • A Python script, Jupyter notebook, or Quarto document to write your code

Structure:

Part | Topic                   | Approx. Time
1    | Palmer Penguins dataset | 10 min
2    | Matplotlib              | 30 min
3    | Seaborn                 | 40 min
4    | Altair                  | 30 min
-    | Final challenge         | 10 min

A Note on Polars

I promised to prepare material on the Polars package but could not get it ready in time. We will cover Polars in a future lecture.


Part 1: The Palmer Penguins Dataset

About the Data

The palmerpenguins data package (originally an R package, also available in Python) contains measurements that serve as an alternative to Anderson’s classic Iris dataset. It records bill length, bill depth, flipper length, and body mass for three species of penguins: Adelie, Gentoo, and Chinstrap.

The dataset was collected by Dr Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) programme, a member of the Long Term Ecological Research Network funded by the US National Science Foundation. Data were gathered during the breeding seasons from 2007 to 2009 at Palmer Station, Antarctica.

The Palmer Penguins dataset contains 344 observations and 8 variables:

  • species (categorical): the species of penguin (Adelie, Gentoo, or Chinstrap)
  • island (categorical): the island where the penguin was observed (Biscoe, Dream, or Torgersen)
  • bill_length_mm (numerical): the length of the penguin’s bill in millimetres
  • bill_depth_mm (numerical): the depth of the penguin’s bill in millimetres
  • flipper_length_mm (numerical): the length of the penguin’s flipper in millimetres
  • body_mass_g (numerical): the body mass of the penguin in grams
  • sex (categorical): the sex of the penguin (male or female)
  • year (numerical): the year the data was collected (2007, 2008, or 2009)

What are Culmen Length and Depth?

The culmen is the upper ridge of a bird’s beak. In the simplified penguins subset, culmen length and depth have been renamed to bill_length_mm and bill_depth_mm.

Data citation: Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

Licence: Data are available by CC-0 licence in accordance with the Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data.

Task 1: Loading the Data

# Install if needed (run once):
# pip install palmerpenguins

import pandas as pd
import numpy as np
from palmerpenguins import load_penguins

penguins = load_penguins()
print(f"Shape: {penguins.shape}")
penguins.head()

Shape: (344, 8)
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3   Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007

penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB

penguins.describe()

       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g         year
count      342.000000     342.000000         342.000000   342.000000   344.000000
mean        43.921930      17.151170         200.915205  4201.754386  2008.029070
std          5.459584       1.974793          14.061714   801.954536     0.818356
min         32.100000      13.100000         172.000000  2700.000000  2007.000000
25%         39.225000      15.600000         190.000000  3550.000000  2007.000000
50%         44.450000      17.300000         197.000000  4050.000000  2008.000000
75%         48.500000      18.700000         213.000000  4750.000000  2009.000000
max         59.600000      21.500000         231.000000  6300.000000  2009.000000

Your turn: How many missing values are there in each column? Use penguins.isna().sum() to find out. Which columns have missing data?

Task 2: Quick Cleaning

For this lab we will drop rows with missing values so that every plot works cleanly. In a real analysis you might choose to handle these differently.

penguins = penguins.dropna().copy()
print(f"Clean shape: {penguins.shape}")

Clean shape: (333, 8)

# How many penguins per species?
print(penguins["species"].value_counts())

species
Adelie       146
Gentoo       119
Chinstrap     68
Name: count, dtype: int64

# How many per island?
print(penguins["island"].value_counts())

island
Biscoe       163
Dream        123
Torgersen     47
Name: count, dtype: int64
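
Task 2 drops every incomplete row, which is the simplest option. As a rough sketch of two alternatives you might consider in a real analysis, the block below uses a small toy DataFrame (the values are made up for illustration and are not real penguin measurements). It shows dropping rows only when the column of interest is missing, and filling a numeric gap with the median of the same species. Whether imputation is appropriate depends on your analysis; this is only one possible approach.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (NOT real penguin measurements).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3700.0, np.nan, 3900.0, 5000.0, 5200.0],
    "sex": ["male", "female", None, "female", "male"],
})

# Option 1: drop a row only when the column you are about to plot is missing.
mass_only = toy.dropna(subset=["body_mass_g"])

# Option 2: fill a numeric gap with the median of the same species.
species_median = toy.groupby("species")["body_mass_g"].transform("median")
filled = toy.assign(body_mass_g=toy["body_mass_g"].fillna(species_median))

print(len(toy.dropna()), len(mass_only))  # rows kept by each strategy
print(filled["body_mass_g"].tolist())
```

Dropping on a subset keeps more rows than a blanket dropna(), because a missing sex value no longer removes a row that has a valid body mass.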

Part 2: Matplotlib

matplotlib is the foundational plotting library in Python. It is very well tested, robust, and can reproduce just about any plot (sometimes with a lot of effort). The trade-off is that the syntax can be verbose and imperative, and it has limited support for interactive or web graphics out of the box.

You do not need to memorise the syntax for every plotting function. The matplotlib gallery is an excellent reference.

Task 3: Scatter Plot

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))

# Colour each species differently
colours = {"Adelie": "#FF6B35", "Chinstrap": "#7B2D8E", "Gentoo": "#1B998B"}

for species, group in penguins.groupby("species"):
    ax.scatter(
        group["bill_length_mm"],
        group["bill_depth_mm"],
        label=species,
        color=colours[species],
        alpha=0.7,
        edgecolors="white",
        linewidth=0.5
    )

ax.set_xlabel("Bill Length (mm)")
ax.set_ylabel("Bill Depth (mm)")
ax.set_title("Penguin Bill Dimensions by Species", fontweight="bold")
ax.legend(title="Species")
plt.tight_layout()
plt.show()

Notice how we loop through each species group and plot them separately to assign colours. This is the imperative style that is characteristic of matplotlib.
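
For comparison, here is a minimal sketch of a loop-free alternative: map each row's species to a colour with pandas and pass the result to a single scatter call. The toy DataFrame uses made-up values and simply stands in for penguins; the colours dictionary is the one from Task 3.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows with made-up values, standing in for the penguins DataFrame.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.0, 47.0, 49.0, 38.0],
    "bill_depth_mm": [18.5, 15.0, 18.0, 19.0],
})
colours = {"Adelie": "#FF6B35", "Chinstrap": "#7B2D8E", "Gentoo": "#1B998B"}

# Look up one colour per row, then draw all points in a single call.
point_colours = toy["species"].map(colours).tolist()

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(toy["bill_length_mm"], toy["bill_depth_mm"], c=point_colours, alpha=0.7)
ax.set_xlabel("Bill Length (mm)")
ax.set_ylabel("Bill Depth (mm)")
# In a script, call plt.show() to display the figure.
```

The trade-off is that a single scatter call carries no per-species labels, so ax.legend() has nothing to show unless you build the legend handles yourself. The loop in Task 3 gives you the legend without extra work.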

Your turn: Create a scatter plot of flipper_length_mm (x-axis) vs body_mass_g (y-axis), coloured by species. Add axis labels and a legend.

Task 4: Histogram

fig, ax = plt.subplots(figsize=(8, 5))

for species, group in penguins.groupby("species"):
    ax.hist(
        group["body_mass_g"],
        bins=20,
        alpha=0.5,
        label=species,
        color=colours[species],
        edgecolor="white"
    )

ax.set_xlabel("Body Mass (g)")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of Body Mass by Species", fontweight="bold")
ax.legend(title="Species")
plt.tight_layout()
plt.show()
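
One detail worth knowing about the histogram above: because ax.hist is called once per species with bins=20, each call computes its own 20 bins from that species' own range, so the bars of different species do not share bin edges. A possible refinement, sketched below with made-up toy values, is to compute one set of edges from all the data with np.histogram_bin_edges and reuse it for every group.

```python
import numpy as np
import pandas as pd

# Toy values (made up for illustration, not real measurements).
toy = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
    "body_mass_g": [3000.0, 3500.0, 3700.0, 4000.0, 4500.0, 5000.0, 5500.0, 6000.0],
})

# One set of edges computed from ALL values, so every species shares the same bins.
edges = np.histogram_bin_edges(toy["body_mass_g"], bins=6)

# Count each species against the shared edges.
counts = {
    species: np.histogram(group["body_mass_g"], bins=edges)[0]
    for species, group in toy.groupby("species")
}

print(edges)
print(counts)
```

In the lab code you could pass the same array to matplotlib, for example ax.hist(group["body_mass_g"], bins=edges, ...), so that overlapping bars line up exactly.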

Your turn: Create a histogram of flipper_length_mm for each island (not species). Use 15 bins. What can you see about the distributions?

Task 5: Bar Chart

# Mean body mass by species
species_mass = penguins.groupby("species")["body_mass_g"].mean().sort_values()

fig, ax = plt.subplots(figsize=(7, 4))
species_mass.plot.barh(
    ax=ax,
    color=[colours[s] for s in species_mass.index],
    edgecolor="white"
)
ax.set_xlabel("Mean Body Mass (g)")
ax.set_title("Average Body Mass by Species", fontweight="bold")
plt.tight_layout()
plt.show()

Your turn: Create a grouped bar chart showing the mean bill length for each species, split by sex. Hint: use penguins.groupby(["species", "sex"])["bill_length_mm"].mean().unstack() and then .plot.bar().

Task 6: Multi-Panel Figure

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

measurements = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
titles = ["Bill Length (mm)", "Bill Depth (mm)", "Flipper Length (mm)"]

for ax, col, title in zip(axes, measurements, titles):
    for species, group in penguins.groupby("species"):
        ax.hist(group[col], bins=15, alpha=0.5, label=species, color=colours[species])
    ax.set_title(title, fontweight="bold")
    ax.set_ylabel("Frequency")
    ax.legend(fontsize=8)

plt.suptitle("Distributions of Penguin Measurements", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()


Part 3: Seaborn

seaborn is built on top of matplotlib and adds a high-level interface for drawing statistical graphics. It integrates closely with pandas DataFrames and handles grouping, faceting, and colour palettes automatically.

Key tutorial pages to bookmark:

Task 7: Scatter Plot with Regression Line

import seaborn as sns

sns.set_style("whitegrid")

fig, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="species",
    palette=colours,
    alpha=0.7,
    ax=ax
)
ax.set_title("Penguin Bill Dimensions by Species", fontweight="bold")
plt.tight_layout()
plt.show()

Notice how much less code this takes compared to the matplotlib version. Seaborn handles the grouping and legend automatically.

# Add regression lines per species using lmplot
g = sns.lmplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="species",
    palette=colours,
    height=6,
    aspect=1.2
)
g.set_axis_labels("Bill Length (mm)", "Bill Depth (mm)")
g.figure.suptitle("Bill Dimensions with Regression Lines", fontweight="bold", y=1.02)
plt.show()

Simpson’s Paradox

If you were to fit a single regression line across all species, it would slope downwards, suggesting that longer bills have shallower depth. But within each species the relationship is positive. This is an example of Simpson’s Paradox, where a trend that appears in several groups reverses when the groups are combined.
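
To see the reversal in numbers, here is a small sketch using made-up toy data (two hypothetical groups, A and B, not the penguins). It fits a least-squares line with np.polyfit to the pooled data and to each group separately, and compares the slopes.

```python
import numpy as np
import pandas as pd

# Toy data: within each group y rises with x, but group B sits lower and to the right.
toy = pd.DataFrame({
    "group": ["A"] * 3 + ["B"] * 3,
    "x": [1.0, 2.0, 3.0, 6.0, 7.0, 8.0],
    "y": [10.0, 11.0, 12.0, 3.0, 4.0, 5.0],
})


def slope(frame):
    """Slope of the least-squares line y = slope * x + intercept."""
    return np.polyfit(frame["x"], frame["y"], deg=1)[0]


pooled_slope = slope(toy)
group_slopes = {name: slope(group) for name, group in toy.groupby("group")}

print(f"Pooled slope: {pooled_slope:.2f}")  # negative
print({name: round(value, 2) for name, value in group_slopes.items()})  # both positive
```

The pooled line slopes downwards only because the two groups occupy different regions of the plot, which is the same mechanism at work between penguin species.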

Your turn: Create an lmplot of flipper_length_mm vs body_mass_g, coloured by species. Is there a positive relationship within each species?

Task 8: Box Plot and Violin Plot

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
sns.boxplot(
    data=penguins,
    x="species",
    y="body_mass_g",
    hue="species",    # assign hue explicitly: palette without hue is deprecated
    palette=colours,
    legend=False,     # the x-axis already labels the species
    ax=axes[0]
)
axes[0].set_title("Body Mass: Box Plot", fontweight="bold")
axes[0].set_ylabel("Body Mass (g)")

# Violin plot
sns.violinplot(
    data=penguins,
    x="species",
    y="body_mass_g",
    hue="species",    # same fix as for the box plot
    palette=colours,
    legend=False,
    ax=axes[1]
)
axes[1].set_title("Body Mass: Violin Plot", fontweight="bold")
axes[1].set_ylabel("Body Mass (g)")

plt.tight_layout()
plt.show()

Box plots show medians, quartiles, and outliers. Violin plots show the full shape of the distribution. The violin plot reveals, for example, that Gentoo penguins have a bimodal mass distribution (likely reflecting sexual dimorphism).
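
To make the box plot elements concrete, the sketch below computes them by hand for a handful of made-up body masses (toy values, not real data). It assumes the default whisker rule used by matplotlib and seaborn, where points further than 1.5 × IQR beyond the quartiles are drawn as outliers.

```python
import pandas as pd

# Toy body masses in grams (made up for illustration).
masses = pd.Series([3000, 3200, 3400, 3600, 3800, 4000, 6500], dtype=float)

# The box spans the first and third quartiles; the line inside is the median.
q1, median, q3 = masses.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = masses[(masses < lower_fence) | (masses > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers.tolist()}")
```

The whiskers extend to the most extreme observations that still fall inside the fences, so here the upper whisker would end at 4000 g and the 6500 g value would appear as a separate point.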

Your turn: Create a box plot of bill_length_mm by species, split by sex using the hue parameter. Which species shows the largest difference between males and females?

Task 9: Pair Plot

sns.pairplot(
    penguins,
    hue="species",
    palette=colours,
    diag_kind="kde",
    plot_kws={"alpha": 0.6}
)
plt.suptitle("Pair Plot of Penguin Measurements", fontweight="bold", y=1.02)
plt.show()

A pair plot produces a scatter plot for every pair of numerical columns, with the distribution of each column along the diagonal. Note that year is included here because it is stored as an integer. This is a powerful way to spot clusters, correlations, and outliers across multiple dimensions at once.

Your turn: Create a pair plot using only the columns bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g, coloured by island instead of species. Do the islands show distinct clusters?

Task 10: Heatmap of Correlations

# Compute correlation matrix for numeric columns
numeric_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
corr = penguins[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(
    corr,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    center=0,
    linewidths=0.5,
    ax=ax
)
ax.set_title("Correlation Matrix: Penguin Measurements", fontweight="bold")
plt.tight_layout()
plt.show()
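
Each cell in the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. As a quick check of what that number means, the sketch below calculates it by hand for two columns of a toy DataFrame (made-up values, not real measurements) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Toy values (made up for illustration).
toy = pd.DataFrame({
    "flipper_length_mm": [180.0, 190.0, 200.0, 210.0, 220.0],
    "body_mass_g": [3500.0, 3600.0, 4300.0, 4800.0, 5400.0],
})

x = toy["flipper_length_mm"]
y = toy["body_mass_g"]

# Pearson's r: the co-variation of x and y, scaled by the spread of each.
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

# The same quantity as reported by pandas.
r_pandas = toy.corr().loc["flipper_length_mm", "body_mass_g"]

print(round(r_manual, 4), round(r_pandas, 4))
```

Values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear association. As the Simpson's Paradox example showed, a pooled correlation can hide quite different patterns within groups.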

Your turn: Create a separate correlation heatmap for each species. Use a 1x3 subplot layout. Do the correlation patterns differ between species?

Task 11: FacetGrid

g = sns.FacetGrid(
    penguins,
    col="species",
    hue="sex",
    height=4,
    aspect=1
)
g.map_dataframe(sns.scatterplot, x="bill_length_mm", y="bill_depth_mm", alpha=0.7)
g.add_legend()
g.set_axis_labels("Bill Length (mm)", "Bill Depth (mm)")
g.figure.suptitle("Bill Dimensions by Species and Sex", fontweight="bold", y=1.02)
plt.show()

FacetGrid creates a grid of panels, one per category. This is ideal for comparing patterns across groups without overloading a single plot.


Part 4: Altair

Altair takes a fundamentally different approach from matplotlib and seaborn. Instead of writing imperative instructions (“draw a dot here, add a line there”), you write a declarative specification: you describe what the visualisation should show and let the library determine the details. Altair is built on Vega-Lite, a JSON-based grammar of interactive graphics that is rendered in the browser with JavaScript.

# Install if needed (run once):
# pip install altair

import altair as alt

Task 12: Basic Scatter Plot

chart = alt.Chart(penguins).mark_point().encode(
    x="bill_length_mm:Q",
    y="bill_depth_mm:Q",
    color="species:N"
).properties(
    title="Penguin Bill Dimensions by Species",
    width=500,
    height=400
)

chart

The key concepts in Altair are:

  • mark_point(), mark_bar(), mark_line(), etc. define the geometry
  • .encode() maps data columns to visual channels (x, y, colour, size, shape)
  • Type suffixes: :Q (quantitative), :N (nominal/categorical), :O (ordinal), :T (temporal)

Your turn: Modify the chart to use flipper_length_mm on the x-axis and body_mass_g on the y-axis. Add size="body_mass_g:Q" to the encoding. What happens?

Task 13: Interactive Scatter Plot

selection = alt.selection_point(fields=["species"])

chart = alt.Chart(penguins).mark_circle(size=80).encode(
    x="bill_length_mm:Q",
    y="bill_depth_mm:Q",
    color=alt.condition(
        selection,
        "species:N",
        alt.value("lightgrey")
    ),
    tooltip=["species", "island", "bill_length_mm", "bill_depth_mm", "body_mass_g"]
).add_params(
    selection
).properties(
    title="Click a Species to Highlight",
    width=500,
    height=400
)

chart

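Click on a point to highlight every penguin of that species, and hover over points to see tooltips. As written, the selection responds to clicks on points only; to select from the legend instead, pass bind="legend" to alt.selection_point(). This interactivity comes “for free” with Altair – no additional code is needed beyond the selection definition.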

Your turn: Add shape="island:N" to the encoding so that each island uses a different marker shape. Can you tell which island has which species?

Task 14: Bar Chart

chart = alt.Chart(penguins).mark_bar().encode(
    x=alt.X("species:N", title="Species"),
    y=alt.Y("count():Q", title="Count"),
    color="species:N"
).properties(
    title="Number of Penguins by Species",
    width=400,
    height=300
)

chart

# Grouped bar chart: species by sex
chart = alt.Chart(penguins).mark_bar().encode(
    x=alt.X("species:N", title="Species"),
    y=alt.Y("mean(body_mass_g):Q", title="Mean Body Mass (g)"),
    color="sex:N",
    xOffset="sex:N"
).properties(
    title="Mean Body Mass by Species and Sex",
    width=400,
    height=300
)

chart

Your turn: Create a stacked bar chart showing the count of penguins per island, with each bar segmented by species. Hint: encode x="island:N", y="count():Q", and color="species:N".

Task 15: Faceted Charts

chart = alt.Chart(penguins).mark_point(filled=True, size=60).encode(
    x=alt.X("flipper_length_mm:Q", title="Flipper Length (mm)"),
    y=alt.Y("body_mass_g:Q", title="Body Mass (g)"),
    color="sex:N"
).facet(
    column="species:N"
).properties(
    title="Flipper Length vs Body Mass by Species"
)

chart

Altair’s .facet() works like seaborn’s FacetGrid but with a declarative syntax.

Task 16: Histogram and Density

chart = alt.Chart(penguins).mark_bar(opacity=0.6).encode(
    x=alt.X("body_mass_g:Q", bin=alt.Bin(maxbins=30), title="Body Mass (g)"),
    y=alt.Y("count():Q", title="Frequency"),
    color="species:N"
).properties(
    title="Distribution of Body Mass by Species",
    width=500,
    height=300
)

chart

Your turn: Create a layered density plot of bill_length_mm using mark_area(opacity=0.5) with transform_density(). Refer to the Altair density transform docs.


Comparing the Three Libraries

Feature           | matplotlib                        | seaborn                              | Altair
Style             | Imperative (step-by-step)         | High-level statistical               | Declarative (specify what)
Interactivity     | Limited (requires extra work)     | Limited (inherits matplotlib)        | Built-in (hover, click, zoom)
Statistical plots | Manual                            | Built-in (regression, distributions) | Via transforms
Learning curve    | Steeper syntax                    | Easier for stats plots               | Different paradigm
Best for          | Full control, publication figures | Statistical exploration              | Interactive dashboards, web

Which Library Should I Use?

There is no single “best” library. Use matplotlib when you need pixel-level control or are building complex custom layouts. Use seaborn for statistical exploration and when you want to produce common statistical plots quickly. Use Altair when you want interactivity, web-ready outputs, or a concise declarative syntax.


Final Challenge

Putting It All Together

Using any combination of matplotlib, seaborn, and Altair, produce a short visual analysis of the Palmer Penguins data that addresses the following question:

How do the physical characteristics of the three penguin species relate to each other, and do these relationships differ by sex or island?

Your analysis should include:

  1. At least three different chart types (e.g. scatter, box, bar, heatmap, pair plot)
  2. At least two different libraries from today’s lab
  3. At least one interactive chart using Altair
  4. Written interpretation: below each figure, write 2–3 sentences explaining what the visualisation reveals

Ideas to Explore

  • Does Simpson’s Paradox appear in any other variable combinations beyond bill length vs depth?
  • Which measurement best separates the three species?
  • Is sexual dimorphism (size difference between males and females) consistent across species?
  • Do penguins from different islands within the same species differ in size?

Time: approximately 10 minutes


What You Have Learnt

This lab introduced three Python visualisation libraries through the Palmer Penguins dataset:

  1. matplotlib gave you full control over every element of a plot, at the cost of more verbose code. You created scatter plots, histograms, bar charts, and multi-panel figures.
  2. seaborn streamlined statistical plotting with automatic grouping, faceting, and distribution visualisations. You used scatter plots, regression plots, box and violin plots, pair plots, heatmaps, and FacetGrids.
  3. Altair introduced a declarative approach with built-in interactivity, tooltips, and concise specifications. You created interactive scatter plots, bar charts, faceted charts, and histograms.

These visualisation skills complement the data wrangling from Lab Week 02 and will be essential when you begin spatial data visualisation with GeoPandas in the coming weeks.