Lab Week 02

Overview

This lab takes you through a complete data analysis workflow using real hourly pedestrian counts from 21 sensor locations across Auckland’s city centre in 2024. The data are sourced from Heart of the City and record how many people passed each sensor during every hour of the year.

By the end of this lab, you will have practised the core skills from Sections 2.1, 2.2, and 2.3 of the course textbook, all applied to one real-world dataset.

Duration: approximately 2 hours (or longer if you complete all extension tasks)

What you need:

  • Python 3.10+ with pandas, matplotlib, and seaborn installed
  • The file akl_ped-2024.csv (https://raw.githubusercontent.com/dataandcrowd/GISCI343/refs/heads/main/pedestrians/akl_ped-2024.csv)
  • A Python script, Jupyter notebook, or Quarto document to write your code

Structure:

Part   Topic                        Approx. Time
2.1    Python foundations review    30 min
2.2    DataFrames and pandas        60 min
2.3    Data visualisation           30 min
--     Final challenge              20 min

Part 2.1: Python Foundations Review

Before loading the pedestrian dataset, let us ensure you are comfortable with the Python building blocks that underpin everything in Parts 2.2 and 2.3.

Task 1: Variables and Data Types

Create variables that describe one of the pedestrian sensors. Use appropriate types: str for text, int for whole numbers, float for decimal numbers, and bool for true/false.

sensor_name = "45 Queen Street"
sensor_id = 5
avg_hourly_count = 487.3
is_active = True

print(f"Sensor: {sensor_name}")
print(f"ID: {sensor_id}")
print(f"Average hourly count: {avg_hourly_count}")
print(f"Active: {is_active}")
print(f"Type of avg_hourly_count: {type(avg_hourly_count)}")
Sensor: 45 Queen Street
ID: 5
Average hourly count: 487.3
Active: True
Type of avg_hourly_count: <class 'float'>
f-strings

The f"..." syntax (formatted string literals) lets you embed variables directly inside text. The expression inside {} is evaluated and inserted. This is the most readable way to build output messages in Python.
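Inside the braces, an optional format specifier after a colon controls how the value is displayed. You will meet several of these later in the lab; here are three common ones, demonstrated on made-up values:

```python
avg = 487.3
total = 11763
share = 0.42

print(f"Average: {avg:.1f}")   # one decimal place -> Average: 487.3
print(f"Total: {total:,}")     # thousands separator -> Total: 11,763
print(f"Share: {share:.0%}")   # as a percentage -> Share: 42%
```

The `{value:,.0f}` specifier you will see in later tasks combines both ideas: thousands separators and zero decimal places.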

Your turn: Create variables for a different sensor location. Include the sensor’s street name, the number of hours it records per day (integer), and whether it has any missing data (boolean). Print all values using f-strings.

Task 2: Lists

Lists store ordered sequences of items. They are fundamental to working with collections of sensor names, count values, and time periods.

# Auckland CBD sensor locations
locations = [
    "107 Quay Street",
    "45 Queen Street",
    "30 Queen Street",
    "2 High Street",
    "150 K Road"
]

# Access by index (0-based)
print(f"First location: {locations[0]}")
print(f"Last location: {locations[-1]}")
print(f"Total locations: {len(locations)}")

# Slice to get a subset
print(f"Middle three: {locations[1:4]}")
First location: 107 Quay Street
Last location: 150 K Road
Total locations: 5
Middle three: ['45 Queen Street', '30 Queen Street', '2 High Street']
# Hourly counts for 107 Quay Street on 1 Jan 2024 (6am-9pm)
counts = [78, 201, 644, 394, 861, 1283, 1335, 1423, 1159, 1005, 995, 780, 602, 498, 310, 195]
hours = list(range(6, 22))

print(f"Hours recorded: {len(counts)}")
print(f"Maximum count: {max(counts)}")
print(f"Minimum count: {min(counts)}")
print(f"Total daily: {sum(counts):,}")
Hours recorded: 16
Maximum count: 1423
Minimum count: 78
Total daily: 11,763

Your turn: Create a new list that contains only the counts above 500. Use a for loop and an if statement to build the list. How many hours had more than 500 pedestrians?

Task 3: Loops

Loops let you repeat operations across collections. When working with pedestrian data, you will frequently iterate through locations, time periods, or rows of a table.

locations = ["107 Quay Street", "45 Queen Street", "30 Queen Street", "2 High Street", "150 K Road"]

print("=== Sensor Locations ===")
for loc in locations:
    print(f"  - {loc}")

print(f"\nTotal locations: {len(locations)}")
=== Sensor Locations ===
  - 107 Quay Street
  - 45 Queen Street
  - 30 Queen Street
  - 2 High Street
  - 150 K Road

Total locations: 5
# Loop through a list of counts
peak_counts = [1423, 1076, 1245, 129, 381]

total = 0
for count in peak_counts:
    total = total + count
    print(f"  Adding {count:,} -> Running total: {total:,}")

print(f"\nGrand total: {total:,}")
  Adding 1,423 -> Running total: 1,423
  Adding 1,076 -> Running total: 2,499
  Adding 1,245 -> Running total: 3,744
  Adding 129 -> Running total: 3,873
  Adding 381 -> Running total: 4,254

Grand total: 4,254

Your turn: Write a loop that prints only the locations with a peak count above 500. Hint: use zip(locations, peak_counts) to pair each name with its count.
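If zip() is new to you, here is how it pairs items from two lists position by position, using throwaway values unrelated to the exercise:

```python
letters = ["a", "b", "c"]
numbers = [10, 20, 30]

# zip() yields one (letter, number) tuple per position
for letter, number in zip(letters, numbers):
    print(f"{letter}: {number}")
```

The loop body receives both paired values at once, so you can test one and print the other.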

Task 4: Conditionals

Conditionals let you make decisions in code. In pedestrian analysis, you frequently need to classify counts, flag anomalies, or filter by thresholds.

counts = [1423, 1076, 1245, 129, 381, 141, 176]

print("=== Traffic Classification ===")
for count in counts:
    if count > 1000:
        level = "HIGH"
    elif count > 300:
        level = "MEDIUM"
    else:
        level = "LOW"
    print(f"  {count:,} pedestrians -> {level}")
=== Traffic Classification ===
  1,423 pedestrians -> HIGH
  1,076 pedestrians -> HIGH
  1,245 pedestrians -> HIGH
  129 pedestrians -> LOW
  381 pedestrians -> MEDIUM
  141 pedestrians -> LOW
  176 pedestrians -> LOW
# Count how many locations fall into each category
high = 0
med = 0
low = 0

for count in counts:
    if count > 1000:
        high = high + 1
    elif count > 300:
        med = med + 1
    else:
        low = low + 1

print(f"High traffic: {high} locations")
print(f"Medium traffic: {med} locations")
print(f"Low traffic: {low} locations")
High traffic: 3 locations
Medium traffic: 1 locations
Low traffic: 3 locations

Your turn: Change the thresholds so that “HIGH” is above 800, “MEDIUM” is 200 to 800, and “LOW” is below 200. How does the classification change?

Task 5: Combining Concepts

Now combine variables, lists, loops, and conditionals in a single task.

# Hourly data for 107 Quay Street (1 Jan 2024, 6am-9pm)
counts = [78, 201, 644, 394, 861, 1283, 1335, 1423, 1159, 1005, 995, 780, 602, 498, 310, 195]

# Find the peak count and off-peak hours
peak_count = 0
off_peak = []

for count in counts:
    if count > peak_count:
        peak_count = count
    if count < 300:
        off_peak.append(count)

print(f"Peak count: {peak_count:,} pedestrians")
print(f"Number of off-peak hours (< 300): {len(off_peak)}")
print(f"Average count: {sum(counts) / len(counts):,.0f}")
Peak count: 1,423 pedestrians
Number of off-peak hours (< 300): 3
Average count: 735

Your turn: Write a loop that builds a new list containing only the counts above 1,000. How many hours had more than 1,000 pedestrians?


Part 2.2: DataFrames and Pandas

Now we move from basic Python to structured data analysis with pandas. Everything you practised in Part 2.1 (variables, lists, loops, conditionals) will reappear here, but pandas gives you far more powerful tools for working with tabular data.

Task 6: Importing Libraries and Reading Data

import pandas as pd
import numpy as np

The import ... as ... syntax loads an external library and assigns it a short alias. pd for pandas and np for numpy are universal conventions that you will see in virtually every data science tutorial and textbook.

url = "https://raw.githubusercontent.com/dataandcrowd/GISCI343/refs/heads/main/pedestrians/akl_ped-2024.csv"
ped = pd.read_csv(url)
print(f"Shape: {ped.shape}")
print(f"Rows: {ped.shape[0]}, Columns: {ped.shape[1]}")
Shape: (8795, 23)
Rows: 8795, Columns: 23

The .shape attribute returns a tuple of (rows, columns). With 8,795 rows and 23 columns, each row represents one hour at one date, and the 23 columns include Date, Time, and 21 sensor locations. (A leap year contains 366 × 24 = 8,784 hours; the file has slightly more rows because it contains a few non-data separator rows, which we will remove in Task 10.)

ped.head()
Date Time 107 Quay Street Te Ara Tahuhu Walkway Commerce Street West 7 Custom Street East 45 Queen Street 30 Queen Street 19 Shortland Street 2 High Street ... 210 Queen Street 205 Queen Street 8 Darby Street EW 8 Darby Street NS 261 Queen Street 297 Queen Street 150 K Road 183 K Road 188 Quay Street Lower Albert (EW) 188 Quay Street Lower Albert (NS)
0 2024-01-01 00:00:00 6:00-6:59 78.0 53.0 24.0 9.0 87.0 67.0 10.0 11.0 ... 61.0 16.0 8.0 22.0 75.0 77.0 15.0 68.0 53.0 115.0
1 2024-01-01 00:00:00 7:00-7:59 201.0 39.0 29.0 38.0 143.0 123.0 12.0 21.0 ... 97.0 31.0 6.0 21.0 93.0 73.0 21.0 56.0 74.0 81.0
2 2024-01-01 00:00:00 8:00-8:59 644.0 55.0 50.0 52.0 243.0 291.0 62.0 23.0 ... 196.0 46.0 9.0 34.0 186.0 157.0 31.0 75.0 132.0 150.0
3 2024-01-01 00:00:00 9:00-9:59 394.0 94.0 71.0 84.0 399.0 468.0 54.0 26.0 ... 279.0 106.0 20.0 45.0 305.0 245.0 36.0 79.0 185.0 205.0
4 2024-01-01 00:00:00 10:00-10:59 861.0 216.0 120.0 143.0 774.0 755.0 128.0 76.0 ... 604.0 187.0 36.0 88.0 576.0 381.0 84.0 158.0 287.0 239.0

5 rows × 23 columns

Your turn: Use ped.tail() to see the last 5 rows. What date and time does the dataset end on?

Task 7: Exploring the Data

Before any analysis, you need to understand the structure, types, and quality of your data.

ped.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8795 entries, 0 to 8794
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Date                               8784 non-null   object 
 1   Time                               8783 non-null   object 
 2   107 Quay Street                    8783 non-null   float64
 3   Te Ara Tahuhu Walkway              8783 non-null   float64
 4   Commerce Street West               8783 non-null   float64
 5   7 Custom Street East               8783 non-null   float64
 6   45 Queen Street                    8783 non-null   float64
 7   30 Queen Street                    8783 non-null   float64
 8   19 Shortland Street                8783 non-null   float64
 9   2 High Street                      8783 non-null   float64
 10  1 Courthouse Lane                  8783 non-null   float64
 11  61 Federal Street                  8783 non-null   float64
 12  59 High Stret                      8783 non-null   float64
 13  210 Queen Street                   8783 non-null   float64
 14  205 Queen Street                   8783 non-null   float64
 15  8 Darby Street EW                  8783 non-null   float64
 16  8 Darby Street NS                  8783 non-null   float64
 17  261 Queen Street                   8783 non-null   float64
 18  297 Queen Street                   8783 non-null   float64
 19  150 K Road                         8783 non-null   float64
 20  183 K Road                         8783 non-null   float64
 21  188 Quay Street Lower Albert (EW)  8783 non-null   float64
 22  188 Quay Street Lower Albert (NS)  8783 non-null   float64
dtypes: float64(21), object(2)
memory usage: 1.5+ MB

The .info() method tells you: the number of non-null values per column (which reveals missing data), the data type of each column, and the memory usage. Notice that Date is stored as an object (text), not as a datetime. We will fix that shortly.

ped.describe()
107 Quay Street Te Ara Tahuhu Walkway Commerce Street West 7 Custom Street East 45 Queen Street 30 Queen Street 19 Shortland Street 2 High Street 1 Courthouse Lane 61 Federal Street ... 210 Queen Street 205 Queen Street 8 Darby Street EW 8 Darby Street NS 261 Queen Street 297 Queen Street 150 K Road 183 K Road 188 Quay Street Lower Albert (EW) 188 Quay Street Lower Albert (NS)
count 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 ... 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000 8783.000000
mean 353.719003 258.291928 115.975293 319.424001 584.034385 734.438575 181.254355 90.696117 51.633383 154.845269 ... 589.641011 164.907776 77.187863 210.860412 610.111921 401.655699 143.848343 306.153364 204.309348 162.093134
std 330.970832 245.775336 98.407779 267.004082 457.874348 582.625755 164.456019 74.576693 39.821827 112.915587 ... 530.130768 255.473764 60.443651 157.516495 508.645900 310.911067 96.362818 194.502540 184.535930 168.758941
min 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2.000000 0.000000 0.000000
25% 78.000000 33.000000 23.000000 67.000000 121.000000 162.000000 32.000000 21.000000 13.000000 41.000000 ... 97.000000 9.000000 30.000000 65.500000 110.000000 90.000000 51.000000 118.000000 46.000000 29.000000
50% 268.000000 182.000000 100.000000 287.000000 540.000000 698.000000 147.000000 82.000000 49.000000 153.000000 ... 449.000000 73.000000 70.000000 194.000000 546.000000 400.000000 151.000000 320.000000 174.000000 115.000000
75% 531.000000 418.500000 186.000000 512.000000 963.000000 1200.500000 298.000000 139.500000 79.000000 238.000000 ... 997.000000 195.000000 108.000000 328.500000 1038.000000 644.000000 214.000000 463.000000 292.000000 238.000000
max 2023.000000 1698.000000 784.000000 1894.000000 2358.000000 2601.000000 898.000000 579.000000 243.000000 779.000000 ... 2422.000000 1716.000000 791.000000 1455.000000 3059.000000 2430.000000 1021.000000 1642.000000 1257.000000 1072.000000

8 rows × 21 columns

.describe() provides summary statistics for all numeric columns: count, mean, standard deviation, min, max, and quartiles. Scan through these values. Are there any suspiciously low counts (possibly sensor outages) or unusually high values (possible data errors)?
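With 21 sensor columns, the .describe() output above is very wide. An optional trick is to transpose it with .T so each sensor becomes a row, which is much easier to scan. The sketch below uses a small stand-in DataFrame; in the lab you would simply write ped.describe().T:

```python
import pandas as pd

# Stand-in frame; replace `df` with `ped` when working on the lab data
df = pd.DataFrame({"sensor_a": [10, 20, 30], "sensor_b": [5, 50, 500]})

summary = df.describe().T          # transpose: one row per column
print(summary[["mean", "min", "max"]])
```

Transposing does not change any values; it only flips rows and columns of the summary table.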

# How many sensor locations?
sensor_cols = ped.columns[2:]  # Everything after Date and Time
print(f"Sensor locations: {len(sensor_cols)}")
print(f"\nColumn names:")
for col in sensor_cols:
    print(f"  - {col}")
Sensor locations: 21

Column names:
  - 107 Quay Street
  - Te Ara Tahuhu Walkway
  - Commerce Street West
  - 7 Custom Street East
  - 45 Queen Street
  - 30 Queen Street
  - 19 Shortland Street
  - 2 High Street
  - 1 Courthouse Lane
  - 61 Federal Street
  - 59 High Stret
  - 210 Queen Street
  - 205 Queen Street
  - 8 Darby Street EW
  - 8 Darby Street NS
  - 261 Queen Street
  - 297 Queen Street
  - 150 K Road
  - 183 K Road
  - 188 Quay Street Lower Albert (EW)
  - 188 Quay Street Lower Albert (NS)
Spot the Typo

Look carefully at the column names. One of them has a spelling error: “59 High Stret” instead of “59 High Street”. Real-world data is rarely perfectly clean. We will leave it as-is for now, but in a formal analysis you might rename it with ped.rename(columns={"59 High Stret": "59 High Street"}, inplace=True).

Your turn: Which sensor location has the highest mean count? Which has the most missing values? Use .describe() and .isna().sum() to find out.

Task 8: Selecting Columns and Rows

Selecting subsets of data is one of the most frequent operations in pandas.

# Select a single column (returns a Series)
queen45 = ped["45 Queen Street"]
print(type(queen45))
print(queen45.head())
<class 'pandas.core.series.Series'>
0     87.0
1    143.0
2    243.0
3    399.0
4    774.0
Name: 45 Queen Street, dtype: float64
# Select multiple columns (returns a DataFrame)
subset = ped[["Date", "Time", "45 Queen Street", "30 Queen Street"]]
subset.head()
Date Time 45 Queen Street 30 Queen Street
0 2024-01-01 00:00:00 6:00-6:59 87.0 67.0
1 2024-01-01 00:00:00 7:00-7:59 143.0 123.0
2 2024-01-01 00:00:00 8:00-8:59 243.0 291.0
3 2024-01-01 00:00:00 9:00-9:59 399.0 468.0
4 2024-01-01 00:00:00 10:00-10:59 774.0 755.0
# Select by position with .iloc[]
# First 5 rows, first 4 columns
ped.iloc[0:5, 0:4]
Date Time 107 Quay Street Te Ara Tahuhu Walkway
0 2024-01-01 00:00:00 6:00-6:59 78.0 53.0
1 2024-01-01 00:00:00 7:00-7:59 201.0 39.0
2 2024-01-01 00:00:00 8:00-8:59 644.0 55.0
3 2024-01-01 00:00:00 9:00-9:59 394.0 94.0
4 2024-01-01 00:00:00 10:00-10:59 861.0 216.0
# Select by label with .loc[]
# All rows, specific columns
ped.loc[:, ["Date", "Time", "150 K Road"]].head()
Date Time 150 K Road
0 2024-01-01 00:00:00 6:00-6:59 15.0
1 2024-01-01 00:00:00 7:00-7:59 21.0
2 2024-01-01 00:00:00 8:00-8:59 31.0
3 2024-01-01 00:00:00 9:00-9:59 36.0
4 2024-01-01 00:00:00 10:00-10:59 84.0

Your turn: Create a subset containing only the Quay Street sensors (107 Quay Street, 188 Quay Street Lower Albert EW and NS) plus Date and Time. How many columns does your subset have?

Task 9: Filtering Rows with Boolean Indexing

Boolean indexing is the pandas equivalent of the if statements you wrote in Part 2.1, but applied to entire columns at once.

# Where was 45 Queen Street count above 1000?
high_traffic = ped[ped["45 Queen Street"] > 1000]
print(f"Hours with >1000 pedestrians at 45 Queen St: {len(high_traffic)}")
Hours with >1000 pedestrians at 45 Queen St: 1968
# Multiple conditions: use & (and), | (or), with parentheses
busy_both = ped[
    (ped["45 Queen Street"] > 500) &
    (ped["30 Queen Street"] > 500)
]
print(f"Hours where BOTH Queen St sensors > 500: {len(busy_both)}")
Hours where BOTH Queen St sensors > 500: 4527
# The .query() method is an alternative syntax (often more readable)
result = ped.query("`45 Queen Street` > 1000 and `30 Queen Street` > 800")
print(f"Both high: {len(result)} rows")
Both high: 1964 rows
Backticks in .query()

When column names contain spaces, wrap them in backticks inside .query() strings: `45 Queen Street`. This is a common source of errors.

Your turn: Filter the dataset to find all rows where at least one sensor recorded more than 2,000 pedestrians in a single hour. Hint: you can compute a row maximum with ped[sensor_cols].max(axis=1) and use that as a filter condition.

Task 10: Creating New Columns

Adding computed columns transforms raw data into analysis-ready features.

# The CSV contains a "Daylight Savings" marker row and 11 blank separator rows.
# Drop all non-data rows before converting.
ped = ped.dropna(subset=["Date", "Time"]).copy()
ped = ped[ped["Date"] != "Daylight Savings"].copy()

# Convert Date to datetime type
ped["Date"] = pd.to_datetime(ped["Date"])

# Extract temporal features
ped["month"] = ped["Date"].dt.month
ped["day_of_week"] = ped["Date"].dt.day_name()
ped["day_num"] = ped["Date"].dt.dayofweek  # 0=Monday, 6=Sunday

# Extract hour from the Time column
ped["hour"] = ped["Time"].str.split(":").str[0].astype(int)

ped[["Date", "Time", "month", "day_of_week", "hour"]].head(10)
Date Time month day_of_week hour
0 2024-01-01 6:00-6:59 1 Monday 6
1 2024-01-01 7:00-7:59 1 Monday 7
2 2024-01-01 8:00-8:59 1 Monday 8
3 2024-01-01 9:00-9:59 1 Monday 9
4 2024-01-01 10:00-10:59 1 Monday 10
5 2024-01-01 11:00-11:59 1 Monday 11
6 2024-01-01 12:00-12:59 1 Monday 12
7 2024-01-01 13:00-13:59 1 Monday 13
8 2024-01-01 14:00-14:59 1 Monday 14
9 2024-01-01 15:00-15:59 1 Monday 15
# Create a total count across all sensors
ped["total_count"] = ped[sensor_cols].sum(axis=1)

print(f"Highest single-hour total: {ped['total_count'].max():,.0f}")
print(f"Lowest single-hour total: {ped['total_count'].min():,.0f}")
Highest single-hour total: 24,188
Lowest single-hour total: 101

Your turn: Create a boolean column called is_weekend that is True for Saturday and Sunday and False otherwise. Then filter to show only weekend rows and check how many there are.

Task 11: The Importance of .copy()

When you slice a DataFrame, pandas sometimes returns a view (a window into the original) rather than an independent copy. Modifying a view can accidentally change your original data.

# Safe: always use .copy() when creating subsets you will modify
january = ped[ped["month"] == 1].copy()
january["season"] = "Summer"

# The original ped DataFrame is unaffected
print(f"'season' in ped columns: {'season' in ped.columns}")
print(f"'season' in january columns: {'season' in january.columns}")
'season' in ped columns: False
'season' in january columns: True
Rule of Thumb

If you plan to add or modify columns on a subset, always call .copy() first. This avoids the SettingWithCopyWarning and prevents silent data corruption.

Task 12: Grouping and Aggregation

.groupby() splits data into groups, applies a function, and combines results. This is how you answer questions like “what is the average count per month?” or “which hour is busiest on weekdays?”

# Average hourly count by month for 45 Queen Street
monthly_avg = ped.groupby("month")["45 Queen Street"].mean()
print(monthly_avg.round(0))
month
1     587.0
2     644.0
3     599.0
4     596.0
5     583.0
6     542.0
7     591.0
8     567.0
9     546.0
10    563.0
11    619.0
12    575.0
Name: 45 Queen Street, dtype: float64
# Average by hour across all sensors
hourly_avg = ped.groupby("hour")[sensor_cols.tolist()].mean()
hourly_total = hourly_avg.sum(axis=1)
print(f"\nBusiest hour (all sensors): {hourly_total.idxmax()}:00")
print(f"Quietest hour (all sensors): {hourly_total.idxmin()}:00")

Busiest hour (all sensors): 17:00
Quietest hour (all sensors): 4:00
# Multiple aggregations at once
agg_stats = ped.groupby("month")["45 Queen Street"].agg(["mean", "median", "std", "min", "max"])
agg_stats.round(0)
mean median std min max
month
1 587.0 560.0 453.0 5.0 1572.0
2 644.0 640.0 486.0 7.0 1788.0
3 599.0 588.0 458.0 5.0 1843.0
4 596.0 510.0 472.0 4.0 1734.0
5 583.0 508.0 466.0 2.0 1693.0
6 542.0 450.0 446.0 3.0 1612.0
7 591.0 474.0 494.0 2.0 1788.0
8 567.0 456.0 462.0 3.0 1709.0
9 546.0 473.0 438.0 2.0 1638.0
10 563.0 532.0 441.0 4.0 1635.0
11 619.0 618.0 446.0 4.0 1903.0
12 575.0 623.0 423.0 1.0 2358.0

Your turn: Group by day_of_week and calculate the mean total_count. Which day has the highest average footfall across all sensors? Is it a weekday or weekend day?

Task 13: Reshaping Data (Wide to Long)

The CSV is in wide format: each sensor is its own column. Many analysis and plotting tasks work better with long format, where each row represents a single sensor-time observation.

# Melt from wide to long
ped_long = pd.melt(
    ped,
    id_vars=["Date", "Time", "month", "day_of_week", "day_num", "hour"],
    value_vars=sensor_cols.tolist(),
    var_name="location",
    value_name="count"
)

print(f"Wide shape: {ped.shape}")
print(f"Long shape: {ped_long.shape}")
ped_long.head()
Wide shape: (8783, 28)
Long shape: (184443, 8)
Date Time month day_of_week day_num hour location count
0 2024-01-01 6:00-6:59 1 Monday 0 6 107 Quay Street 78.0
1 2024-01-01 7:00-7:59 1 Monday 0 7 107 Quay Street 201.0
2 2024-01-01 8:00-8:59 1 Monday 0 8 107 Quay Street 644.0
3 2024-01-01 9:00-9:59 1 Monday 0 9 107 Quay Street 394.0
4 2024-01-01 10:00-10:59 1 Monday 0 10 107 Quay Street 861.0
# Now we can easily group by location
location_avg = ped_long.groupby("location")["count"].mean().sort_values(ascending=False)
print("Top 5 busiest locations (average hourly count):")
print(location_avg.head().round(0))
Top 5 busiest locations (average hourly count):
location
30 Queen Street     734.0
261 Queen Street    610.0
210 Queen Street    590.0
45 Queen Street     584.0
297 Queen Street    402.0
Name: count, dtype: float64

Your turn: Using ped_long, find the average hourly count for each location during weekday lunchtime hours (12:00 and 13:00, Monday to Friday). Which location is busiest at lunchtime?

Task 14: Handling Missing Data

Real sensor data almost always has gaps: equipment failures, maintenance windows, or transmission errors. After our earlier cleaning step (dropping blank rows and the Daylight Savings marker), this particular dataset happens to be complete. Let us verify that first, and then deliberately introduce gaps so we can practise imputation strategies you will need in other projects.

# Check for missing values across all sensor columns
missing = ped[sensor_cols].isna().sum()
print(f"Total missing cells: {missing.sum()}")
Total missing cells: 0

Good news: zero missing values. But you will rarely be this lucky. To practise imputation, we will create a copy of one sensor’s data and knock out a block of hours to simulate a sensor outage.

# Simulate a sensor outage: remove hours 10-15 on a busy weekday
sensor = "45 Queen Street"
sim = ped[["Date", "Time", "hour", sensor]].copy()

# Pick a Wednesday in March (a busy weekday)
target_date = pd.Timestamp("2024-03-06")
mask = (sim["Date"] == target_date) & (sim["hour"].between(10, 15))
print(f"Rows to blank out: {mask.sum()}")
print(f"True values we are hiding:")
print(sim.loc[mask, ["hour", sensor]].to_string(index=False))

# Replace with NaN
sim.loc[mask, sensor] = np.nan
Rows to blank out: 6
True values we are hiding:
 hour  45 Queen Street
   10            514.0
   11            647.0
   12            799.0
   13            869.0
   14           1130.0
   15           1159.0

Now we have a realistic gap: six consecutive hours of missing data during peak time, but we know the true values. Let us see how different imputation strategies recover them.

# Apply four strategies
sim["fill_zero"] = sim[sensor].fillna(0)
sim["fill_ffill"] = sim[sensor].ffill()
sim["fill_bfill"] = sim[sensor].bfill()
sim["fill_interp"] = sim[sensor].interpolate()
# Compare strategies for the gap period
gap_rows = sim[mask].copy()
gap_rows = gap_rows.rename(columns={sensor: "true_value"})

# Retrieve the true values from the original ped DataFrame
gap_rows["true_value"] = ped.loc[mask, sensor].values

comparison = gap_rows[["hour", "true_value", "fill_zero", "fill_ffill", "fill_bfill", "fill_interp"]]
comparison.columns = ["Hour", "True", "Zero", "Fwd Fill", "Bck Fill", "Interpolate"]
print(comparison.to_string(index=False))
 Hour   True  Zero  Fwd Fill  Bck Fill  Interpolate
   10  514.0   0.0     858.0    1280.0   918.285714
   11  647.0   0.0     858.0    1280.0   978.571429
   12  799.0   0.0     858.0    1280.0  1038.857143
   13  869.0   0.0     858.0    1280.0  1099.142857
   14 1130.0   0.0     858.0    1280.0  1159.428571
   15 1159.0   0.0     858.0    1280.0  1219.714286
# Compute the error for each strategy
for strategy in ["Zero", "Fwd Fill", "Bck Fill", "Interpolate"]:
    mae = (comparison[strategy] - comparison["True"]).abs().mean()
    print(f"  {strategy:12s}  Mean Absolute Error = {mae:,.0f}")
  Zero          Mean Absolute Error = 853
  Fwd Fill      Mean Absolute Error = 200
  Bck Fill      Mean Absolute Error = 427
  Interpolate   Mean Absolute Error = 216
# Visualise the gap and each strategy's attempt to fill it
import matplotlib.pyplot as plt

day = sim[sim["Date"] == target_date].copy()

fig, ax = plt.subplots(figsize=(10, 4))

# True values (from original data)
true_day = ped.loc[ped["Date"] == target_date, ["hour", sensor]]
ax.plot(true_day["hour"], true_day[sensor], "ko-", label="True", linewidth=2, zorder=5)

# Each strategy
ax.plot(day["hour"], day["fill_zero"], "s--", label="Fill 0", alpha=0.7)
ax.plot(day["hour"], day["fill_ffill"], "^--", label="Forward fill", alpha=0.7)
ax.plot(day["hour"], day["fill_bfill"], "v--", label="Backward fill", alpha=0.7)
ax.plot(day["hour"], day["fill_interp"], "D-", label="Interpolate", alpha=0.7)

# Shade the gap
ax.axvspan(9.5, 15.5, color="grey", alpha=0.12, label="Simulated gap")

ax.set_title(f"Imputation strategies: {sensor}, {target_date.date()}", fontweight="bold")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Pedestrian count")
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()

The black line shows the true values we hid. Filling with zero massively undercounts during peak hours. Forward fill flatlines at the 9 AM value, missing the lunchtime surge entirely. Backward fill flatlines at the 4 PM value, which overshoots the morning but undershoots lunch. Interpolation draws a straight line between 9 AM and 4 PM, which at least captures the general trend but still misses the midday peak.

Which Strategy?

There is no single correct answer. Filling with zero assumes sensor silence means no pedestrians, which can severely undercount during busy periods. Forward and backward fill create flat lines that ignore temporal patterns. Interpolation is often a reasonable middle ground for short gaps but struggles with longer ones where the underlying pattern is non-linear. For time series with strong daily rhythms (like pedestrian counts), more sophisticated approaches such as filling with the average for the same hour and day of week can perform better. Always document and justify your choice.
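A minimal sketch of that profile-based idea, using a synthetic three-day series with an identical daily shape (the toy data and variable names here are illustrative, not part of the lab dataset):

```python
import numpy as np
import pandas as pd

# Three days of synthetic hourly counts sharing one daily shape
idx = pd.date_range("2024-03-04", periods=72, freq="h")
daily_shape = np.sin(np.linspace(0, np.pi, 24)) * 100
s = pd.Series(np.tile(daily_shape, 3), index=idx)

s.iloc[30:36] = np.nan  # simulate an outage on day two

# Fill each gap with the mean for the same hour of day on the other days
profile = s.groupby(s.index.hour).transform("mean")
filled = s.fillna(profile)

print(f"Missing before: {s.isna().sum()}, after: {filled.isna().sum()}")
```

Because the daily shape repeats exactly here, the profile recovers the hidden values perfectly; on real data with day-to-day variation it only approximates them, but it respects the daily rhythm in a way that interpolation cannot.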

Your turn: Try a smarter imputation: for each missing hour, fill with the mean count for that same hour across all other Wednesdays. How does the error compare to simple interpolation?

Task 15: Merging DataFrames

In real analyses, you often need to combine data from multiple sources. This task simulates merging sensor metadata with count data.

# Create a simple metadata table
metadata = pd.DataFrame({
    "location": ["107 Quay Street", "45 Queen Street", "30 Queen Street",
                  "150 K Road", "183 K Road", "2 High Street"],
    "zone": ["Waterfront", "CBD Core", "CBD Core", "K Road", "K Road", "CBD Core"],
    "installed_year": [2019, 2018, 2018, 2020, 2020, 2019]
})

metadata
location zone installed_year
0 107 Quay Street Waterfront 2019
1 45 Queen Street CBD Core 2018
2 30 Queen Street CBD Core 2018
3 150 K Road K Road 2020
4 183 K Road K Road 2020
5 2 High Street CBD Core 2019
# Merge with long-format data
merged = ped_long.merge(metadata, on="location", how="inner")
print(f"Merged shape: {merged.shape}")
merged.head()
Merged shape: (52698, 10)
Date Time month day_of_week day_num hour location count zone installed_year
0 2024-01-01 6:00-6:59 1 Monday 0 6 107 Quay Street 78.0 Waterfront 2019
1 2024-01-01 7:00-7:59 1 Monday 0 7 107 Quay Street 201.0 Waterfront 2019
2 2024-01-01 8:00-8:59 1 Monday 0 8 107 Quay Street 644.0 Waterfront 2019
3 2024-01-01 9:00-9:59 1 Monday 0 9 107 Quay Street 394.0 Waterfront 2019
4 2024-01-01 10:00-10:59 1 Monday 0 10 107 Quay Street 861.0 Waterfront 2019
# Now we can group by zone
zone_avg = merged.groupby("zone")["count"].mean().sort_values(ascending=False)
print("Average hourly count by zone:")
print(zone_avg.round(0))
Average hourly count by zone:
zone
CBD Core      470.0
Waterfront    354.0
K Road        225.0
Name: count, dtype: float64
Join Types

how="inner" keeps only locations that appear in both DataFrames. Use how="left" to keep all rows from the left DataFrame (adding NaN where no match exists), how="right" for the opposite, and how="outer" to keep everything from both sides.
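A toy comparison of the join types on two tiny made-up tables (the location names here are placeholders, not real sensors):

```python
import pandas as pd

left = pd.DataFrame({"location": ["A", "B", "C"], "count": [10, 20, 30]})
right = pd.DataFrame({"location": ["B", "C", "D"], "zone": ["X", "Y", "Z"]})

inner = left.merge(right, on="location", how="inner")   # B and C only
left_j = left.merge(right, on="location", how="left")   # A, B, C (A gets NaN zone)
outer = left.merge(right, on="location", how="outer")   # A, B, C, D

print(f"inner: {len(inner)}, left: {len(left_j)}, outer: {len(outer)}")
# inner: 2, left: 3, outer: 4
```

When a merge produces an unexpected row count, the optional indicator=True argument adds a _merge column showing which side each row came from, which makes the mismatch easy to diagnose.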

Your turn: Add more locations to the metadata table (at least 5 more sensors). Assign each to a zone. Re-run the merge and groupby to see how the zone averages change.


Part 2.3: Data Visualisation

With your data loaded, cleaned, and transformed, it is time to make it visual. Visualisation reveals patterns that summary statistics alone cannot convey.

Task 16: Setup and Line Plot

import matplotlib.pyplot as plt
import seaborn as sns

# Clean style for all plots
sns.set_style("whitegrid")
# Daily total for 45 Queen Street
daily_queen = ped.groupby("Date")["45 Queen Street"].sum()

fig, ax = plt.subplots(figsize=(10, 4))
daily_queen.plot(ax=ax, color="#065A82", linewidth=0.7, alpha=0.8)
ax.set_title("45 Queen Street: Daily Pedestrian Count (2024)", fontsize=14, fontweight="bold")
ax.set_xlabel("Date")
ax.set_ylabel("Daily Count")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

What patterns can you see? Look for weekly cycles (dips on weekends), seasonal trends (lower in winter), and any anomalous spikes or drops.
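One way to separate the seasonal trend from the weekly noise is a 7-day rolling mean. The sketch below uses a synthetic four-week series with a fixed weekly pattern as a stand-in for daily_queen; on a perfectly periodic series the rolling mean flattens the weekly cycle completely:

```python
import numpy as np
import pandas as pd

# Four weeks of toy daily totals with a repeating weekend dip
dates = pd.date_range("2024-01-01", periods=28, freq="D")
weekly_pattern = [5000, 6000, 6000, 6000, 7000, 3000, 2000]
daily = pd.Series(np.tile(weekly_pattern, 4), index=dates, dtype=float)

# A centred 7-day window averages out the weekly cycle
smooth = daily.rolling(window=7, center=True).mean()
print(smooth.dropna().unique())  # every full window averages to 5000.0
```

On the real data, overlaying daily_queen.rolling(7, center=True).mean() on the plot above makes the underlying seasonal trend stand out against the weekly ups and downs.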

Your turn: Create the same plot for “107 Quay Street” and “150 K Road” on the same axes. Use different colours and add a legend. Do waterfront and K Road sensors show the same seasonal pattern as Queen Street?

Task 17: Bar Chart Comparing Locations

# Average hourly count per location
avg_by_location = ped[sensor_cols].mean().sort_values()

fig, ax = plt.subplots(figsize=(8, 7))
colours = ["#F96167" if v >= avg_by_location.nlargest(3).min() else "#065A82"
           for v in avg_by_location]
avg_by_location.plot.barh(ax=ax, color=colours)
ax.set_title("Average Hourly Pedestrian Count by Location", fontsize=14, fontweight="bold")
ax.set_xlabel("Average Hourly Count")
plt.tight_layout()
plt.show()

The three sensors with the highest average counts are highlighted in coral. Note the large variation across locations: some sensors record hundreds of people per hour while others see only a few dozen.

Your turn: Create a similar bar chart but showing the standard deviation of counts instead of the mean. Do the most variable locations correspond to the busiest ones?
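A quick way to check the mean-variability relationship before plotting, using toy count data for three hypothetical sensors:

```python
import numpy as np
import pandas as pd

# Toy hourly counts for three hypothetical sensors with different traffic levels
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.poisson(lam=[500, 50, 200], size=(1000, 3)),
                  columns=["Busy", "Quiet", "Medium"])

# For count data, variability tends to grow with the mean,
# so the busiest sensors are usually also the most variable
print(df.mean().sort_values())
print(df.std().sort_values())
```

If the real data behave like this, the standard-deviation ranking should broadly match the mean ranking; any sensor that breaks the pattern is worth a closer look.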

Task 18: Heatmap (Hour vs Day of Week)

A heatmap is an excellent way to see two-dimensional patterns. Here we will plot average pedestrian counts by hour of day and day of week.

# Pivot table: hour (rows) x day_of_week (columns)
pivot = ped.pivot_table(
    values="45 Queen Street",
    index="hour",
    columns="day_of_week",
    aggfunc="mean"
)

# Reorder days logically
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
pivot = pivot[day_order]

fig, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(pivot, cmap="YlOrRd", annot=True, fmt=".0f", linewidths=0.5, ax=ax)
ax.set_title("45 Queen Street: Average Count by Hour and Day", fontsize=14, fontweight="bold")
ax.set_ylabel("Hour of Day")
ax.set_xlabel("Day of Week")
plt.tight_layout()
plt.show()

This heatmap reveals the temporal rhythm of the street: strong weekday lunchtime peaks, lower weekend activity, and near-zero counts in the early morning hours.

Your turn: Create the same heatmap for “150 K Road”. How does K Road’s temporal pattern differ from Queen Street’s? Does the weekend pattern differ?

Task 19: Box Plots by Month

fig, ax = plt.subplots(figsize=(10, 5))
month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
ped["month_name"] = ped["month"].map(dict(zip(range(1, 13), month_names)))
month_order = month_names[:ped["month"].max()]

sns.boxplot(data=ped, x="month_name", y="45 Queen Street",
            hue="month_name", order=month_order,
            palette="coolwarm", legend=False, ax=ax)
ax.set_title("Monthly Distribution: 45 Queen Street", fontsize=14, fontweight="bold")
ax.set_xlabel("Month")
ax.set_ylabel("Hourly Count")
plt.tight_layout()
plt.show()

Note: recent versions of seaborn deprecate passing palette without hue, so we assign the x variable to hue and set legend=False to colour each box without adding a redundant legend.

Box plots show the distribution (median, quartiles, and outliers) for each month. This helps you see whether the variability changes seasonally, not just the average.

Your turn: Create box plots grouped by day_of_week instead of month. Use the order=day_order parameter to keep days in logical sequence.

Task 20: Multi-Panel Figure

Combining multiple plots into a single figure is essential for reports and presentations.

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Panel 1: Daily trend
daily_queen.plot(ax=axes[0, 0], color="#065A82", linewidth=0.6)
axes[0, 0].set_title("Daily Total (45 Queen St)", fontweight="bold")
axes[0, 0].set_xlabel("")

# Panel 2: Hourly average profile
hourly_profile = ped.groupby("hour")["45 Queen Street"].mean()
hourly_profile.plot.bar(ax=axes[0, 1], color="#1C7293", width=0.8)
axes[0, 1].set_title("Average Hourly Profile", fontweight="bold")
axes[0, 1].set_xlabel("Hour")
axes[0, 1].set_ylabel("Avg Count")

# Panel 3: Weekday vs Weekend comparison
weekday_mask = ped["day_num"] < 5
weekday_profile = ped[weekday_mask].groupby("hour")["45 Queen Street"].mean()
weekend_profile = ped[~weekday_mask].groupby("hour")["45 Queen Street"].mean()
axes[1, 0].plot(weekday_profile.index, weekday_profile.values, label="Weekday", color="#065A82", linewidth=2)
axes[1, 0].plot(weekend_profile.index, weekend_profile.values, label="Weekend", color="#F96167", linewidth=2)
axes[1, 0].legend()
axes[1, 0].set_title("Weekday vs Weekend Profile", fontweight="bold")
axes[1, 0].set_xlabel("Hour")
axes[1, 0].set_ylabel("Avg Count")

# Panel 4: Top 5 locations bar chart
top5 = ped[sensor_cols].mean().nlargest(5).sort_values()
top5.plot.barh(ax=axes[1, 1], color=["#1C7293", "#1C7293", "#1C7293", "#065A82", "#065A82"])
axes[1, 1].set_title("Top 5 Busiest Locations", fontweight="bold")
axes[1, 1].set_xlabel("Avg Hourly Count")

plt.suptitle("Auckland CBD Pedestrian Analysis (2024)", fontsize=16, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

Your turn: Replace one of the four panels with a visualisation of your own design. Ideas: a scatter plot of weekday vs weekend averages by location, a stacked area chart of hourly counts, or a violin plot by month.
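For the first idea, a scatter of weekday vs weekend averages, here is a sketch with synthetic per-location averages (in the lab, compute them from `ped[weekday_mask][sensor_cols].mean()` and `ped[~weekday_mask][sensor_cols].mean()`):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic per-location averages standing in for the real values
rng = np.random.default_rng(1)
locations = [f"Sensor {i}" for i in range(1, 11)]
weekday_avg = pd.Series(rng.uniform(50, 600, 10), index=locations)
weekend_avg = weekday_avg * rng.uniform(0.4, 1.1, 10)  # weekends usually quieter

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(weekday_avg, weekend_avg, color="#065A82")
lim = float(max(weekday_avg.max(), weekend_avg.max())) * 1.05
ax.plot([0, lim], [0, lim], ls="--", color="grey", label="weekday = weekend")
ax.set_xlabel("Weekday Avg Hourly Count")
ax.set_ylabel("Weekend Avg Hourly Count")
ax.set_title("Weekday vs Weekend Averages by Location")
ax.legend()
plt.tight_layout()
```

The dashed diagonal is the key design choice: points below it are quieter on weekends, points above it are busier, so office-dominated and leisure-dominated locations separate visually.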


Final Challenge

Putting It All Together

Using everything from Parts 2.1, 2.2, and 2.3, produce a short analysis (in the same notebook) that answers the following question:

How do pedestrian patterns differ between weekdays and weekends across Auckland CBD, and do these differences vary by location type?

Your analysis should include:

  1. Data preparation (Part 2.1 & 2.2): Load the CSV, create temporal features, handle any missing values, and create a long-format version with zone metadata.

  2. At least three different visualisations (Part 2.3):

    • One comparing weekday vs weekend patterns over time (e.g., line plot or area chart)
    • One comparing locations or zones (e.g., grouped bar chart or heatmap)
    • One of your own design that adds insight

  3. Written interpretation: Below each figure, write 2-3 sentences explaining what the visualisation shows and what it means for understanding pedestrian behaviour in Auckland.

  4. A concluding paragraph summarising your key findings and suggesting one follow-up question that could be investigated with additional data (e.g., weather, events, public transport schedules).
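For step 1, the long-format reshape plus zone merge can be sketched on a tiny wide-format frame standing in for ped (the real data have 21 sensor columns and one row per hour):

```python
import pandas as pd

# Tiny wide-format stand-in for `ped`
wide = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01"],
    "hour": [8, 9],
    "45 Queen Street": [420, 510],
    "150 K Road": [120, 150],
})

# Wide -> long: one row per (timestamp, location) pair
long_ped = wide.melt(id_vars=["Date", "hour"],
                     var_name="location", value_name="count")

# Attach zone metadata with a left join so unmatched locations survive as NaN
meta = pd.DataFrame({"location": ["45 Queen Street", "150 K Road"],
                     "zone": ["CBD Core", "K Road"]})
long_ped = long_ped.merge(meta, on="location", how="left")
print(long_ped)
```

Once the data are long, a single groupby(["zone", "hour"]) (plus a weekday/weekend flag) gives you the aggregates behind all three required visualisations.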

Suggested Zone Classification

You can use or modify this categorisation for the merge:

Zone Sensors
Waterfront 107 Quay Street, Te Ara Tahuhu Walkway, 188 Quay St (both)
CBD Core 45 Queen Street, 30 Queen Street, 210 Queen Street, 205 Queen Street
Upper Queen 261 Queen Street, 297 Queen Street
K Road 150 K Road, 183 K Road
Side Streets Commerce Street West, 7 Custom Street East, 19 Shortland Street, 2 High Street, 1 Courthouse Lane, 61 Federal Street, 59 High Street, 8 Darby Street (both)
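One way to turn this table into a metadata DataFrame for the merge, abbreviated here to a few sensors (extend the dict yourself, and make sure each name matches your CSV's column headers exactly):

```python
import pandas as pd

# Zone lookup built from the suggested classification table (abbreviated)
zones = {
    "107 Quay Street": "Waterfront",
    "45 Queen Street": "CBD Core",
    "30 Queen Street": "CBD Core",
    "261 Queen Street": "Upper Queen",
    "150 K Road": "K Road",
    "183 K Road": "K Road",
    "2 High Street": "Side Streets",
}
meta = pd.DataFrame(list(zones.items()), columns=["location", "zone"])
print(meta["zone"].value_counts())
```

A dict keeps the sensor-to-zone mapping easy to read and edit; converting it to a two-column DataFrame gives you the right shape for merge(on="location").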

Time: approximately 20 minutes


What You Have Learnt

This lab covered the complete data analysis workflow from raw CSV to publication-quality figures:

  1. Part 2.1 refreshed your Python fundamentals (variables, lists, loops, conditionals) using Auckland pedestrian sensor data as context.
  2. Part 2.2 introduced pandas DataFrames for reading, exploring, filtering, grouping, reshaping, handling missing data, and merging.
  3. Part 2.3 demonstrated how to create line plots, bar charts, heatmaps, box plots, and multi-panel figures using matplotlib and seaborn.

These skills form the foundation for Section 2.4, where you will add the spatial dimension through GeoPandas, and for the Auckland Footfall Analytics assignment.