2.3. Geospatial Data
Introduction
Geographic Information Systems (GIS) combine spatial data (where things are) with attribute data (what things are). In this section, you’ll learn the fundamentals of working with geospatial data in Python, focusing on vector data analysis using the geopandas library.
What is GIS?
Roger Tomlinson coined the term “GIS” (Geographic Information Systems), defining it as:
A computer system for capturing, storing, checking, and displaying data related to positions on Earth’s surface.
All data in GIS are georeferenced, meaning each feature has:
- Attribute information: What it is (name, type, value, etc.)
- Spatial information: Where it is (coordinates, location)
GIS Applications
GIS enables us to answer complex spatial questions. For example, with crime data we can explore:
- Where are the most vulnerable communities located?
- Why do crimes occur in one area and not another?
- How do offenders travel to crime locations?
- Where are there more or fewer stop and searches than expected?
Types of Maps
Reference Maps
Reference maps are used to communicate location and static data points:
- Purpose: To ‘pin point’ data on a map
- Nature: Descriptive
- Examples: Road maps, topographic maps, navigation tools
Thematic Maps
Thematic maps highlight spatial relationships and patterns:
- Purpose: To ‘study a theme’ within spatial data
- Nature: Explanatory
- Examples: Choropleth maps, heat maps, cartograms
Transport maps like Auckland’s rail network can be both:
- Reference: They show station locations and line routes
- Thematic: They can reveal patterns in life expectancy, poverty, or house prices along transport corridors
Geospatial Data Types
Vector Data
Vector data represents discrete features using points, lines, and polygons:
- Points: Individual locations (cities, sensors, trees)
- Lines: Linear features (roads, rivers, routes)
- Polygons: Areas with boundaries (parcels, countries, lakes)
Vector data is preferred in social sciences because human settlements and processes tend to have discrete borders.
Raster Data
Raster data represents continuous surfaces using a grid of pixels:
- Each pixel has a value
- Commonly used for: satellite imagery, elevation models, temperature surfaces
- Dominated in environmental sciences due to remote sensing
File Formats
Vector Formats
- Shapefile (.shp): Traditional GIS format (requires multiple files: .shp, .shx, .dbf, .prj)
- GeoJSON (.geojson): Web-friendly, human-readable format
- GeoPackage (.gpkg): Modern, single-file format supporting multiple layers
- KML/KMZ: Google Earth format
Raster Formats
- GeoTIFF (.tif): Standard format with embedded spatial information
- NetCDF: Multi-dimensional arrays (climate data)
- HDF5: Hierarchical data format
Coordinate Reference Systems (CRS)
Understanding Projections
The Earth is a 3D sphere (technically an oblate spheroid), but maps are 2D. This creates distortion.
Common CRS systems:
- EPSG:4326 (WGS84): Unprojected geographic coordinates (latitude/longitude in degrees)
- EPSG:3857: Web Mercator (used by Google Maps, metres)
- EPSG:2193: NZTM (New Zealand Transverse Mercator, metres)
Always check and match CRS when combining datasets. Mixing CRS will give incorrect results.
Working with GeoPandas
Installation and Setup
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point, LineString, PolygonReading Spatial Data
# Read from various formats
countries = gpd.read_file('countries.shp')
cities = gpd.read_file('cities.geojson')
regions = gpd.read_file('regions.gpkg')
# Read from URL
url = 'https://example.com/data.geojson'
gdf = gpd.read_file(url)Exploring GeoDataFrames
# View first few rows
countries.head()
# Check the CRS
countries.crs
# Get column information
countries.columns
# Summary statistics
countries.describe()
# Access geometry column
countries.geometryBasic Operations
Filtering Data
# Filter by attribute
oceania = countries[countries['continent'] == 'Oceania']
# Filter by spatial relationship
# Select countries that contain a specific point
auckland = Point(174.76, -36.85)
nz = countries[countries.contains(auckland)]Creating Geometries
# Create a point
point = Point(174.76, -36.85)
# Create a line
line = LineString([(174.76, -36.85), (174.77, -36.86)])
# Create a polygon
polygon = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
# Create a GeoDataFrame from scratch
gdf = gpd.GeoDataFrame(
{'name': ['Auckland'],
'population': [1672000]},
geometry=[Point(174.76, -36.85)],
crs='EPSG:4326'
)Spatial Relationships
GeoPandas provides functions to test spatial relationships:
contains(): Does geometry A contain geometry B?within(): Is geometry A within geometry B?intersects(): Do geometries A and B intersect?touches(): Do geometries share a boundary but not area?crosses(): Do geometries cross each other?overlaps(): Do geometries overlap partially?disjoint(): Are geometries completely separate?
Example: Finding Which Country Contains a City
# Create a point for New York
new_york = Point(-74.006, 40.7128)
# Find the country containing New York
country_with_ny = countries[countries.contains(new_york)]
print(country_with_ny['name'].values) # Output: ['United States']
# Test if New York is within the USA
usa = countries[countries['name'] == 'United States'].geometry.iloc[0]
new_york.within(usa) # Returns: TrueSpatial Joins
Spatial joins combine attributes from two datasets based on spatial relationships:
# Join cities to countries based on spatial location
joined = gpd.sjoin(
cities, # Left GeoDataFrame
countries, # Right GeoDataFrame
predicate='within', # Spatial relationship
how='left' # Join type (left, right, inner)
)
# Now cities have country attributes
print(joined[['city_name', 'country_name', 'continent']])Common Spatial Join Use Cases
- Assigning census data to geographic units
- Finding which administrative area a point falls within
- Counting features within polygons
- Aggregating data by spatial location
Coordinate Transformations
Reprojecting Data
# Check current CRS
print(gdf.crs)
# Reproject to a different CRS
gdf_projected = gdf.to_crs('EPSG:3857') # Web Mercator
# Reproject to match another dataset
gdf_matched = gdf.to_crs(other_gdf.crs)- For distance/area calculations: Use a projected CRS (in metres)
- For web mapping: Use EPSG:3857
- For spatial operations: Ensure all datasets have the same CRS
Geometric Operations
Buffering
Create a zone around features:
# Create 10km buffer around cities (data must be in metres)
cities_projected = cities.to_crs('EPSG:3857')
cities_buffered = cities_projected.copy()
cities_buffered['geometry'] = cities_buffered.buffer(10000) # 10km in metresDissolve
Combine features based on an attribute:
# Dissolve countries by continent
continents = countries.dissolve(by='continent', aggfunc='sum')Overlay Operations
Combine geometries from two datasets:
# Intersection: areas where both datasets overlap
intersection = gpd.overlay(gdf1, gdf2, how='intersection')
# Union: all areas from both datasets
union = gpd.overlay(gdf1, gdf2, how='union')
# Difference: areas in gdf1 not in gdf2
difference = gpd.overlay(gdf1, gdf2, how='difference')Visualisation
Basic Plotting
# Simple plot
countries.plot()
# Customised plot
fig, ax = plt.subplots(figsize=(10, 10))
countries.plot(ax=ax,
column='pop_est', # Colour by population
cmap='YlOrRd', # Colour scheme
legend=True, # Show legend
edgecolor='black') # Border colour
ax.set_title('World Population by Country')
plt.show()Choropleth Maps
# Create a classified choropleth map
countries.plot(column='gdp_per_capita',
scheme='quantiles', # Classification scheme
k=5, # Number of classes
cmap='viridis',
legend=True,
legend_kwds={'title': 'GDP per Capita (Quantiles)'})Multiple Layers
# Plot multiple layers together
fig, ax = plt.subplots(figsize=(12, 8))
# Base layer
countries.plot(ax=ax, color='lightgrey', edgecolor='black')
# Add cities
cities.plot(ax=ax, color='red', markersize=20, alpha=0.6)
ax.set_title('World Countries and Major Cities')
ax.set_axis_off()
plt.show()The Modifiable Areal Unit Problem (MAUP)
MAUP refers to the sensitivity of statistical results to the choice of spatial units:
- Scale Effect: Changing the size of spatial units affects results
- Zoning Effect: Different arrangements of the same-sized units produce different results
When aggregating or analysing spatial data, remember that your choice of geographic units can influence your findings. Always consider whether your results might change with different spatial aggregations.
Practical Exercise
Example: Analysing New Zealand Regions
# Load NZ regional data
nz_regions = gpd.read_file('nz_regions.gpkg')
# Check CRS
print(nz_regions.crs)
# Reproject to NZTM for accurate measurements
nz_regions = nz_regions.to_crs('EPSG:2193')
# Calculate area in square kilometres
nz_regions['area_km2'] = nz_regions.geometry.area / 1_000_000
# Calculate population density
nz_regions['pop_density'] = (nz_regions['population'] /
nz_regions['area_km2'])
# Find regions within 100km of Auckland
auckland_point = gpd.GeoDataFrame(
{'name': ['Auckland']},
geometry=[Point(1756673, 5917750)], # NZTM coordinates
crs='EPSG:2193'
)
# Buffer Auckland
auckland_buffer = auckland_point.buffer(100000) # 100km
# Spatial join
nearby_regions = nz_regions[
nz_regions.intersects(auckland_buffer.iloc[0])
]
# Visualise
fig, ax = plt.subplots(figsize=(10, 12))
nz_regions.plot(ax=ax, column='pop_density',
cmap='YlOrRd', legend=True)
auckland_buffer.plot(ax=ax, facecolor='none',
edgecolor='blue', linewidth=2)
plt.title('NZ Population Density and 100km Buffer from Auckland')
plt.show()Summary
In this section, you’ve learned:
- Core GIS concepts and the difference between vector and raster data
- How to work with geospatial data using geopandas
- Coordinate reference systems and why they matter
- Spatial relationships and how to test them
- Common spatial operations: buffers, dissolves, overlays
- How to create effective maps and visualisations
- The importance of considering spatial scale (MAUP)
Further Reading
- GeoPandas Documentation
- Shapely Manual
- Geocomputation with Python: https://py.geocompx.org/
- Geographic Data Science with Python: https://geographicdata.science/book
Practice Exercises
- Load a shapefile of your choice and explore its attributes and geometry
- Reproject the data to at least two different CRS and observe the differences
- Create a 5km buffer around all points in a dataset
- Perform a spatial join to combine two datasets
- Create a choropleth map showing a variable of interest with a suitable classification scheme