Hernán Galileo

Building a Geospatial Transit API for Mexico City

2026-04-22T00:00:00+00:00

The problem

Mexico City has one of the world’s largest public transit systems — Metro, Metrobús, Cablebús, Tren Ligero — but its official data is scattered across PDFs, shapefiles, and inconsistent web portals. If you want to answer a spatial question like “which Metro stations are within 800m of a Metrobús corridor?”, you’d normally spend hours just wrangling the raw data.

For my Master’s research in Urban Planning at UNAM, I needed a clean, queryable representation of this network. So I built Apimetro.

Architecture overview

The stack is intentionally simple:

Flask (REST API)
  └── GeoPandas (spatial dataframes)
  └── PostGIS (persistent spatial storage)
  └── GeoJSON (wire format)

All transit geometries (routes as LineStrings, stations as Points) are stored in PostGIS in SRID 4326, that is, WGS 84 longitude/latitude. Flask endpoints expose them as GeoJSON, so any GIS client (QGIS, Leaflet, Mapbox, deck.gl) can consume them without transformation.

Spatial SQL in action

One of the most useful queries is finding transit interchanges — places where two different lines are within walking distance:

SELECT 
  a.station_name,
  a.line,
  b.station_name AS nearby_station,
  b.line AS nearby_line,
  ST_Distance(
    a.geom::geography,
    b.geom::geography
  ) AS distance_m
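FROM metro_stations a
JOIN metrobus_stations b
  -- the tables hold different systems, so no line filter is needed;
  -- comparing line labels would drop pairs such as Metro 1 / Metrobús 1
  ON ST_DWithin(a.geom::geography, b.geom::geography, 400)
ORDER BY distance_m;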

Casting to ::geography makes ST_Distance and ST_DWithin work in meters on the spheroid instead of in degrees, so the 400 in the join really means 400 m. On plain SRID 4326 geometry the same call would read 400 as degrees and match every pair of stations.
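To see why a threshold in degrees is unreliable at Mexico City's latitude, here is a minimal sketch using the haversine formula. It is a spherical approximation, not the spheroidal calculation PostGIS performs on geography values, and the sample point is illustrative rather than taken from the dataset.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Illustrative point near central Mexico City (an assumption, not dataset data).
lon, lat = -99.13, 19.43
step = 0.004  # the same step, in degrees, applied along each axis

north_south_m = haversine_m(lon, lat, lon, lat + step)
east_west_m = haversine_m(lon, lat, lon + step, lat)
print(f"0.004 deg north-south: {north_south_m:.0f} m")
print(f"0.004 deg east-west:   {east_west_m:.0f} m")
```

The same 0.004 degrees comes out at roughly 445 m north-south but only about 419 m east-west, which is why distance thresholds belong in meters.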

What GeoPandas brings to the table

Before pushing to PostGIS, I process raw shapefiles with GeoPandas:

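import geopandas as gpd
from sqlalchemy import create_engine

# Placeholder connection string; substitute real credentials.
engine = create_engine("postgresql://user:password@localhost:5432/apimetro")

metro = gpd.read_file("metro_stations.shp")
metro_m = metro.to_crs(epsg=6372)                # meter-based CRS, so buffer() takes meters
metro["buffer_wkt"] = metro_m.geometry.buffer(400).to_crs(epsg=4326).to_wkt()  # 400 m buffer, kept as WKT text
metro = metro.to_crs(epsg=4326).rename_geometry("geom")  # normalize CRS; match the SQL column name
metro.to_postgis("metro_stations", engine, if_exists="replace")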

The CRS normalization step is critical — the source data often comes in EPSG:6372 (Mexico’s national projection) and needs to be converted before spatial joins with OSM or GTFS data.

Lessons learned

1. Always validate your geometry before inserting

PostGIS will accept invalid geometries, but spatial functions may then raise errors or quietly return wrong results. Run ST_IsValid() after every import, and repair whatever it flags with ST_MakeValid().
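A cheap sanity check on the Python side can catch the most common import mistakes (swapped axes, coordinates left in a projected CRS, degenerate lines) before the data reaches the database. The sketch below is an illustration only: the function name is hypothetical, it handles only GeoJSON-style Point and LineString dicts, and it does not replace ST_IsValid().

```python
import math


def coordinate_problems(geometry):
    """Return a list of problems found in a GeoJSON Point or LineString dict."""
    gtype = geometry.get("type")
    coords = geometry.get("coordinates")
    if gtype == "Point":
        points = [coords]
    elif gtype == "LineString":
        points = list(coords or [])
    else:
        return [f"unsupported geometry type: {gtype!r}"]

    problems, good = [], []
    for p in points:
        if not isinstance(p, (list, tuple)) or len(p) < 2:
            problems.append(f"malformed position: {p!r}")
            continue
        lon, lat = p[0], p[1]
        if not all(isinstance(v, (int, float)) and math.isfinite(v) for v in (lon, lat)):
            problems.append(f"non-numeric or non-finite coordinate: {p!r}")
        elif not (-180 <= lon <= 180 and -90 <= lat <= 90):
            problems.append(f"outside lon/lat range (swapped axes or wrong CRS?): {p!r}")
        else:
            good.append((lon, lat))

    if gtype == "LineString" and len(set(good)) < 2:
        problems.append("LineString needs at least two distinct positions")
    return problems


ok = coordinate_problems({"type": "Point", "coordinates": [-99.13, 19.43]})
swapped = coordinate_problems({"type": "Point", "coordinates": [19.43, -99.13]})
degenerate = coordinate_problems(
    {"type": "LineString", "coordinates": [[-99.13, 19.43], [-99.13, 19.43]]}
)
```

Here `ok` is empty, `swapped` reports the latitude/longitude mix-up, and `degenerate` flags a line whose two positions coincide.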

2. GeoJSON is your best friend for APIs

Don't serialize geometries as WKT strings inside JSON; use GeoJSON natively. With Flask-SQLAlchemy and GeoAlchemy2 you can call PostGIS's ST_AsGeoJSON from the query, so the database does the serialization for you.
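As a rough sketch of what an endpoint has to return, the function below wraps rows in a GeoJSON FeatureCollection using only the standard library. The row shape (a geometry JSON string such as ST_AsGeoJSON would produce, plus a dict of attributes), the function name, and the sample coordinates are assumptions for illustration, not Apimetro's actual code.

```python
import json


def to_feature_collection(rows):
    """Wrap (geometry_json, properties) pairs in a GeoJSON FeatureCollection."""
    features = [
        {
            "type": "Feature",
            "geometry": json.loads(geom_json),  # geometry arrives as a JSON string
            "properties": dict(props),
        }
        for geom_json, props in rows
    ]
    return {"type": "FeatureCollection", "features": features}


# Stand-in for rows from a query selecting ST_AsGeoJSON(geom) plus attributes.
# The coordinates are approximate and only illustrative.
rows = [
    (
        '{"type": "Point", "coordinates": [-99.0722, 19.4154]}',
        {"station_name": "Pantitlán", "line": "1"},
    ),
]

body = json.dumps(to_feature_collection(rows), ensure_ascii=False)
print(body)
```

A Flask view would return `body` with the `application/geo+json` content type; clients such as Leaflet can load it directly.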

3. Spatial indexes are not optional

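A GiST index on the geometry column turns a 30-second ST_DWithin scan into a 200 ms lookup on a table with 150,000 transit stops. The index has to match the expression being queried: a query that casts to ::geography needs an index built on that same cast.

CREATE INDEX idx_metro_geom ON metro_stations USING GIST (geom);
CREATE INDEX idx_metro_geog ON metro_stations USING GIST ((geom::geography));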

What’s next

The next phase is integrating GTFS feed data for real-time frequency analysis — answering not just where stations are but how often each line runs and what the effective coverage area is at different time windows.

If you’re working on urban mobility data or transit APIs, check out the project:
github.com/galigaribaldi/Apimetro

VFT Model — Report 00: Building a Topological Transit Graph

2026-04-22T00:00:00+00:00

Report 00: Building a Topological Transit Graph for Mexico City

The foundational challenge in transit network analysis is deceptively simple: how do you turn a collection of GPS coordinates, stops, and route shapes into a computable graph? The Vanishing Fig-Tree Model (VFT Model) addresses this in its first report by constructing a directed topological representation of Mexico City’s multimodal transit network.

Full technical results: Reporte 00 — VFT Model Notebooks

The Problem: From GTFS to a Graph

GTFS (General Transit Feed Specification) feeds describe transit systems as sequences of stops and route geometries — not as connected graph topologies. Two stations on different lines may be physically a few meters apart, but appear as entirely separate nodes in the raw data. This is the phantom node problem: without resolving it, any graph-based analysis produces broken or disconnected paths.

Consider the Pantitlán interchange, where Metro Lines 1, 5, 9, and A converge. In raw GTFS data, each line registers its own stop coordinates independently. Without spatial preprocessing, Pantitlán appears as four separate nodes with no edges between them — analytically invisible as a transfer hub.

Logical Snapping

The solution implemented in Report 00 is logical snapping: a spatial preprocessing step that merges nodes within a configurable distance threshold (ε). Unlike exact coordinate matching, this algorithm handles GPS noise and inconsistent data entry gracefully:

1. Build a spatial index (R-tree) over all stop coordinates
2. Identify all node pairs within ε meters of each other
3. Collapse each cluster into a single representative node, preserving all incoming and outgoing edge connections

The threshold ε is tuned per transport mode. Metro stations use a tighter ε than surface-level RTP stops, which have higher GPS variance.
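The snapping steps can be sketched in plain Python. This is not the Report 00 implementation: it swaps the R-tree for a uniform grid hash, uses union-find to collapse chains of nearby pairs into clusters, applies a single ε instead of a per-mode one, and assumes coordinates already projected to meters. The stop ids, offsets, and the 50 m threshold are invented for illustration.

```python
import math
from collections import defaultdict


def snap_stops(stops, eps_m):
    """Map each stop id to the representative id of its cluster.

    stops: dict of stop_id -> (x, y) in a projected, meter-based CRS.
    Stops closer than eps_m, directly or through a chain of neighbors,
    end up with the same representative (the smallest id in the cluster).
    """
    parent = {sid: sid for sid in stops}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # Grid with cell size eps_m: two points within eps_m of each other
    # always lie in the same cell or in adjacent cells.
    grid = defaultdict(list)
    for sid, (x, y) in stops.items():
        grid[(math.floor(x / eps_m), math.floor(y / eps_m))].append(sid)

    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for other in grid.get((cx + dx, cy + dy), ()):
                    for sid in members:
                        if sid < other and math.dist(stops[sid], stops[other]) <= eps_m:
                            root_a, root_b = find(sid), find(other)
                            if root_a != root_b:
                                low, high = sorted((root_a, root_b))
                                parent[high] = low
    return {sid: find(sid) for sid in stops}


# Hypothetical local offsets in meters around one interchange.
stops = {
    "L1_pantitlan": (0.0, 0.0),
    "L5_pantitlan": (30.0, 10.0),
    "L9_pantitlan": (55.0, 20.0),
    "LA_pantitlan": (20.0, -25.0),
    "L1_zaragoza": (-1400.0, 300.0),
}
snapped = snap_stops(stops, eps_m=50.0)
```

Note that `L9_pantitlan` is more than 50 m from `L1_pantitlan` but still joins its cluster through `L5_pantitlan`. That transitive merging is what turns pairs into clusters, and it is also why ε has to stay small.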

Building the Directed Graph

With phantom nodes resolved, each transit line becomes a sequence of directed edges in a networkx.DiGraph:

Nodes: transit stops — attributes include coordinates, system (Metro, Metrobús, Cablebús…), and line identifier
Edges: service segments between consecutive stops — weighted by scheduled travel time in seconds

The resulting graph covers the full CDMX multimodal network: Metro, Metrobús, Cablebús, Tren Ligero, Trolebús, Mexicable, and Interurbano.
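To make the node and edge structure concrete, here is a small sketch that builds the same shape with plain dictionaries standing in for networkx.DiGraph. The input format, the function name, and the travel times are assumptions made up for the example, and adding each segment in both directions is a simplification.

```python
def build_digraph(lines):
    """Build node attributes and weighted directed edges from stop sequences.

    lines: dict of line_id -> ordered list of (stop_id, seconds_to_next_stop).
    The time attached to a line's last stop is ignored.
    Returns (nodes, edges) where edges[a][b] is the travel time a -> b in seconds.
    """
    nodes, edges = {}, {}
    for line_id, sequence in lines.items():
        for stop_id, _ in sequence:
            nodes.setdefault(stop_id, {"lines": set()})["lines"].add(line_id)
            edges.setdefault(stop_id, {})
        for (a, secs), (b, _) in zip(sequence, sequence[1:]):
            edges[a][b] = secs  # outbound direction
            edges[b][a] = secs  # return direction, assumed symmetric here
    return nodes, edges


# Stop order follows the real lines; the travel times are invented.
lines = {
    "Metro 1": [("Pantitlán", 120), ("Zaragoza", 110), ("Gómez Farías", 0)],
    "Metro 5": [("Pantitlán", 130), ("Hangares", 0)],
}
nodes, edges = build_digraph(lines)
```

Because both lines use the same "Pantitlán" id, the node carries two line identifiers and has edges into both lines, which is what makes it usable as a transfer.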

Why This Matters

A correctly built topological graph is the prerequisite for every subsequent analysis in the VFT Model: computing betweenness centrality, measuring direct-route indices (DI), and simulating ring-corridor scenarios. A single unresolved phantom node removes a transfer from the graph, so every path that should pass through that hub is either rerouted or missing.
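The effect on path-finding is easy to reproduce on a toy graph. The sketch below, with made-up node ids, runs a breadth-first reachability check before and after merging two phantom nodes. It illustrates the failure mode and is not code from the VFT notebooks.

```python
from collections import deque


def reachable(edges, start):
    """Return the set of nodes reachable from start along directed edges."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def relabel(edges, mapping):
    """Rewrite node ids through mapping, merging edges of collapsed nodes."""
    merged = {}
    for src, targets in edges.items():
        bucket = merged.setdefault(mapping.get(src, src), set())
        bucket.update(mapping.get(t, t) for t in targets)
    return merged


# Two lines meeting at one station that the raw data records twice.
raw = {
    "zaragoza": {"pantitlan_L1"},
    "pantitlan_L1": {"zaragoza"},
    "pantitlan_L5": {"hangares"},
    "hangares": {"pantitlan_L5"},
}
snapped = relabel(raw, {"pantitlan_L1": "pantitlan", "pantitlan_L5": "pantitlan"})

before = reachable(raw, "zaragoza")
after = reachable(snapped, "zaragoza")
```

In `before`, "hangares" is unreachable from "zaragoza"; in `after`, the merged "pantitlan" node connects the two lines.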

Report 00 establishes the foundation that Reports 01–05 build on.

This post summarizes findings from the VFT Model research project, part of the TAICMAM thesis at UNAM. For the full notebook with code, visualizations, and methodology detail, see the VFT Model GitHub Pages.
