A model at a major ride-sharing company once shipped with a feature computed from future trip data. Offline AUC looked exceptional. Production metrics quietly degraded for six weeks before anyone noticed. That’s the defining property of training-pipeline bugs — they don’t announce themselves. This article walks a single training row through the five DAG stages that separate a notebook from a production system, naming the specific bug each stage is built to prevent.
Part 3
The Core Insight First
A training pipeline is not a training script. A training script is model.fit() with a shell scheduler around it. A training pipeline is five stages of a directed acyclic graph, each containerized, each idempotent, each versioned, each instrumented for the moment something goes wrong at hour three of a four-hour job.
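To make "idempotent" and "versioned" concrete, here is a minimal sketch in plain Python. It is not the article's pipeline: the helper `run_stage`, the two stage names (`clean`, `featurize`), and the JSON-file cache are all hypothetical stand-ins for a real orchestrator and artifact store. The idea it illustrates is that each stage's output is keyed by its name, its code version, and its exact inputs, so a retry reuses the finished artifact instead of recomputing it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_stage(name, version, fn, inputs, cache_dir):
    """Run fn(inputs) at most once per (name, version, inputs).

    Returns (output, lineage_key, executed); executed is False when the
    output was served from an artifact written by an earlier run.
    """
    # The lineage key covers the stage's code version and its exact inputs.
    material = json.dumps(
        {"stage": name, "version": version, "inputs": inputs}, sort_keys=True
    )
    key = hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]
    out_path = Path(cache_dir) / f"{name}-{key}.json"

    if out_path.exists():  # idempotent: a retry reuses the artifact
        return json.loads(out_path.read_text()), key, False

    output = fn(inputs)
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(output))
    tmp_path.replace(out_path)  # rename into place; no half-written artifact
    return output, key, True


# Two hypothetical stages, standing in for real pipeline steps.
def drop_nulls(rows):
    return [r for r in rows if r["fare"] is not None]


def add_fare_per_km(rows):
    return [{**r, "fare_per_km": r["fare"] / r["km"]} for r in rows]


def run_pipeline(rows, cache_dir):
    """Chain the stages; each stage consumes the previous stage's output."""
    cleaned, k1, ran1 = run_stage("clean", "v1", drop_nulls, rows, cache_dir)
    feats, k2, ran2 = run_stage(
        "featurize", "v1", add_fare_per_km, cleaned, cache_dir
    )
    return feats, [k1, k2], [ran1, ran2]


if __name__ == "__main__":
    rows = [{"fare": 12.0, "km": 4.0}, {"fare": None, "km": 2.0}]
    with tempfile.TemporaryDirectory() as d:
        print(run_pipeline(rows, d))  # first run executes both stages
        print(run_pipeline(rows, d))  # rerun is served from cached artifacts
```

Bumping a stage's version string changes its key, which forces that stage and everything downstream of a changed output to rerun. The list of keys returned by `run_pipeline` is a crude form of lineage: it records which versioned artifacts produced the final rows.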
One definition, five stages, one lineage. The pipeline is the contract between your data team and your production system. Skip any stage and the bug it was built to prevent will cost you weeks to find, not minutes.
