Sizing AI/ML Projects: A Repeatable Method That Tracks Reality

Not the perfect estimate – a practical, repeatable methodology that has held up surprisingly well against what projects actually cost

Why I am sharing this

I have been sizing software, data, and AI/ML projects for the last 25 years. The technology keeps changing; the way estimates go wrong does not. Someone opens a document, lists the components, puts an hours guess next to each, sums the column, and sends it. By the time reality arrives, the number is fiction.

What follows is the method I keep coming back to for AI/ML work. I am not claiming it is the best one on the market. I am claiming something more useful: it is repeatable, it can be explained to a business lead and an ML engineer in the same sitting, and over many projects it has tracked the actual outcome surprisingly well.

That is a low bar that most estimates fail.

The method is deliberately conservative, and it fits in one line:

Business outcomes ► actionable outcomes ► model graphs by cadence ► unique model inventory ► size each model

Read it left to right. Models appear fourth, not first. Everything to the left of them is a business conversation; everything to the right is a sizing exercise. The discipline is to do the steps in order and not skip straight to the models, because the work to the left of them is what makes the number defensible later.

Start from the business KPI, not the model

The first step has no maths in it. With the client you pick the two or three KPIs to move — manufacturing first-pass yield, fleet fuel efficiency, warehouse pick accuracy, flock growth rate for a poultry producer. Nothing else gets sized until these are fixed, because they are the target everything works back from.

Then you work backwards. For each KPI ask one question: what concrete action or decision, taken during operation, would move it? These are the actionable outcomes — “pull a defective part”, “re-route a delivery”, “throttle a machine before it overheats”, “adjust feed for an underperforming flock”. A KPI never improves on its own; it improves because something acts.

Those acts are what the system has to produce.

This is the step teams skip, and it is the one that protects the estimate. A model that does not feed an actionable outcome is a model nobody asked for. If you cannot draw the line from a model back to an action back to a KPI, it should not be in the estimate.
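
The traceability rule can be checked mechanically. The sketch below is only an illustration: the `Action` and `Model` records, the `untraceable` helper, and the "sentiment scorer" entry are hypothetical names, not part of the method itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    kpi: str  # the agreed KPI this actionable outcome is meant to move


@dataclass(frozen=True)
class Model:
    name: str
    feeds: tuple  # names of the actionable outcomes this model feeds


def untraceable(models, actions, kpis):
    """Return the names of models with no line back to an agreed KPI."""
    action_to_kpi = {a.name: a.kpi for a in actions}
    orphans = []
    for m in models:
        # A model is traceable if at least one action it feeds maps to an agreed KPI.
        if not any(action_to_kpi.get(f) in kpis for f in m.feeds):
            orphans.append(m.name)
    return orphans


kpis = {"first-pass yield"}
actions = [Action("pull a defective part", "first-pass yield")]
models = [
    Model("defect classifier", ("pull a defective part",)),
    Model("sentiment scorer", ()),  # hypothetical model nobody asked for
]
print(untraceable(models, actions, kpis))  # ['sentiment scorer']
```

Anything this returns is a candidate to drop from the estimate before any sizing starts.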

Cadence turns the estimate into a schedule

Once you know the actionable outcomes, you draw how data flows to produce each one: data sources -> features -> models -> actionable outcome. You do not draw them on one page. You separate the graphs by how often the loop runs, because cadence drives latency, compute, and how much data you move.

Realtime — per event, sub-second
Perpetual — every few minutes
Daily
Weekly, per-cycle, or another period

A model that runs every minute and one that runs once a production cycle are different engineering problems even when the maths is identical. Keeping the cadences as separate graphs makes that explicit before it becomes a deployment surprise.

The graphs do one more thing: they expose dependencies. The arrows show which model feeds which. A model that consumes another’s output can only be built after it; models on separate branches can be built in parallel. That view turns a pile of estimates into a schedule — what runs concurrently, what waits, and where adding people actually shortens delivery. A chain of dependent models sets the critical path that caps how far you can compress the timeline.
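
To make the critical-path point concrete, here is a minimal sketch that finds the longest chain of dependent models. The function name, the hour figures, and the "throughput forecaster" model are assumptions for illustration; it also assumes every model named in `depends_on` has an entry in `effort`.

```python
from graphlib import TopologicalSorter


def critical_path(effort, depends_on):
    """Longest chain of dependent models, by summed effort.

    effort: {model: hours}; depends_on: {model: set of upstream models}.
    """
    graph = {m: depends_on.get(m, set()) for m in effort}
    finish = {}  # earliest finish time of each model
    prev = {}    # upstream model that determines that finish time
    for m in TopologicalSorter(graph).static_order():
        ups = depends_on.get(m, set())
        start = max((finish[u] for u in ups), default=0)
        prev[m] = max(ups, key=lambda u: finish[u], default=None)
        finish[m] = start + effort[m]
    last = max(finish, key=finish.get)
    total = finish[last]
    path = []
    while last is not None:
        path.append(last)
        last = prev[last]
    return total, path[::-1]


# Hypothetical hours, for illustration only.
effort = {"image correction": 40, "defect classifier": 80, "throughput forecaster": 60}
depends_on = {"defect classifier": {"image correction"}}
print(critical_path(effort, depends_on))
# (120, ['image correction', 'defect classifier'])
```

The summed effort here is 180 hours, but the dependent chain alone takes 120, so no amount of extra people brings delivery below that.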

The data contract is where projects actually slip

The left edge of every graph is a data source. Before sizing a single model, you catalogue every source the system will consume — each listed once, with what it yields and whether it is ready.

This is the least glamorous part of the method and the most important. Data sources dominate the data-prep and integration cost, and missing or unready data is the single most common reason a build runs late. The model architecture is rarely the bottleneck; the cost and the risk live in the data — calibration, labelling effort, missing or inconsistent records, timestamp drift, environment variation, and the long tail of edge cases.

So we treat the catalogue as a data contract agreed with the customer, not a one-sided assumption. For each source we capture where it comes from, what it measures, how it is obtained and how often, its format and volume, and — the load-bearing field — its readiness: accessible, labelled, clean, and consistent, or needs work first. Mark each source ready, partial, or missing. That single flag is the difference between a best-judgement guess and a committed estimate. Where a source later falls short of the contract, the dependent effort is at risk and gets re-sized openly, not silently absorbed.
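
One way to hold the contract is as a small record per source, with the readiness flag driving how each dependent estimate is labelled. This is a sketch under assumptions: the field names, the `estimate_status` helper, and the example sources and volumes are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Readiness(Enum):
    READY = "ready"
    PARTIAL = "partial"
    MISSING = "missing"


@dataclass(frozen=True)
class DataSource:
    name: str
    origin: str      # where it comes from
    measures: str    # what it measures
    cadence: str     # how it is obtained and how often
    fmt: str         # format
    volume: str      # volume
    readiness: Readiness


def estimate_status(model_sources, catalogue):
    """Label each model's estimate by the readiness of the sources it consumes."""
    status = {}
    for model, names in model_sources.items():
        all_ready = all(catalogue[n].readiness is Readiness.READY for n in names)
        status[model] = "committed" if all_ready else "best-judgement"
    return status


# Hypothetical catalogue entries, for illustration only.
catalogue = {
    "line cameras": DataSource("line cameras", "production line", "part images",
                               "per part", "JPEG", "50k images/day", Readiness.READY),
    "scrap log": DataSource("scrap log", "plant export", "rejected parts",
                            "daily", "CSV", "200 rows/day", Readiness.PARTIAL),
}
model_sources = {
    "image correction": ["line cameras"],
    "defect classifier": ["line cameras", "scrap log"],
}
print(estimate_status(model_sources, catalogue))
# {'image correction': 'committed', 'defect classifier': 'best-judgement'}
```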

Now, and only now, list the models

With the graphs drawn, collect every distinct model once. The same model often appears in several graphs; count it once. For each, record what it does in plain terms, its type, the actionable outcomes it feeds, the data it needs, its inference cadence, and roughly how much labelled data it takes to train versus how much it processes in operation.

This inventory is the bridge between the two halves of the estimate. The effort sizing uses the build attributes; the compute sizing uses the inference cadence and volumes. One record, two downstream uses.
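
Counting each model once is easy to get wrong by hand when the same model shows up in several cadence graphs. A minimal sketch, assuming each graph is reduced to a list of model names (the `unique_inventory` helper and the example graphs are hypothetical):

```python
def unique_inventory(graphs):
    """Collapse per-cadence graphs into one entry per model.

    graphs: {cadence: [model names in that graph]}
    returns: {model: sorted list of cadences it appears in}
    """
    inventory = {}
    for cadence, models in graphs.items():
        for model in models:
            inventory.setdefault(model, set()).add(cadence)
    return {model: sorted(cadences) for model, cadences in inventory.items()}


# Hypothetical graphs, for illustration only.
graphs = {
    "realtime": ["image correction", "defect classifier"],
    "daily": ["defect classifier", "scrap forecaster"],
}
inventory = unique_inventory(graphs)
print(len(inventory))                  # 3 models to size, not 4
print(inventory["defect classifier"])  # ['daily', 'realtime']
```

The model is sized once for effort, while every cadence it appears in still counts towards compute.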

Before any model gets sized, it needs a quality gate: one specific metric with a target value, not a vague sense of “good enough”. MAPE for a forecaster, mIoU or Dice for segmentation, precision/recall or F1 for a detector. The acronyms do not matter; agreeing one number and a target does. The target is the Minimal Viable Accuracy — the lowest value that still enables the actionable outcome, checked against ground truth agreed up front.

Here is the rule that saves projects: do not target above the state of the art. Beating the published best is research, not delivery — high-cost and not guaranteed. If an outcome genuinely needs above-standard accuracy, that is a client go/no-go, not a cost line you quietly absorb. Build to the gate, not to optimal. Some models can be rule-based and still pass.
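
A gate is small enough to write down as data. The sketch below assumes one metric and one target per model; the `QualityGate` class, its method names, and all the numbers are hypothetical and only illustrate the two checks: does a measured value pass, and does the target ask for more than the published best.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityGate:
    metric: str
    target: float                  # the Minimal Viable Accuracy for this model
    higher_is_better: bool = True  # False for error metrics such as MAPE

    def passes(self, measured: float) -> bool:
        """True when the measured value meets the agreed target."""
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target

    def is_research(self, state_of_the_art: float) -> bool:
        """True when the target asks for more than the published best."""
        if self.higher_is_better:
            return self.target > state_of_the_art
        return self.target < state_of_the_art


# Hypothetical targets and published results, for illustration only.
detector_gate = QualityGate("F1", target=0.90)
forecast_gate = QualityGate("MAPE", target=0.08, higher_is_better=False)

print(detector_gate.passes(0.92))       # True: build to the gate, then stop
print(detector_gate.is_research(0.95))  # False: target is within published results
print(forecast_gate.is_research(0.10))  # True: flag as a client go/no-go
```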

Sizing: relative, not absolute

Now you size — and even here you do not guess hours from a blank page. You estimate relative to an anchor, on a Fibonacci scale.

First, qualify an anchor: one model an experienced ML engineer can size confidently from past work, given a concrete effort and a t-shirt size — XS 1, S 2, M 3, L 5, XL 8, XXL 13. Then every other model is sized by a single question: is it the same as a qualified model, one step easier, or N steps harder? Each step moves one rung up or down the ladder.
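
The ladder arithmetic can be sketched in a few lines. Two things here are assumptions rather than part of the method as described: hours are taken to scale in proportion to Fibonacci points relative to the anchor, and a step that would fall off either end of the ladder is clamped to the last rung. The 120-hour anchor is a made-up figure.

```python
LADDER = [("XS", 1), ("S", 2), ("M", 3), ("L", 5), ("XL", 8), ("XXL", 13)]
SIZES = [name for name, _ in LADDER]
POINTS = dict(LADDER)


def relative_size(anchor_size, steps):
    """Move `steps` rungs from the anchor (negative = easier), clamped to the ladder."""
    i = SIZES.index(anchor_size) + steps
    return SIZES[max(0, min(i, len(SIZES) - 1))]


def hours_for(size, anchor_size, anchor_hours):
    """Assumption: effort scales with Fibonacci points relative to the anchor."""
    return anchor_hours * POINTS[size] / POINTS[anchor_size]


# Hypothetical anchor: an M model an engineer sizes confidently at 120 hours.
harder = relative_size("M", 2)   # two steps harder
easier = relative_size("M", -1)  # one step easier
print(harder, hours_for(harder, "M", 120))  # XL 320.0
print(easier, hours_for(easier, "M", 120))  # S 80.0
```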

Fibonacci is not decoration. Its widening gaps mirror how uncertainty grows with complexity — small models are predictable, large ones are not — which is exactly why it is the standard agile estimation scale (Mike Cohn, Mountain Goat Software). It captures uncertainty without pretending to a precision nobody has.

And you do not size one “build the model” number. You size each model across five workstreams, because that is where the hidden work lives.

Model development — architecture, training code, baselines, packaging
Experiments & feature engineering — feature design and selection (encodings, spectral, lag, interaction features), sweeps over settings and model choices, false-alarm tuning, robustness
Data prep and labelling — extraction, cleaning, annotation, QA, label reconciliation
Integration and pipeline — ingestion, storage, deployment, APIs, dashboards
Validation and field iteration — real-environment review, user feedback, drift checks, re-test

Each estimate is tagged Low, Med, or High confidence. Integration and validation depend most on data readiness, so they carry the widest error.

Feature engineering spans three of these on purpose: design and selection sit in Experiments, raw extraction and cleaning in Data prep, and the production feature pipeline in Integration. Sizing each part in exactly one bucket keeps it from being double-counted.

Four multipliers, acting on different things

A raw size is not the final number. Four multipliers adjust it, and the useful trick is that each acts on a different workstream, so they layer rather than blindly compound. The two complexity multipliers are Fibonacci-grade-driven — grades 2 / 3 / 5 / 8 divided by 3, so Standard (grade 3) = 1.0.

Model type scales development and experiments, because novelty is risk.

Dataset complexity scales data prep: Trivial ~0.67x, Standard 1.0x, Hybrid ~1.67x, Complex ~2.67x (Fibonacci grades 2/3/5/8 divided by 3)

Feature-engineering complexity scales experiments: Minimal ~0.67x (learned features, most vision/audio), Standard 1.0x (one-hot encoding, scaling, rolling stats, all standard techniques), Heavy ~1.67x (spectral + lag + interaction features, real selection search), Research-grade ~2.67x (novel construction + large combinatorial selection)

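Delivery tier scales the whole model. Because model type hits development and experiments, dataset complexity hits data prep, feature-engineering complexity hits experiments, and tier hits the whole, they layer cleanly. Experiments is the one workstream that takes two of them. You still sanity-check the composed total against a comparable past build. “Unique” is the research case: flag it, do not promise it.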
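
How the four multipliers land on the five workstreams can be sketched as below. The grade tables follow the 2/3/5/8-over-3 rule stated above; everything else is an assumption for illustration: the `apply_multipliers` helper, the workstream keys, the raw hours, and passing the model-type multiplier as a plain number.

```python
# Fibonacci grades 2 / 3 / 5 / 8 divided by 3, so Standard = 1.0.
DATASET = {"Trivial": 2 / 3, "Standard": 3 / 3, "Hybrid": 5 / 3, "Complex": 8 / 3}
FEATURE_ENG = {"Minimal": 2 / 3, "Standard": 3 / 3, "Heavy": 5 / 3, "Research-grade": 8 / 3}


def apply_multipliers(raw_hours, model_type=1.0, dataset="Standard",
                      feature_eng="Standard", tier=1.0):
    """Scale raw per-workstream hours; each multiplier hits its own workstreams."""
    scale = {
        "development": model_type,
        "experiments": model_type * FEATURE_ENG[feature_eng],  # takes two multipliers
        "data_prep": DATASET[dataset],
        "integration": 1.0,
        "validation": 1.0,
    }
    # Delivery tier is applied last, to the whole model.
    return {ws: hours * scale[ws] * tier for ws, hours in raw_hours.items()}


# Hypothetical raw Field-PoC hours per workstream.
raw = {"development": 20, "experiments": 30, "data_prep": 50,
       "integration": 20, "validation": 30}
adjusted = apply_multipliers(raw, model_type=1.0, feature_eng="Minimal")
print(round(sum(raw.values())), round(sum(adjusted.values())))  # 150 140
```

Only experiments moves in this example, from 30 to about 20 hours, which is the point of keeping the multipliers on separate workstreams.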

Three delivery tiers

A prototype and a deployed product are different jobs. Price each model at three tiers; size the Field PoC directly and scale the other two off it.

Demonstrator (~0.68x): just enough end to end on clean, curated data; shows the solution is likely viable; accuracy is not yet the goal
Field PoC (1x, baseline): integrated in the real environment; accurate enough that gains are visible and demonstrated on site; the sized anchor
Production pilot (~3.3x): hardened and dependable; full MLOps, monitoring, drift detection, retraining, A/B testing, HA, security, documentation
Basis: the production multiple tracks Brooks’ rule that a programming product costs roughly 3x a working program (The Mythical Man-Month) and the observation that ML code can be as little as ~5% of a mature ML system, the rest being data, serving, monitoring, and glue (Sculley et al., NeurIPS 2015)
Ratios, not invention: 0.68x and 3.3x sit close to Fibonacci rung ratios (S:M = 2/3 ≈ 0.667; twice L:M = 2 × 5/3 ≈ 3.33)
Quote a range: each tier total is a band, not a point; the single value is just the most likely point
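
Scaling the two other tiers off the Field PoC, and quoting each as a band, can be sketched like this. The tier multipliers are the ones listed above; the symmetric ±20% band is purely an assumption, since no band width is stated here, and `tier_totals` is a hypothetical helper.

```python
TIER_MULTIPLIER = {"demonstrator": 0.68, "field_poc": 1.0, "production_pilot": 3.3}


def tier_totals(field_poc_hours, band=0.20):
    """Return {tier: (low, point, high)} in hours.

    `band` is a hypothetical symmetric width; pick one that matches your own history.
    """
    totals = {}
    for tier, multiplier in TIER_MULTIPLIER.items():
        point = field_poc_hours * multiplier
        totals[tier] = (point * (1 - band), point, point * (1 + band))
    return totals


# Hypothetical Field-PoC total of 140 hours.
for tier, (low, point, high) in tier_totals(140).items():
    print(f"{tier}: {low:.0f} to {high:.0f} h (most likely {point:.0f} h)")
```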

One thread, end to end

It helps to watch one example travel the whole pipeline. Take a manufacturer wanting to cut scrap on a production line.

Business outcome: reduce scrap rate
Actionable outcome: flag a defective part so an operator pulls it before it ships
Graph and cadence: line cameras -> defect classifier -> “reject” flag; runs per part, every few seconds; depends on an upstream image-correction model, so it is built after it
Model record: defect classifier; type custom-standard; vision data; inference per part; about 2,000 labelled images to train
Quality gate: precision and recall high enough that an operator trusts the reject, set to approach published surface-defect results — not beyond them
Size: model dev S, experiments M, data prep L for the annotation, integration S, validation M -> a Field-PoC hour total; custom-standard ~1x, FE complexity Minimal ~0.67x (the CNN learns the features)
Tiers: that Field-PoC total is the baseline; the demonstrator is about 0.68x, production about 3.3x

One model, one unbroken line of sight from the KPI to the hours. Every model gets the same treatment, and they sum to a tier total.

The other half of the bill: run cost

Everything above sizes the build. There is a second number that surprises teams late: the run — the compute to operate the models in production, and which hardware serves it most cheaply. It uses the inference cadence and data attributes you already captured.

You compute the forward-pass FLOPs for one inference at a standard input size, multiply by how often the model runs to get monthly FLOPs, then express everything per unit of business — one production line, one 10 km square tile, one supplier. Get the unit wrong and nothing scales. Sum across models, multiply by units, and you have total monthly FLOPs for the whole case. Then you compare that demand against a fixed set of standardised compute stacks, so every project is costed the same way.
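
The demand side of that calculation is plain multiplication, as the sketch below shows. The FLOP count per inference, the one-part-every-five-seconds cadence, and the three production lines are hypothetical numbers; `case_monthly_flops` is an illustrative helper name.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # a 30-day month, for simplicity


def case_monthly_flops(models, units):
    """Total monthly FLOPs for the whole case.

    models: list of (flops_per_inference, inferences_per_unit_per_month)
    units:  number of business units (production lines, tiles, suppliers)
    """
    per_unit = sum(flops * runs for flops, runs in models)
    return per_unit * units


# Hypothetical: a classifier costing 4 GFLOPs per forward pass,
# run once per part, one part every 5 seconds, on each of 3 lines.
runs_per_line = SECONDS_PER_MONTH // 5  # 518,400 inferences per line per month
total = case_monthly_flops([(4_000_000_000, runs_per_line)], units=3)
print(f"{total:.3e} FLOPs per month")  # 6.221e+15 FLOPs per month
```

Changing the unit from "line" to "plant" changes `units` and the per-unit run count together, which is why the unit has to be fixed before anything is summed.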

Throughput is quoted in FP32 on purpose — it is the conservative floor. Production inference in FP16 or INT8 runs several times faster, so a fleet sized on FP32 carries built-in headroom; you treat quantisation as margin, not as a reason to down-size. Convert total monthly FLOPs into a required throughput, divide by a card’s usable throughput at a conservative 30% utilisation, round up, and you have a GPU count and a cost. Take the cheapest feasible option and add 25% contingency for the final figure. That 25% is the only buffer in the whole method, and it is explicit — nothing else is padded.
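
The supply side follows the same steps in code. This is a sketch under assumptions: the two compute stacks, their FP32 throughput and monthly prices are invented for illustration, the helper names are hypothetical, and load is assumed to be spread evenly over the month (a bursty workload would need sizing against its peak instead).

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600


def gpus_needed(total_monthly_flops, card_fp32_flops_per_s, utilisation=0.30):
    """GPU count for the demand, at a conservative utilisation, rounded up."""
    required = total_monthly_flops / SECONDS_PER_MONTH  # average FLOP/s needed
    usable = card_fp32_flops_per_s * utilisation
    return math.ceil(required / usable)


def monthly_run_cost(gpu_count, cost_per_gpu_month, contingency=0.25):
    """Cost with the one explicit buffer added on top."""
    return gpu_count * cost_per_gpu_month * (1 + contingency)


def cheapest_option(total_monthly_flops, stacks):
    """stacks: {name: (fp32_flops_per_s, cost_per_gpu_month)}"""
    costs = {
        name: monthly_run_cost(gpus_needed(total_monthly_flops, throughput), price)
        for name, (throughput, price) in stacks.items()
    }
    best = min(costs, key=costs.get)
    return best, costs[best]


# Hypothetical stacks and demand, for illustration only.
stacks = {"stack A": (10e12, 500), "stack B": (30e12, 1500)}
print(gpus_needed(5e19, 10e12), gpus_needed(5e19, 30e12))  # 7 3
print(cheapest_option(5e19, stacks))                        # ('stack A', 4375.0)
```

Here the faster card needs fewer GPUs but still costs more in total, which is the kind of result the fixed-stack comparison is meant to surface.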

What this method does not do

It is worth being honest about the edges.

It outputs hours, not price. Cost is hours times your rate card and role mix, worked out separately. Hardware, licences, hosting, and project management are not in these numbers unless stated
The estimate is best-judgement until the data is checked. A data-readiness assessment is what turns sizing into a commitment
Sizing is a data-science judgement, not a clerical one. It needs an engineer who knows the model type, the required accuracy, and the data. A spreadsheet cannot do it
The multipliers are rules of thumb anchored on past builds, not constants. Sanity-check the composed total against a comparable project every time
“Unique” models are research. Flag them as a go/no-go, never as a cost line you can simply absorb

Use it, adapt it

I will say again what this method is and is not. It is not the most sophisticated estimation framework you can find. It is a repeatable one: the same steps, in the same order, producing a number that two different readers can trust. Over 25 years of doing this, it has tracked reality far better than the list-of-models guess it replaced.

The principles underneath it are short. Be outcome-first — never size a model that does not trace back to an agreed KPI. Be conservative by default — surface the hidden work, do not assume the first version works. Keep every buffer explicit. And remember where the risk actually lives: not in the model, but in the data reality around it.

If your next estimate starts with a list of models, you are sizing from the wrong end. Agree the outcomes first, walk the steps in order, and the honest number follows.

If you need to know more, feel free to contact me

Konrad Jelen is a data scientist and CTO specializing in AI solutions for manufacturing, finance and market research