6 min read
Just now
--
Multi-agent systems break in interesting ways. After running 100,000 words through five chained LLM agents, here is what I wish someone had told me.
Multi-agent AI systems are everywhere right now, but most of what gets written about them is conceptual. Diagrams of agents handing off to other agents. Toy demos that work once on a clean input. Architectures that look elegant on slides and fall apart the moment you point them at production traffic.
I have been running a five-agent content generation pipeline in production for several months, built on N8N and OpenRouter, generating SEO content for a SaaS business across dozens of niche verticals. After processing somewhere north of 100,000 words and watching the system break in roughly seven different ways, here is what I have learned.
The architecture
The pipeline has five chained agents, each with a single responsibility:
- Keyword research agent. Takes a seed vertical (for example, “CRM for kitchen fitters”) and expands it into a cluster of related keywords with intent classification. Uses DeepSeek V3 because the task is high-volume and quality-tolerant.
- Outline agent. Takes the keyword cluster and the seed and produces a structured H2/H3 outline with target word counts per section. Uses Claude Sonnet for better structural reasoning.
- Drafting agent. Takes the outline section by section and generates draft content. Each section is a separate API call to keep context windows manageable and isolate failures. Uses DeepSeek V3 again for cost efficiency at this volume.
- Editing agent. Reviews the assembled draft for factual coherence, removes hallucinated statistics, tightens prose. This is the most expensive agent because quality at this stage determines whether the output ships or gets thrown away. Uses Claude Opus.
- Metadata agent. Generates SEO title, meta description, image prompt, schema markup recommendations. Uses GPT-4o-mini because the output structure is rigid and the task is well-defined.
The whole pipeline runs in N8N with HTTP nodes pointing at OpenRouter, which lets me swap models per agent without changing infrastructure.
Press enter or click to view image in full size
Lesson 1: context loss is the silent killer
The most pervasive failure mode in chained agent systems is context loss. Agent 1 produces output that Agent 2 interprets slightly differently than intended. By the time Agent 5 receives the artifact, it has drifted meaningfully from the original brief.
The naive fix is to pass the original brief through every stage. This works until your context windows balloon and your cost per article triples.
The fix that actually works is structured handoff schemas. Every agent emits and consumes JSON with explicitly defined fields. The drafting agent does not receive “the outline” as freeform text; it receives:
json
{
"section_id": "h2_03",
"heading": "Common attribution mistakes",
"target_word_count": 250,
"must_include_keywords": ["UTM persistence", "offline conversion"],
"tone": "practical, no hype",
"previous_section_summary": "Discussed why ad-platform attribution misses post-click revenue."
}Structured schemas force each agent to produce machine-readable output, which forces specificity, which kills drift.
Lesson 2: temperature is a per-agent decision
I shipped the first version of this pipeline with temperature 0.7 across all five agents. Output was creative, occasionally inspired, and frequently unreliable.
Get Alex Ashcroft’s stories in your inbox
Join Medium for free to get updates from this writer.
Remember me for faster sign in
The right setting varies wildly per agent:
- Keyword research: 0.3. You want consistent, structured output, not creative reinterpretation.
- Outline: 0.5. Some variation is good, structural correctness matters more.
- Drafting: 0.7. Creativity helps prose feel less formulaic.
- Editing: 0.2. You want conservative changes, not creative rewriting.
- Metadata: 0.4. Structured output with light variation across articles.
Getting these right took weeks of iteration. Run a single agent at a single temperature 50 times on the same input. Compare the distribution of outputs. If they cluster, your temperature is too low; if they vary wildly in quality, it’s too high.
Lesson 3: hallucinations compound across stages
A single agent hallucinating a statistic is recoverable. Five agents passing that statistic through escalating contexts is catastrophic. By the editing stage, the model treats “the average UK marketing agency spends £14,000 monthly on Google Ads” as established fact and writes confident prose around it.
The mitigation I landed on is a fact-extraction step between drafting and editing. The pipeline pulls every quantitative claim from the draft and routes it to a separate validation agent with explicit instructions:
You will receive a list of factual claims. For each claim, respond with:
- "verifiable": claim can be supported by public data
- "common_knowledge": claim is widely accepted, no specific source needed
- "unverifiable": specific number with no clear source
- "likely_hallucination": claim contradicts publicly available informationDo not attempt to verify. Only categorise.Anything tagged “unverifiable” or “likely_hallucination” gets rewritten by the editing agent into qualitative language. “Studies have shown” replaces “73 percent of agencies.” It is not glamorous but it makes the difference between content that holds up and content that embarrasses you.
Lesson 4: cost monitoring is non-negotiable
OpenRouter makes model switching easy, which means cost can grow silently. My pipeline drifted from roughly £0.18 per article to £0.84 per article over six weeks without my noticing, because I had toggled the drafting agent to a more expensive model for “just one test” and forgotten to revert it.
Two safeguards that prevent this:
Per-agent cost ceiling alerts. Each N8N workflow logs token usage and dollar cost per stage to a Supabase table. A weekly summary emails me the cost-per-article trend. The first time it spikes, I know within seven days, not seven weeks.
Model lock-in via environment variables. I no longer hardcode model names in N8N nodes. Each agent reads its model from a config table, which means rolling back a costly experiment is one row update, not a manual hunt through twelve workflows.
Lesson 5: the editing agent is where you spend most of your quality budget
If you have £10 of API budget per article, spend £6 of it on the editing agent. Not the drafting agent. Drafts can be mediocre and recoverable. Bad editing produces published mediocrity.
The editing agent should run with explicit instructions about what to remove (hallucinated stats, hedging language, AI tells like “in conclusion” and “in today’s fast-paced world”), what to preserve (structural integrity, key terms, voice), and what to enhance (specific examples, concrete numbers when verifiable, transition logic between sections).
In my current pipeline, the editing agent is the only stage running on Claude Opus. Every other stage runs on cheaper models. The output quality is materially better than running drafting on Opus and editing on a cheaper model, because editing is fundamentally a quality-control task and quality control is exactly what frontier models are best at.
When this approach makes sense
Multi-agent content pipelines work when you have a high volume of similar but distinct outputs to produce, when each piece needs to be coherent on its own, and when you can tolerate some variance in quality across outputs.
They are wrong when you need a single highly polished output, when the subject matter requires domain expertise no model has, or when the human edit pass is going to be larger than the model output anyway.
The pipeline I described above runs as part of Odal, powering ongoing SEO content for lead generation verticals. The total infrastructure cost across hundreds of articles is comfortably under £100 per month, including all model API costs and N8N hosting.
It is not magic. It is a five-stage system with five failure modes that took two months to stabilise. But once stable, it produces consistent, useful content at a marginal cost per article that nothing else I have tried comes close to.
If you are building something similar, the five lessons above will save you weeks. Especially the editing one. Always the editing one.
