How planners, multi-agent workflows, routing logic, and task coordination help AI agents operate at production scale
9 min read
May 1, 2026
--
Press enter or click to view image in full size
In Part 1, we built the foundation of an AI agent: defining purpose, designing prompts, and selecting the right model. Those decisions determine what your agent should think.
In Part 2A, we explored tools — the systems, APIs, databases, and workflows that allow an agent to take action. Those capabilities determine what your agent can do.
In Part 2B, we explored memory — the layer that helps agents learn from history, preserve context, and improve decisions over time. That determines what your agent can remember.
But even intelligent agents with tools and memory can still fail in production. Why?
Because real business work is rarely a single step. A simple maintenance alert may require inventory checks, technician assignment, production rescheduling, approval workflows, safety validation, vendor coordination, and final reporting.
The agent may have every required capability. But without coordination, everything breaks. Tasks happen in the wrong order. Critical steps are skipped. Systems conflict. Delays multiply.
Intelligence without orchestration creates chaos.
This is why modern AI systems need planners, routers, workflow logic, retry systems, state management, multi-agent collaboration, and decision control layers.
Orchestration ensures the right action happens at the right time, in the right sequence, under the right conditions.
Tools give agents power. Memory gives them context. Orchestration gives them discipline.
This is the execution layer that transforms isolated capabilities into reliable autonomous systems.
In Part 2C, we explore how production AI agents coordinate complex workflows across banking, healthcare, retail, logistics, manufacturing, and enterprise operations.
This is part of the series Building Production AI Agents: A Complete Architecture Guide, where we walk through an 8-step framework to take agents from concept to deployment, with practical patterns and examples across banking, healthcare, retail, manufacturing, and beyond.
Step 6: Orchestration
Orchestration coordinates workflows, routes requests, handles errors, and manages agent-to-agent communication. It transforms individual tools and memory into coherent autonomous behavior.
Routing and Decision Trees
Routing determines which tools to call, in which order, based on context and goals.
A banking fraud agent follows a routing tree: Check transaction amount. If over $10,000, immediately flag for review. Otherwise, check merchant category. If high-risk category, run deep analysis. Otherwise, check customer behavior. If inconsistent with patterns, investigate further. Otherwise, approve.
async def route_fraud_check(transaction: dict) -> str:
"""Route fraud check based on transaction characteristics.""" # High-value transactions always get review
if transaction["amount"] > 10000:
return "manual_review"
# Check merchant risk category
merchant_data = await check_merchant_reputation(
transaction["merchant_id"]
)
if merchant_data["risk_score"] > 0.7:
return "deep_analysis"
# Check customer behavior consistency
behavior_analysis = await analyze_customer_behavior(
transaction["customer_id"],
transaction
)
if behavior_analysis["anomaly_score"] > 0.5:
return "investigation"
# Low-risk transaction
return "approve"
A healthcare triage agent routes based on symptoms: Red flag symptoms route immediately to emergency protocol. Acute symptoms route to urgent care assessment. Chronic symptoms route to primary care scheduling. Administrative questions route to simple information lookup.
Routing can be rule-based (if-then logic) or model-based (ML classifier determines routing). Rule-based routing is more predictable and auditable. Model-based routing handles more complex patterns.
Triggers and Events
Triggers initiate agent actions automatically based on events: time-based schedules, threshold alerts, external signals, or system events.
A manufacturing predictive maintenance agent monitors sensor data continuously. When vibration exceeds threshold, trigger anomaly investigation workflow. When failure probability exceeds 80%, trigger maintenance scheduling workflow. Daily at midnight, trigger batch analysis of all equipment health metrics.
class TriggerOrchestrator:
def __init__(self):
self.triggers = {} def register_threshold_trigger(
self,
metric: str,
threshold: float,
action: callable
):
"""Register trigger for metric threshold."""
self.triggers[metric] = {
"type": "threshold",
"threshold": threshold,
"action": action
}
def register_scheduled_trigger(
self,
schedule: str,
action: callable
):
"""Register time-based trigger."""
self.triggers[schedule] = {
"type": "scheduled",
"action": action
}
async def check_triggers(self, metrics: dict):
"""Evaluate triggers and execute actions."""
for metric, value in metrics.items():
if metric in self.triggers:
trigger = self.triggers[metric]
if trigger["type"] == "threshold":
if value > trigger["threshold"]:
await trigger["action"](metric, value)
A retail pricing agent triggers repricing workflows when competitors change prices, when inventory ages past threshold, or on scheduled optimization runs three times daily.
An agriculture agent triggers irrigation when soil moisture drops below threshold, triggers pest alerts when disease detected in satellite imagery, and triggers harvest planning when crop maturity estimated at 90%.
Parameter Management
Orchestration requires configuring thresholds, timeouts, priorities, and operational parameters. These parameters determine agent behavior and should be tunable without code changes.
from dataclasses import dataclass@dataclass
class OrchestrationConfig:
# Fraud detection thresholds
high_value_threshold: float = 10000.0
merchant_risk_threshold: float = 0.7
anomaly_score_threshold: float = 0.5 # Timeouts
api_timeout_seconds: float = 10.0
database_timeout_seconds: float = 5.0
# Retry configuration
max_retries: int = 3
retry_backoff_seconds: float = 2.0
# Rate limits
max_requests_per_minute: int = 100
# Resource limits
max_concurrent_investigations: int = 50
Store configurations in environment variables, configuration files, or feature flag systems. This enables tuning agent behavior in production without deployments.
For healthcare agents, parameters include triage score thresholds, appointment booking windows, and escalation criteria. For manufacturing agents, parameters include sensor thresholds, maintenance scheduling constraints, and quality tolerance levels.
Message Queues and Async Processing
Many workflows do not need synchronous responses. Fraud investigations, batch processing, report generation, and data analysis can run asynchronously.
Message queues decouple request submission from processing. A retail agent receives inventory optimization request, puts it in queue, returns confirmation to user. Background worker processes queue, runs optimization, and notifies on completion.
import asyncio
from typing import Callableclass MessageQueue:
def __init__(self):
self.queue = asyncio.Queue()
self.workers = [] async def enqueue(self, task: dict):
"""Add task to queue."""
await self.queue.put(task)
async def worker(self, process_func: Callable):
"""Process tasks from queue."""
while True:
task = await self.queue.get()
try:
await process_func(task)
except Exception as e:
logger.error(f"Task processing failed: {e}")
finally:
self.queue.task_done()
def start_workers(self, process_func: Callable, num_workers: int = 3):
"""Start background workers."""
for _ in range(num_workers):
worker = asyncio.create_task(self.worker(process_func))
self.workers.append(worker)
Aviation maintenance agents use message queues for scheduling optimization. Maintenance requests are queued, batch optimizer runs periodically to find optimal scheduling considering aircraft availability, crew qualifications, parts availability, and production impact.
Banking fraud agents use queues for investigation workflow. Suspicious transactions trigger investigations added to queue. Investigation workers process queue, gathering data, running analysis, generating reports, and escalating as needed.
Agent-to-Agent Communication
Complex systems benefit from multiple specialized agents coordinating together. Each agent handles a specific domain with its own prompts, tools, and models.
A comprehensive retail system might include separate agents for inventory optimization, pricing strategy, customer service, fraud detection, and supplier management. These agents communicate through shared message bus or direct API calls.
class AgentCoordinator:
def __init__(self):
self.agents = {} def register_agent(self, name: str, agent: Any):
"""Register specialized agent."""
self.agents[name] = agent
async def delegate_task(self, agent_name: str, task: dict) -> dict:
"""Delegate task to specialized agent."""
if agent_name not in self.agents:
raise ValueError(f"Unknown agent: {agent_name}")
agent = self.agents[agent_name]
result = await agent.process(task)
return result
async def coordinate_workflow(self, workflow: dict) -> dict:
"""Execute multi-agent workflow."""
results = {}
for step in workflow["steps"]:
agent_name = step["agent"]
task = step["task"]
# Add previous results to context
task["context"] = results
# Delegate to agent
result = await self.delegate_task(agent_name, task)
results[step["name"]] = result
return results
A banking fraud system coordinates multiple agents: transaction analyzer identifies anomalies, merchant evaluator assesses merchant risk, customer behavior analyst checks consistency, compliance checker verifies regulatory requirements, and case manager orchestrates overall investigation.
Healthcare systems coordinate triage agent, scheduling agent, clinical decision support agent, and patient communication agent. Each handles its domain with appropriate tools and knowledge.
Error Handling and Recovery
Production orchestration requires comprehensive error handling. Services fail. APIs timeout. Databases become unavailable. Networks partition.
Implement retry logic with exponential backoff for transient failures. Circuit breakers stop calling failing services. Fallback logic provides degraded functionality when systems are unavailable.
class CircuitBreaker:
def __init__(self, failure_threshold: int = 5, timeout: int = 60):
self.failure_count = 0
self.failure_threshold = failure_threshold
self.timeout = timeout
self.last_failure_time = None
self.state = "closed" # closed, open, half_open async def call(self, func: Callable, *args, **kwargs):
"""Call function with circuit breaker protection."""
if self.state == "open":
if time.time() - self.last_failure_time > self.timeout:
self.state = "half_open"
else:
raise Exception("Circuit breaker open")
try:
result = await func(*args, **kwargs)
if self.state == "half_open":
self.state = "closed"
self.failure_count = 0
return result
except Exception as e:
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = "open"
raise e
For critical workflows, implement compensating transactions. If multi-step process fails midway, roll back completed steps. If payment processing fails after inventory reservation, release the reserved inventory.
Log all errors with context for debugging. Include request parameters, system state, and error details. This enables root cause analysis and system improvement.
Multi-Industry Orchestration Examples
Banking: Fraud Investigation Workflow
- Receive transaction alert from monitoring system (trigger)
- Route based on amount and merchant category (routing)
- Gather data in parallel: transaction history, merchant reputation, customer behavior (parallel tool calls)
- Calculate fraud risk score (processing)
- If high risk, create investigation case and escalate to analyst (routing)
- If medium risk, flag for batch review (message queue)
- If low risk, approve transaction and log decision (completion)
- Update fraud models with outcome (learning loop)
Error handling: If merchant reputation API fails, proceed with available data and flag for manual review. If case management system unavailable, queue case creation for retry.
Get Raj kumar’s stories in your inbox
Join Medium for free to get updates from this writer.
Remember me for faster sign in
Retail: Inventory Optimization Workflow
- Scheduled trigger runs optimization at 6 AM daily
- Retrieve current inventory levels across all locations (SQL query)
- Retrieve sales forecasts from demand prediction agent (agent delegation)
- For each SKU, calculate optimal stock levels and pricing (batch processing)
- Check supplier APIs for lead times and availability (parallel API calls)
- Generate purchase orders for items below reorder point (tool call)
- Calculate markdown pricing for slow-moving items (processing)
- Submit pricing updates to POS systems (API call)
- Generate daily inventory report for management (file creation)
Error handling: If supplier API unavailable, use cached lead times. If POS system rejects pricing update, queue for retry and alert pricing team.
Healthcare: Patient Triage Workflow
- Patient initiates triage conversation (trigger)
- Collect symptoms through structured questionnaire (episodic memory)
- Check for red flag symptoms (routing)
- If red flags detected, immediately classify as emergency and route to emergency protocol (critical path)
- Otherwise, retrieve patient medical history (SQL query)
- Search clinical guidelines for symptom assessment (vector database)
- Calculate triage score using approved algorithm (processing)
- Route to appropriate care level based on score (routing)
- Check appointment availability (API call)
- Book appointment if appropriate (tool call)
- Send confirmation and instructions to patient (communication)
Error handling: If EHR system unavailable, proceed with triage based on patient-reported information and flag for manual chart review. Never delay emergency care due to system failures.
Manufacturing: Predictive Maintenance Workflow
- Continuous monitoring of equipment sensor data (trigger)
- When anomaly detected, analyze historical maintenance records (SQL query)
- Run failure prediction model on sensor data (ML model call)
- If failure probability exceeds threshold, initiate maintenance planning (routing)
- Check parts inventory and order if needed (API calls)
- Coordinate with production scheduling agent for maintenance window (agent delegation)
- Find qualified maintenance crew availability (database query)
- Create maintenance work order with specifications (tool call)
- Assign crew and schedule maintenance (orchestration)
- Monitor execution and update equipment logs (tracking)
Error handling: If parts unavailable, escalate to procurement team and flag as urgent. If no qualified crew available, contact external service provider. Never defer safety-critical maintenance.
Agriculture: Crop Disease Response Workflow
- Daily satellite imagery analysis detects anomaly (trigger)
- Correlate with ground sensor data (data fusion)
- Search crop disease database for symptom match (vector database)
- If disease match found, retrieve treatment recommendations (knowledge retrieval)
- Check weather forecast for treatment timing (API call)
- Calculate treatment cost and expected benefit (analysis)
- Generate treatment recommendation with specific location and timing (report generation)
- Send SMS alert to farmer with actionable instructions (communication)
- Schedule follow-up imagery analysis in 7 days (future trigger)
- Update disease tracking database (database update)
Error handling: If satellite imagery unavailable, rely on ground sensor data. If disease database unreachable, recommend general inspection and escalate to agricultural extension service. Always provide actionable guidance even with incomplete data.
Closing Thoughts: Orchestration Patterns for Reliable Autonomous AI
Tools provide capability. Memory provides context. Orchestration transforms both into dependable execution.
This is the layer where AI systems stop behaving like isolated assistants and start operating like reliable business engines. It ensures the right task happens in the right order, with the right controls, under the right conditions.
Without orchestration, capability becomes chaos. With orchestration, complexity becomes scalable.
The right orchestration model depends on your environment. A lightweight internal workflow may only need simple routing logic. A production-grade enterprise system may require planners, retries, approval chains, fallback paths, multi-agent collaboration, monitoring, and governance controls.
Start simple. Scale only when complexity demands it.
Over-engineered systems fail as often as under-designed ones.
In Part 3, we move from infrastructure into production reality: Building AI Agents Part 3: User Interface, Testing Strategy, and Framework Selection.
You’ll learn how to design AI experiences users trust, evaluate agent performance before failures reach production, and choose frameworks that match your business goals, technical constraints, and industry needs.
Great AI agents are not just intelligent. They are usable, measurable, and deployable.
If this series is helping you, I’d genuinely appreciate your support. Clap, comment, and share so more builders can learn how real AI systems are created.
What orchestration patterns are working for your use case today — workflow engines, planners, routers, multi-agent systems, or custom logic? Share your thoughts below.
This is part of the series Building Production AI Agents: A Complete Architecture Guide — an 8-step path from architecture to production deployment, with practical examples across banking, healthcare, retail, manufacturing, and beyond.
Follow me for Part 3: Taking your AI agent from prototype to production.
