We audited 23 enterprise agentic AI projects in early 2025. Nineteen used supervisor as their primary orchestration pattern. Fourteen of those nineteen had workloads where 60% or more of tasks followed a deterministic sequence that would have been dramatically faster and cheaper as a pipeline. Those teams were paying 2-4x token cost penalties and seeing 2-3x slower response times for no reason other than defaulting to the pattern with the best GitHub documentation.
Orchestration pattern selection is the highest-leverage architectural decision in a multi-agent system. Get it wrong and no amount of prompt engineering recovers the cost and latency you've burned into the design. Get it right and the same underlying models deliver dramatically different business outcomes.
We tested four patterns against compliance review workloads using Amazon Bedrock. Here's what the data actually shows.
The Four Patterns and When Each Wins
| Dimension | Supervisor | Pipeline | Debate | Broadcast |
|---|---|---|---|---|
| Task determinism | Low (dynamic) | High (fixed steps) | Low (ambiguous) | Medium (known sources) |
| Latency | Slowest (serial + reasoning) | Fastest for sequential | 60–80% overhead | Fast (parallel, bounded by slowest) |
| Token cost | High (coordinator overhead) | Lowest | 2–3× baseline | Proportional to agent count |
| Error tolerance | Medium | Low (propagation risk) | Highest accuracy | Medium (aggregation dependent) |
| Scaling ceiling | 5 agents before bottleneck | Unlimited (linear chain) | 2–4 debaters max | 10+ agents feasible |
| Best for | Dynamic triage, exception handling | Document processing, ETL | High-stakes decisions | Multi-source research, enrichment |
Pattern 1: Supervisor — Amazon Bedrock Agent with Action Groups
The supervisor pattern uses a coordinator agent that dynamically selects which specialist agents to invoke based on the task at hand. On Amazon Bedrock, this maps directly to a Bedrock Agent with Action Groups, where each action group represents a specialist capability.
Best for: Tasks where the processing path isn't known in advance — exception handling, dynamic triage, cases where the agent must reason about which sub-task to execute next.
Key weakness: Past five concurrent delegations, the coordinator becomes a serial bottleneck. We measured 4x response time spikes when supervisor agents tried to manage more than five parallel specialists.
```python
import boto3
import json

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')


def invoke_supervisor_agent(
    agent_id: str,
    agent_alias_id: str,
    session_id: str,
    task: str
) -> str:
    """
    Invoke a Bedrock Agent acting as supervisor.
    The agent's action groups define the available specialist capabilities.
    """
    response = bedrock_agent_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=task,
        enableTrace=True  # Capture reasoning chain for observability
    )
    completion = ""
    for event in response.get("completion"):
        if "chunk" in event:
            completion += event["chunk"]["bytes"].decode()
    return completion


# Cap delegations per coordinator invocation to avoid bottlenecks
MAX_DELEGATIONS = 5
```
Recommended guardrail: Cap token budgets per delegation by setting max tokens in the agent's foundation model configuration. Uncapped supervisor agents can run up $40-60 in token costs on complex multi-hop reasoning before returning a result.
Pattern 2: Pipeline — Amazon Bedrock Flows with Timeout-Based Lane Splitting
The pipeline pattern processes tasks through a fixed sequence of specialized agents, where each agent's output becomes the next agent's input. On Amazon Bedrock, this maps to Bedrock Flows (Prompt Flows) for managing the sequential handoffs, with AWS Step Functions handling conditional routing between lanes.
Performance: 3.2x throughput improvement versus supervisor for contract review — 14.6 seconds per document versus 47 seconds with a supervisor coordinator.
Failure mode: One slow agent blocks all downstream processing. Implement timeout-based lane splitting via Step Functions to route slow cases to an alternative path rather than letting them block the queue.
```python
import boto3
import json
import time

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')
sfn = boto3.client('stepfunctions', region_name='us-east-1')


def run_pipeline_flow(flow_id: str, flow_alias_id: str, input_data: dict) -> dict:
    """
    Execute a Bedrock Flow (pipeline) with timeout detection.
    Tags the result with a lane so the caller can hand slow cases
    to the Step Functions slow lane via route_to_step_function.
    """
    start = time.time()
    response = bedrock_agent_runtime.invoke_flow(
        flowIdentifier=flow_id,
        flowAliasIdentifier=flow_alias_id,
        inputs=[{
            "content": {"document": json.dumps(input_data)},
            "nodeName": "FlowInputNode",
            "nodeOutputName": "document"
        }]
    )
    result = {}
    for event in response.get("responseStream"):
        if "flowOutputEvent" in event:
            result = json.loads(event["flowOutputEvent"]["content"]["document"])
    # Elapsed time is measured after the flow returns, so this labels
    # slow cases rather than interrupting them
    elapsed = time.time() - start
    if elapsed > 10.0:
        # Route slow cases to supervisor fallback via Step Functions
        result["lane"] = "slow"
        result["elapsed_ms"] = int(elapsed * 1000)
    else:
        result["lane"] = "fast"
    return result


def route_to_step_function(state_machine_arn: str, payload: dict) -> str:
    """Start a Step Functions execution and return its ARN."""
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload)
    )
    return response["executionArn"]
```
The pipeline wins whenever tasks follow predictable sequences. Contract review is a canonical example: extract clauses, classify clause types, score risk, generate summary. Each step has a known input schema and produces a known output schema. There's no dynamic replanning required. Forcing this into a supervisor pattern costs you 47 seconds instead of 14.6 seconds per document, every document, at scale.
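The timeout check in `run_pipeline_flow` runs only after the flow has finished, so it can label a slow document but cannot unblock the queue behind it. A minimal sketch of a hard deadline follows, assuming the fast lane can be wrapped in a plain callable. `run_with_lane_split`, `quick`, and `slow` are hypothetical names; the two stand-in functions replace the real Bedrock call so the example runs without AWS access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


def run_with_lane_split(
    fast_lane: Callable[[dict], dict],
    payload: dict,
    timeout_s: float = 10.0,
) -> dict:
    """Run the fast lane with a deadline; hand back a slow-lane ticket on timeout."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fast_lane, payload)
    try:
        result = dict(future.result(timeout=timeout_s))
        result["lane"] = "fast"
    except FutureTimeout:
        # The caller can pass this payload to the slow-lane state machine
        result = {"lane": "slow", "payload": payload}
    finally:
        # Do not wait for a stuck worker; it finishes in the background
        executor.shutdown(wait=False)
    return result


def quick(doc: dict) -> dict:
    return {"summary": doc["text"].upper()}


def slow(doc: dict) -> dict:
    time.sleep(0.3)  # simulates a document that stalls an extraction step
    return {"summary": "late"}


print(run_with_lane_split(quick, {"text": "nda"}))
print(run_with_lane_split(slow, {"text": "msa"}, timeout_s=0.05)["lane"])
```

A Python thread cannot be cancelled once it has started, so the abandoned call still consumes tokens until it returns. The deadline protects downstream latency, not cost.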
Pattern 3: Debate — Cross-Model on Amazon Bedrock InvokeModel
The debate pattern fans out the same task to two agents built on different model families, then uses a third model as a judge to resolve disagreements. The cross-model approach is critical: using the same model family for both debaters drops the disagreement rate from 23% to 8%, which significantly cuts the accuracy benefit.
Accuracy improvement: 34% fewer misclassifications than the single-agent baseline (31 versus 47 across 600 ambiguous clauses).
Cost: Token usage roughly triples. Latency increases 60–80%. Use debate only where the error cost justifies it: high-stakes decisions, fraud detection, compliance determinations.
```python
import asyncio
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')


def invoke_model_sync(model_id: str, prompt: str, document: str) -> str:
    """Synchronous Bedrock InvokeModel call for thread pool execution."""
    if "anthropic" in model_id:
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2000,
            "messages": [{"role": "user", "content": f"{prompt}\n\n{document}"}]
        }
        response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
        return json.loads(response["body"].read())["content"][0]["text"]
    else:
        # Amazon Nova / other models
        body = {
            "messages": [{"role": "user", "content": [{"text": f"{prompt}\n\n{document}"}]}]
        }
        response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
        return json.loads(response["body"].read())["output"]["message"]["content"][0]["text"]


async def run_debate(document: str, debater_prompt_a: str, debater_prompt_b: str, judge_prompt: str) -> dict:
    """
    Fan out to two different model families for cross-model debate.
    Cross-model disagreement rate: 23% (vs 8% same-family).
    Judge resolves correctly 89% of the time.
    """
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=2) as executor:
        # Different model families reduce shared-bias false agreements
        analysis_a, analysis_b = await asyncio.gather(
            loop.run_in_executor(
                executor, invoke_model_sync,
                "anthropic.claude-3-5-sonnet-20241022-v2:0",
                debater_prompt_a, document
            ),
            loop.run_in_executor(
                executor, invoke_model_sync,
                "amazon.nova-pro-v1:0",
                debater_prompt_b, document
            )
        )
        # Judge with a third model to avoid bias. The model ID is kept as given
        # in the source; confirm it against the Bedrock model catalog.
        # Run in the executor so the blocking call does not stall the event loop.
        judgment = await loop.run_in_executor(
            executor, invoke_model_sync,
            "anthropic.claude-opus-4-7:0",
            judge_prompt,
            f"Analysis A:\n{analysis_a}\n\nAnalysis B:\n{analysis_b}"
        )
    return {
        "judgment": judgment,
        "analysis_a": analysis_a,
        "analysis_b": analysis_b,
        "models_used": [
            "claude-3-5-sonnet-v2",
            "nova-pro-v1",
            "claude-opus-4-7 (judge)"
        ]
    }
```
For an insurance claims system processing 40,000 claims per month, the debate tier on claims over $50,000 prevented an estimated $380,000 in annual overpayments. The 2-4 minute processing time for that 5% of high-value claims was economically justified by the error cost prevented.
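The economic argument can be made concrete with the figures quoted above. The sketch below derives how many claims reach the debate tier each year and how much extra processing cost per claim the prevented overpayments would cover. The function name is hypothetical, and the actual per-claim cost of the debate tier is not given in the source, so only the break-even ceiling is computed.

```python
CLAIMS_PER_MONTH = 40_000
DEBATE_SHARE = 0.05                      # share of claims routed to debate
PREVENTED_OVERPAYMENTS_PER_YEAR = 380_000  # USD, estimate quoted above


def debate_break_even(claims_per_month: int, debate_share: float, prevented_per_year: float) -> tuple:
    """Return (debate claims per year, break-even extra cost per debated claim in USD)."""
    debate_claims_per_year = round(claims_per_month * debate_share * 12)
    return debate_claims_per_year, prevented_per_year / debate_claims_per_year


claims, ceiling = debate_break_even(CLAIMS_PER_MONTH, DEBATE_SHARE, PREVENTED_OVERPAYMENTS_PER_YEAR)
print(claims)             # 24000
print(round(ceiling, 2))  # 15.83
```

As long as the added token and latency cost of debating a claim stays below roughly $15.83, the tier pays for itself on these numbers.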
Pattern 4: Broadcast — Parallel Bedrock Agents via Lambda Fanout
The broadcast pattern dispatches the same query to multiple specialist agents simultaneously, then aggregates their outputs. On Amazon Bedrock, this combines Bedrock Agents for each specialist with AWS Lambda for the fanout and aggregation layer.
Real-world result: Prospect research system broadcasting to six specialist agents simultaneously — LinkedIn signals, SEC filings, news sentiment, patent activity, Glassdoor culture signals, CRM history — reduced processing from 12 minutes to 90 seconds.
```python
import boto3
from concurrent.futures import ThreadPoolExecutor, as_completed

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

SPECIALIST_AGENTS = {
    "linkedin": {"agent_id": "AGENT_ID_1", "alias_id": "ALIAS_1"},
    "sec_filings": {"agent_id": "AGENT_ID_2", "alias_id": "ALIAS_2"},
    "news": {"agent_id": "AGENT_ID_3", "alias_id": "ALIAS_3"},
    "patents": {"agent_id": "AGENT_ID_4", "alias_id": "ALIAS_4"},
    "glassdoor": {"agent_id": "AGENT_ID_5", "alias_id": "ALIAS_5"},
    "crm": {"agent_id": "AGENT_ID_6", "alias_id": "ALIAS_6"},
}


def invoke_specialist(specialist_name: str, config: dict, query: str, session_id: str) -> tuple[str, str]:
    """Invoke one specialist agent and return (name, full text response)."""
    response = bedrock_agent_runtime.invoke_agent(
        agentId=config["agent_id"],
        agentAliasId=config["alias_id"],
        sessionId=f"{session_id}-{specialist_name}",
        inputText=query
    )
    result = ""
    for event in response.get("completion"):
        if "chunk" in event:
            result += event["chunk"]["bytes"].decode()
    return specialist_name, result


def broadcast_and_aggregate(query: str, session_id: str) -> dict:
    """
    Fan out to all specialists in parallel, aggregate results.
    Bounded by slowest agent, so monitor p95 latency per specialist.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(SPECIALIST_AGENTS)) as executor:
        futures = {
            executor.submit(invoke_specialist, name, config, query, session_id): name
            for name, config in SPECIALIST_AGENTS.items()
        }
        for future in as_completed(futures, timeout=30):
            name = futures[future]
            try:
                _, results[name] = future.result()
            except Exception:
                # One failed specialist should not discard the other results
                results[name] = None
    return results
```
Conflict resolution: Six specialist agents frequently return contradictory signals. Explicit conflict resolution rules in the aggregation agent are mandatory. Without them, you get a synthesis that quietly ignores contradictions. Define resolution priority order in the aggregation agent's system prompt: CRM data beats news sentiment for factual claims; SEC filings beat Glassdoor for financial metrics.
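The priority order described above lives in the aggregation agent's system prompt, but the same rules can also be checked in code so that contradictions are surfaced rather than silently merged. This is a minimal sketch under assumptions: the two orderings come from the text, while the claim-type names, the `SOURCE_PRIORITY` layout, the `resolve_conflict` helper, and the sample values are hypothetical.

```python
# Highest-priority source first for each kind of claim
SOURCE_PRIORITY = {
    "factual": ["crm", "news"],
    "financial": ["sec_filings", "glassdoor"],
}


def resolve_conflict(claim_type: str, signals: dict) -> dict:
    """Pick the value from the highest-priority source and flag disagreement."""
    reported = {src: val for src, val in signals.items() if val is not None}
    # More than one distinct value means the specialists contradict each other
    conflict = len(set(reported.values())) > 1
    for source in SOURCE_PRIORITY.get(claim_type, []):
        if source in reported:
            return {"source": source, "value": reported[source], "conflict": conflict}
    return {"source": None, "value": None, "conflict": conflict}


print(resolve_conflict("financial", {"glassdoor": "declining", "sec_filings": "growing"}))
```

Returning the `conflict` flag alongside the winning value lets the final synthesis mention the disagreement instead of hiding it.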
Hybrid Orchestration: Step Functions State Machine for Claims Processing
The production-grade approach combines three of the patterns (pipeline, supervisor, and debate) through AWS Step Functions, routing each request to the appropriate tier based on confidence scores and claim characteristics.
```python
import boto3
import json

sfn = boto3.client('stepfunctions', region_name='us-east-1')

# Step Functions state machine definition (simplified)
STATE_MACHINE_DEFINITION = {
    "Comment": "Multi-pattern claims routing",
    "StartAt": "ExtractAndScore",
    "States": {
        "ExtractAndScore": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:ExtractClaimData",
            "Next": "RoutingChoice"
        },
        "RoutingChoice": {
            "Type": "Choice",
            "Choices": [
                {
                    "And": [
                        {"Variable": "$.claim_value", "NumericLessThanEquals": 50000},
                        {"Variable": "$.fraud_indicators", "NumericLessThanEquals": 2},
                        {"Variable": "$.confidence", "NumericGreaterThanEquals": 0.85}
                    ],
                    "Next": "PipelineTier"
                },
                {
                    "Or": [
                        {"Variable": "$.claim_value", "NumericGreaterThan": 50000},
                        {"Variable": "$.fraud_indicators", "NumericGreaterThan": 2}
                    ],
                    "Next": "DebateTier"
                }
            ],
            "Default": "SupervisorTier"
        },
        "PipelineTier": {
            "Type": "Task",
            # Resource ARN kept as given in the source; confirm that Step Functions
            # supports this integration, or wrap invoke_flow in a Lambda function
            "Resource": "arn:aws:states:::bedrock:invokeFlow",
            "Parameters": {
                "FlowIdentifier": "BEDROCK_FLOW_ID",
                "FlowAliasIdentifier": "BEDROCK_FLOW_ALIAS"
            },
            "End": True
        },
        "SupervisorTier": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:InvokeSupervisorAgent",
            "End": True
        },
        "DebateTier": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:RunCrossModelDebate",
            "End": True
        }
    }
}


def start_claim_workflow(claim_data: dict) -> str:
    """Start the claims state machine and return the execution ARN."""
    response = sfn.start_execution(
        stateMachineArn='arn:aws:states:us-east-1:ACCOUNT:stateMachine:ClaimsOrchestration',
        input=json.dumps(claim_data)
    )
    return response["executionArn"]
```
This hybrid approach handled 40,000 claims per month across the three tiers: 72% through Bedrock Flows pipeline (8.3 seconds average), 23% through Bedrock Agent supervisor (34 seconds average), and 5% through cross-model debate (2.1 minutes average). The debate tier's $380,000 in prevented overpayments justified the processing overhead.
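The routing thresholds in the Choice state are easy to get subtly wrong, and a state machine is awkward to unit test. A small pure-Python mirror of the same rules, sketched below, lets you check boundary cases before deploying. The function name `route_claim` is hypothetical; the thresholds and the branch order are copied from the state machine definition above.

```python
def route_claim(claim_value: float, fraud_indicators: int, confidence: float) -> str:
    """Mirror of the RoutingChoice state: first matching rule wins."""
    # Low value, few fraud signals, confident extraction: cheap fixed pipeline
    if claim_value <= 50_000 and fraud_indicators <= 2 and confidence >= 0.85:
        return "PipelineTier"
    # High value or many fraud signals: pay for cross-model debate
    if claim_value > 50_000 or fraud_indicators > 2:
        return "DebateTier"
    # Remaining case: low-risk claim with low extraction confidence
    return "SupervisorTier"


for claim in [(12_000, 0, 0.93), (75_000, 0, 0.99), (12_000, 1, 0.60)]:
    print(claim, "->", route_claim(*claim))
```

The only path to the supervisor tier is a claim at or below $50,000 with at most two fraud indicators and confidence below 0.85, which is why a growing supervisor share points at extraction confidence.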
CloudWatch Monitoring: Three Metrics That Matter
Track these three signals in CloudWatch to detect when your orchestration is drifting:
Tier distribution. If pipeline percentage drops below 65%, upstream data quality has degraded — extraction agents are returning lower confidence scores, pushing more traffic to supervisor. This is the earliest warning signal for data pipeline problems before they become user-visible failures.
Handoff failure rate. Claims timing out in the supervisor tier indicate that the routing threshold (currently confidence ≥ 0.85 for the pipeline) needs adjustment. When the supervisor tier grows beyond 30% of traffic, re-examine whether the confidence threshold is set too conservatively.
Debate agreement rate. Agreement above 90% in the debate tier means the tier is overutilized — you're paying 3x token cost for cases where a single agent would have reached the same conclusion. Recalibrate routing to push more agreement-prone cases back to supervisor.
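The three thresholds above can be expressed as a simple drift check over aggregated counts. The sketch below assumes you already have per-tier request counts and a debate agreement rate for a reporting window; the function name `check_drift` and the alert labels are hypothetical, while the 65%, 30%, and 90% thresholds come from the text.

```python
def check_drift(tier_counts: dict, debate_agreement_rate: float) -> list:
    """Return alert labels for any of the three signals that crossed its threshold."""
    total = sum(tier_counts.values())
    if total == 0:
        return []
    alerts = []
    if tier_counts.get("pipeline", 0) / total < 0.65:
        alerts.append("pipeline_share_low")       # upstream data quality suspect
    if tier_counts.get("supervisor", 0) / total > 0.30:
        alerts.append("supervisor_share_high")    # confidence threshold too strict
    if debate_agreement_rate > 0.90:
        alerts.append("debate_overutilized")      # paying for redundant debates
    return alerts


healthy = check_drift({"pipeline": 7200, "supervisor": 2300, "debate": 500}, 0.77)
drifting = check_drift({"pipeline": 6000, "supervisor": 3500, "debate": 500}, 0.95)
print(healthy)
print(drifting)
```

In production the same comparisons would more likely live in CloudWatch alarms; the function form is convenient for testing the thresholds themselves.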
```python
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')


def emit_routing_metrics(tier: str, latency_ms: float, claim_value: float):
    cloudwatch.put_metric_data(
        Namespace='ClaimsOrchestration',
        MetricData=[
            {
                'MetricName': 'TierDistribution',
                'Dimensions': [{'Name': 'Tier', 'Value': tier}],
                'Value': 1,
                'Unit': 'Count'
            },
            {
                'MetricName': 'ProcessingLatencyMs',
                'Dimensions': [{'Name': 'Tier', 'Value': tier}],
                'Value': latency_ms,
                'Unit': 'Milliseconds'
            }
        ]
    )
```
Pattern Selection Decision Tree
Start here when designing a new multi-agent workflow:
1. Are 60%+ of tasks deterministic sequences? Yes → Bedrock Flows pipeline. Add supervisor fallback for exceptions.
2. Does error cost justify 3× token spend and 60–80% latency increase? Yes → Debate tier for those cases. Use different model families (Claude + Nova).
3. Does the task require simultaneous parallel research across known domains? Yes → Broadcast with Lambda fanout. Plan the conflict resolution logic before writing the fanout code.
4. Are tasks too variable to define upfront steps? Yes → Bedrock Agent supervisor. Cap maximum delegations at five.
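The four questions above can be sketched as a small function, which is useful mainly for making the non-exclusive nature of the answers explicit: a workload can qualify for several patterns at once, which is what leads to the hybrid design. The `Workload` fields and the `recommend_patterns` name are hypothetical; the 60% threshold and the supervisor fallback for pipelines come from the list.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    deterministic_share: float          # fraction of tasks that follow a fixed sequence
    error_cost_justifies_debate: bool   # worth 3x tokens and 60-80% extra latency
    needs_parallel_research: bool       # simultaneous lookups across known domains
    too_variable_for_fixed_steps: bool  # steps cannot be defined upfront


def recommend_patterns(w: Workload) -> list:
    """Walk the four questions in order and collect every pattern that applies."""
    patterns = []
    if w.deterministic_share >= 0.60:
        patterns += ["pipeline", "supervisor"]  # supervisor as exception fallback
    if w.error_cost_justifies_debate:
        patterns.append("debate")
    if w.needs_parallel_research:
        patterns.append("broadcast")
    if w.too_variable_for_fixed_steps and "supervisor" not in patterns:
        patterns.append("supervisor")
    return patterns


claims = Workload(0.72, True, False, False)
print(recommend_patterns(claims))
```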
Build incrementally: pipeline first (week 1), add supervisor fallback (weeks 2–3), then debate tier for edge cases (weeks 4–6). Monitor tier distribution weekly — shifts in the distribution are your primary diagnostic signal.
The 41% accuracy improvement and 2.8x throughput gain we measured in the hybrid system versus a pure supervisor approach came from routing the right tasks to the right pattern, not from better models or better prompts. Pattern selection is the infrastructure decision that compounds.