Tactical Edge

Multi-Agent Orchestration Patterns That Actually Work in Production

Most teams default to supervisor architecture and pay 2-4x cost penalties. Here are the four patterns that matter—Supervisor, Pipeline, Debate, Broadcast—implemented with Amazon Bedrock Agents, Bedrock Flows, and Step Functions.

Agentic AI · 18 min
By Arun Mehta, Chief Technology Officer · May 1, 2026
Multi-Agent Systems · Amazon Bedrock · AWS · Agentic AI · Enterprise Architecture

We audited 23 enterprise agentic AI projects in early 2025. Nineteen used supervisor as their primary orchestration pattern. Fourteen of those nineteen had workloads where 60% or more of tasks followed a deterministic sequence that would have been dramatically faster and cheaper as a pipeline. Those teams were paying 2-4x token cost penalties and seeing 2-3x slower response times for no reason other than defaulting to the pattern with the best GitHub documentation.

Orchestration pattern selection is the highest-leverage architectural decision in a multi-agent system. Get it wrong and no amount of prompt engineering recovers the cost and latency you've burned into the design. Get it right and the same underlying models deliver dramatically different business outcomes.

We tested four patterns against compliance review workloads using Amazon Bedrock. Here's what the data actually shows.

The Four Patterns and When Each Wins

| Dimension | Supervisor | Pipeline | Debate | Broadcast |
| --- | --- | --- | --- | --- |
| Task determinism | Low (dynamic) | High (fixed steps) | Low (ambiguous) | Medium (known sources) |
| Latency | Slowest (serial + reasoning) | Fastest for sequential | 60–80% overhead | Fast (parallel, bounded by slowest) |
| Token cost | High (coordinator overhead) | Lowest | 2–3× baseline | Proportional to agent count |
| Error tolerance | Medium | Low (propagation risk) | Highest accuracy | Medium (aggregation dependent) |
| Scaling ceiling | 5 agents before bottleneck | Unlimited (linear chain) | 2–4 debaters max | 10+ agents feasible |
| Best for | Dynamic triage, exception handling | Document processing, ETL | High-stakes decisions | Multi-source research, enrichment |

Pattern 1: Supervisor — Amazon Bedrock Agent with Action Groups

The supervisor pattern uses a coordinator agent that dynamically selects which specialist agents to invoke based on the task at hand. On Amazon Bedrock, this maps directly to a Bedrock Agent with Action Groups, where each action group represents a specialist capability.

Best for: Tasks where the processing path isn't known in advance — exception handling, dynamic triage, cases where the agent must reason about which sub-task to execute next.

Key weakness: Past five concurrent delegations, the coordinator becomes a serial bottleneck. We measured 4x response time spikes when supervisor agents tried to manage more than five parallel specialists.
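Each action group is typically backed by a Lambda function. The sketch below shows roughly what one specialist handler could look like. It assumes the function-details event format (a `parameters` list of name/type/value entries) and the `functionResponse` return shape used by Bedrock Agents, so verify both against the current documentation before relying on them. `classify_clause`, the `clause_text` parameter, and the keyword rule are hypothetical stand-ins for real specialist logic.

```python
import json


def classify_clause(text: str) -> str:
    """Hypothetical specialist: a keyword rule standing in for real logic."""
    return "indemnification" if "indemnif" in text.lower() else "other"


def lambda_handler(event: dict, context: object) -> dict:
    # Assumed event shape: parameters arrive as a list of {name, type, value}.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    label = classify_clause(params.get("clause_text", ""))

    # Assumed response shape for function-details action groups.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {
                    "TEXT": {"body": json.dumps({"clause_type": label})}
                }
            },
        },
    }
```

Keeping each specialist behind a narrow handler like this makes it easy to reuse the same capability from a pipeline later without the coordinator.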

python
import boto3
import json

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

def invoke_supervisor_agent(
    agent_id: str,
    agent_alias_id: str,
    session_id: str,
    task: str
) -> str:
    """
    Invoke a Bedrock Agent acting as supervisor.
    The agent's action groups define the available specialist capabilities.
    """
    response = bedrock_agent_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=task,
        enableTrace=True  # Capture reasoning chain for observability
    )

    completion = ""
    for event in response.get("completion"):
        if "chunk" in event:
            completion += event["chunk"]["bytes"].decode()

    return completion

# Cap delegations per coordinator invocation to avoid bottlenecks.
# invoke_agent does not enforce this value; apply it in your own delegation logic.
MAX_DELEGATIONS = 5

Recommended guardrail: Cap token budgets per delegation by setting max tokens in the agent's foundation model configuration. Uncapped supervisor agents can run up $40-60 in token costs on complex multi-hop reasoning before returning a result.
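One way to make both caps explicit in application code is a small budget object that every delegation must pass through. This is a minimal sketch under stated assumptions: `DelegationBudget`, `BudgetExceeded`, and the 20,000-token default are hypothetical and not part of any Bedrock API, and token counts are assumed to come from your own usage accounting.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a delegation would break the delegation or token cap."""


@dataclass
class DelegationBudget:
    max_delegations: int = 5      # mirrors the five-delegation ceiling above
    max_tokens: int = 20_000      # illustrative value, tune per workload
    delegations: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Record one delegation, or raise if either cap would be exceeded."""
        if self.delegations + 1 > self.max_delegations:
            raise BudgetExceeded("delegation cap reached")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        self.delegations += 1
        self.tokens_used += tokens
```

Calling `budget.charge(estimated_tokens)` before each specialist call turns a runaway reasoning loop into an exception you can log and route to a human.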

Pattern 2: Pipeline — Amazon Bedrock Flows with Timeout-Based Lane Splitting

The pipeline pattern processes tasks through a fixed sequence of specialized agents, where each agent's output becomes the next agent's input. On Amazon Bedrock, this maps to Bedrock Flows (Prompt Flows) for managing the sequential handoffs, with AWS Step Functions handling conditional routing between lanes.

Performance: 3.2x throughput improvement versus supervisor for contract review — 14.6 seconds per document versus 47 seconds with a supervisor coordinator.

Failure mode: One slow agent blocks all downstream processing. Implement timeout-based lane splitting via Step Functions to route slow cases to an alternative path rather than letting them block the queue.

python
import boto3
import json
import time

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')
sfn = boto3.client('stepfunctions', region_name='us-east-1')

def run_pipeline_flow(flow_id: str, flow_alias_id: str, input_data: dict) -> dict:
    """
    Execute a Bedrock Flow (pipeline) and label the result by elapsed time.
    The call itself is not interrupted; callers pass "slow" results to
    route_to_step_function() for the fallback lane.
    """
    start = time.time()

    response = bedrock_agent_runtime.invoke_flow(
        flowIdentifier=flow_id,
        flowAliasIdentifier=flow_alias_id,
        inputs=[{
            "content": {"document": json.dumps(input_data)},
            "nodeName": "FlowInputNode",
            "nodeOutputName": "document"
        }]
    )

    result = {}
    for event in response.get("responseStream"):
        if "flowOutputEvent" in event:
            document = event["flowOutputEvent"]["content"]["document"]
            # The output document may arrive as a JSON string or as parsed JSON
            result = json.loads(document) if isinstance(document, str) else document

    elapsed = time.time() - start

    if elapsed > 10.0:
        # Mark slow cases for the supervisor fallback via Step Functions
        result["lane"] = "slow"
        result["elapsed_ms"] = int(elapsed * 1000)
    else:
        result["lane"] = "fast"

    return result

def route_to_step_function(state_machine_arn: str, payload: dict) -> str:
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload)
    )
    return response["executionArn"]

[Figure: Four Orchestration Patterns on Amazon Bedrock]
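Timeout-based lane splitting can also be enforced while the fast path is still running, rather than measured afterwards. The following is a minimal sketch using only the standard library; `run_with_lane_split` and `slow_stage` are hypothetical names, and the deadline values are illustrative. Note the limitation: a Python thread cannot be killed, so the abandoned call still finishes in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable


def run_with_lane_split(fast_path: Callable[[], Any], timeout_s: float) -> dict:
    """Run fast_path; if it misses the deadline, label the case for the slow lane."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fast_path)
    try:
        return {"lane": "fast", "result": future.result(timeout=timeout_s)}
    except FutureTimeout:
        # The worker keeps running in the background; we only stop waiting.
        return {"lane": "slow", "result": None}
    finally:
        executor.shutdown(wait=False)


def slow_stage() -> str:
    """Hypothetical stage that overruns its deadline."""
    time.sleep(0.3)
    return "done"
```

A caller would hand any `"slow"` result to the Step Functions fallback, for example by passing the original payload to `route_to_step_function`.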

The pipeline wins whenever tasks follow predictable sequences. Contract review is a canonical example: extract clauses, classify clause types, score risk, generate summary. Each step has a known input schema and produces a known output schema. There's no dynamic replanning required. Forcing this into a supervisor pattern costs you 47 seconds instead of 14.6 seconds per document, every document, at scale.
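The "known input schema, known output schema" property is what makes a pipeline cheap: each stage is a plain function over a typed record, with no planner in between. The sketch below illustrates the shape with placeholder logic; the `Clause` record, the keyword rules, and the risk values are all hypothetical, and in a real system each stage would call a model or a Bedrock Flow node.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clause:
    text: str
    clause_type: str = ""
    risk: float = 0.0


def extract(document: str) -> List[Clause]:
    # Placeholder extraction: treat blank-line-separated paragraphs as clauses.
    return [Clause(text=p.strip()) for p in document.split("\n\n") if p.strip()]


def classify(clauses: List[Clause]) -> List[Clause]:
    for c in clauses:
        c.clause_type = "liability" if "liab" in c.text.lower() else "general"
    return clauses


def score(clauses: List[Clause]) -> List[Clause]:
    for c in clauses:
        c.risk = 0.8 if c.clause_type == "liability" else 0.1
    return clauses


def summarize(clauses: List[Clause]) -> str:
    risky = [c for c in clauses if c.risk >= 0.5]
    return f"{len(clauses)} clauses, {len(risky)} high risk"


def run_pipeline(document: str) -> str:
    # Fixed order: extract -> classify -> score -> summarize.
    return summarize(score(classify(extract(document))))
```

Because the order never changes, every stage can be tested and timed in isolation, which is exactly what a supervisor's dynamic planning makes hard.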

Pattern 3: Debate — Cross-Model on Amazon Bedrock InvokeModel

The debate pattern fans out the same task to two agents built on different model families, then uses a third model as a judge to resolve disagreements. The cross-model approach is critical: using the same model family for both debaters cuts the disagreement rate from 23% to 8%, which removes most of the accuracy benefit.

Accuracy improvement: 34% fewer misclassifications than the single-agent baseline, 31 versus 47 across 600 ambiguous clauses.

Cost: Token usage roughly triples. Latency increases 60–80%. Use debate only where the error cost justifies it: high-stakes decisions, fraud detection, compliance determinations.
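The 34% figure follows directly from the two misclassification counts. A short calculation, using only the numbers quoted above (the function name is illustrative):

```python
def relative_error_reduction(baseline_errors: int, debate_errors: int) -> float:
    """Fraction of baseline errors that the debate setup eliminated."""
    return (baseline_errors - debate_errors) / baseline_errors


TOTAL_CLAUSES = 600
BASELINE_ERRORS = 47   # single-agent misclassifications
DEBATE_ERRORS = 31     # cross-model debate misclassifications

reduction = relative_error_reduction(BASELINE_ERRORS, DEBATE_ERRORS)
baseline_rate = BASELINE_ERRORS / TOTAL_CLAUSES
debate_rate = DEBATE_ERRORS / TOTAL_CLAUSES
```

In absolute terms the error rate moves from about 7.8% to about 5.2% of clauses, which is the number to weigh against the roughly tripled token spend.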

python
import boto3
import json
import asyncio
from concurrent.futures import ThreadPoolExecutor

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

def invoke_model_sync(model_id: str, prompt: str, document: str) -> str:
    """Synchronous Bedrock InvokeModel call for thread pool execution."""
    if "anthropic" in model_id:
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2000,
            "messages": [{"role": "user", "content": f"{prompt}\n\n{document}"}]
        }
        response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
        return json.loads(response["body"].read())["content"][0]["text"]
    else:
        # Amazon Nova / other models
        body = {
            "messages": [{"role": "user", "content": [{"text": f"{prompt}\n\n{document}"}]}]
        }
        response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
        return json.loads(response["body"].read())["output"]["message"]["content"][0]["text"]

async def run_debate(document: str, debater_prompt_a: str, debater_prompt_b: str, judge_prompt: str) -> dict:
    """
    Fan out to two different model families for cross-model debate.
    Cross-model disagreement rate: 23% (vs 8% same-family).
    Judge resolves correctly 89% of the time.
    """
    loop = asyncio.get_running_loop()

    # Different model families reduce shared-bias false agreements
    with ThreadPoolExecutor(max_workers=2) as executor:
        analysis_a, analysis_b = await asyncio.gather(
            loop.run_in_executor(
                executor, invoke_model_sync,
                "anthropic.claude-3-5-sonnet-20241022-v2:0",
                debater_prompt_a, document
            ),
            loop.run_in_executor(
                executor, invoke_model_sync,
                "amazon.nova-pro-v1:0",
                debater_prompt_b, document
            )
        )

    # Judge with a third model to avoid bias; run it off the event loop
    judgment = await loop.run_in_executor(
        None, invoke_model_sync,
        "anthropic.claude-opus-4-7:0",
        judge_prompt,
        f"Analysis A:\n{analysis_a}\n\nAnalysis B:\n{analysis_b}"
    )

    return {
        "judgment": judgment,
        "analysis_a": analysis_a,
        "analysis_b": analysis_b,
        "models_used": [
            "claude-3-5-sonnet-v2",
            "nova-pro-v1",
            "claude-opus-4-7 (judge)"
        ]
    }

In an insurance claims system processing 40,000 claims per month (the hybrid deployment described below), the debate tier on claims over $50,000 prevented an estimated $380,000 in annual overpayments. The 2-4 minute processing time for the roughly 5% of claims routed to debate was economically justified by the error cost prevented.

Cross-Model Is Non-Negotiable for Debate
Using Claude 3.5 Sonnet for both debaters drops disagreement rates from 23% to 8% and reduces the accuracy benefit from 34% to roughly 12%. The whole point of debate is surfacing cases where different reasoning paths reach different conclusions. Same-model debate is expensive consensus theater, not error catching. Use at minimum two different model families — Claude and Nova, or Claude and Titan.
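Since the value of debate comes from disagreement, it helps to measure the disagreement rate directly and, as a cost-saving variation, to call the judge only when the debaters differ. The sketch below assumes each debater's output has already been reduced to a short label; `normalize`, `resolve`, and `disagreement_rate` are hypothetical helpers, and the `judge` callable stands in for a real model call.

```python
from typing import Callable, Iterable, Tuple


def normalize(label: str) -> str:
    """Compare labels case-insensitively and ignore surrounding whitespace."""
    return label.strip().lower()


def resolve(label_a: str, label_b: str, judge: Callable[[str, str], str]) -> dict:
    """Return the agreed label, or ask the judge only when the debaters differ."""
    a, b = normalize(label_a), normalize(label_b)
    if a == b:
        return {"label": a, "judged": False}
    return {"label": normalize(judge(label_a, label_b)), "judged": True}


def disagreement_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (label_a, label_b) pairs where the debaters disagree."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    differing = sum(1 for a, b in pairs if normalize(a) != normalize(b))
    return differing / len(pairs)
```

Tracking `disagreement_rate` per model pairing is a cheap way to confirm that a chosen pair behaves more like the 23% cross-family case than the 8% same-family case.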

Pattern 4: Broadcast — Parallel Bedrock Agents via Lambda Fanout

The broadcast pattern dispatches the same query to multiple specialist agents simultaneously, then aggregates their outputs. On Amazon Bedrock, this combines Bedrock Agents for each specialist with AWS Lambda for the fanout and aggregation layer.

Real-world result: Prospect research system broadcasting to six specialist agents simultaneously — LinkedIn signals, SEC filings, news sentiment, patent activity, Glassdoor culture signals, CRM history — reduced processing from 12 minutes to 90 seconds.
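The speedup comes from replacing a sum of latencies with a maximum. The numbers below are made up purely to illustrate that relationship and are not the measurements from the prospect research system; the function names are likewise illustrative.

```python
from typing import Dict

# Illustrative per-specialist latencies in seconds (not measured values).
ILLUSTRATIVE_LATENCY_S: Dict[str, float] = {
    "linkedin": 70.0,
    "sec_filings": 85.0,
    "news": 40.0,
    "patents": 60.0,
    "glassdoor": 30.0,
    "crm": 15.0,
}


def sequential_seconds(latencies: Dict[str, float]) -> float:
    """Calling specialists one after another costs the sum of their latencies."""
    return sum(latencies.values())


def broadcast_seconds(latencies: Dict[str, float], aggregation_s: float = 0.0) -> float:
    """Fanning out in parallel costs the slowest specialist plus aggregation."""
    return max(latencies.values()) + aggregation_s
```

With these sample values the sequential path takes 300 seconds while the broadcast path takes 85 seconds plus aggregation, which is why the slowest specialist's p95 latency is the number to watch.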

python
import boto3
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

SPECIALIST_AGENTS = {
    "linkedin": {"agent_id": "AGENT_ID_1", "alias_id": "ALIAS_1"},
    "sec_filings": {"agent_id": "AGENT_ID_2", "alias_id": "ALIAS_2"},
    "news": {"agent_id": "AGENT_ID_3", "alias_id": "ALIAS_3"},
    "patents": {"agent_id": "AGENT_ID_4", "alias_id": "ALIAS_4"},
    "glassdoor": {"agent_id": "AGENT_ID_5", "alias_id": "ALIAS_5"},
    "crm": {"agent_id": "AGENT_ID_6", "alias_id": "ALIAS_6"},
}

def invoke_specialist(specialist_name: str, config: dict, query: str, session_id: str) -> tuple[str, str]:
    response = bedrock_agent_runtime.invoke_agent(
        agentId=config["agent_id"],
        agentAliasId=config["alias_id"],
        sessionId=f"{session_id}-{specialist_name}",
        inputText=query
    )
    result = ""
    for event in response.get("completion"):
        if "chunk" in event:
            result += event["chunk"]["bytes"].decode()
    return specialist_name, result

def broadcast_and_aggregate(query: str, session_id: str) -> dict:
    """
    Fan out to all specialists in parallel, aggregate results.
    Bounded by slowest agent; monitor p95 latency per specialist.
    """
    results = {}

    with ThreadPoolExecutor(max_workers=len(SPECIALIST_AGENTS)) as executor:
        futures = {
            executor.submit(invoke_specialist, name, config, query, session_id): name
            for name, config in SPECIALIST_AGENTS.items()
        }

        try:
            for future in as_completed(futures, timeout=30):
                name = futures[future]
                try:
                    _, results[name] = future.result()
                except Exception as exc:
                    # One failed specialist should not sink the whole broadcast
                    results[name] = f"ERROR: {exc}"
        except FuturesTimeout:
            # Record specialists that missed the 30-second deadline.
            # Leaving the with-block still waits for their threads to finish.
            for name in futures.values():
                results.setdefault(name, "ERROR: timed out")

    return results

Conflict resolution: Six specialist agents frequently return contradictory signals. Explicit conflict resolution rules in the aggregation agent are mandatory — without them, you get a synthesis that quietly ignores contradictions. Define resolution priority order in the aggregation agent's system prompt: CRM data beats news sentiment for factual claims; SEC filings beat Glassdoor for financial metrics.
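The same priority order can also be enforced deterministically before the aggregation agent sees the data, as a complement to the system prompt. This is a minimal sketch: the `PRIORITY` table encodes only the two rules stated above, and `resolve_conflict`, the claim-type names, and the sample signal values are hypothetical.

```python
from typing import Dict, List

# Source priority per claim type, highest priority first.
PRIORITY: Dict[str, List[str]] = {
    "factual": ["crm", "news"],
    "financial": ["sec_filings", "glassdoor"],
}


def resolve_conflict(claim_type: str, signals: Dict[str, str]) -> dict:
    """Pick the value from the highest-priority source and flag contradictions."""
    if not signals:
        raise ValueError("no signals to resolve")
    order = PRIORITY.get(claim_type, [])
    # Ranked sources first, then any unranked sources in a stable order.
    ranked = [s for s in order if s in signals]
    ranked += sorted(s for s in signals if s not in order)
    winner = ranked[0]
    return {
        "source": winner,
        "value": signals[winner],
        "conflict": len(set(signals.values())) > 1,
    }
```

Surfacing the `conflict` flag in the final output keeps contradictions visible instead of letting the synthesis quietly drop them.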

Hybrid Orchestration: Step Functions State Machine for Claims Processing

The production-grade approach combines three of the four patterns (pipeline, supervisor, and debate) through AWS Step Functions, routing each request to the appropriate tier based on confidence scores and claim characteristics.

python
import boto3
import json

sfn = boto3.client('stepfunctions', region_name='us-east-1')

# Step Functions state machine definition (simplified)
STATE_MACHINE_DEFINITION = {
    "Comment": "Multi-pattern claims routing",
    "StartAt": "ExtractAndScore",
    "States": {
        "ExtractAndScore": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:ExtractClaimData",
            "Next": "RoutingChoice"
        },
        "RoutingChoice": {
            "Type": "Choice",
            "Choices": [
                {
                    "And": [
                        {"Variable": "$.claim_value", "NumericLessThanEquals": 50000},
                        {"Variable": "$.fraud_indicators", "NumericLessThanEquals": 2},
                        {"Variable": "$.confidence", "NumericGreaterThanEquals": 0.85}
                    ],
                    "Next": "PipelineTier"
                },
                {
                    "Or": [
                        {"Variable": "$.claim_value", "NumericGreaterThan": 50000},
                        {"Variable": "$.fraud_indicators", "NumericGreaterThan": 2}
                    ],
                    "Next": "DebateTier"
                }
            ],
            "Default": "SupervisorTier"
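        },
        "PipelineTier": {
            "Type": "Task",
            # Assumption: confirm that this service integration is available to you;
            # if it is not, wrap invoke_flow in a Lambda like the other two tiers.
            "Resource": "arn:aws:states:::bedrock:invokeFlow",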
            "Parameters": {
                "FlowIdentifier": "BEDROCK_FLOW_ID",
                "FlowAliasIdentifier": "BEDROCK_FLOW_ALIAS"
            },
            "End": True
        },
        "SupervisorTier": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:InvokeSupervisorAgent",
            "End": True
        },
        "DebateTier": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:RunCrossModelDebate",
            "End": True
        }
    }
}

def start_claim_workflow(claim_data: dict) -> str:
    response = sfn.start_execution(
        stateMachineArn='arn:aws:states:us-east-1:ACCOUNT:stateMachine:ClaimsOrchestration',
        input=json.dumps(claim_data)
    )
    return response["executionArn"]
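
The Choice state's rules are easy to get subtly wrong, so it can help to mirror them in a plain function and unit test the thresholds before deploying. The sketch below reproduces the three branches of `RoutingChoice` as written above; `route_claim` is a hypothetical helper and is not called by Step Functions.

```python
def route_claim(claim: dict) -> str:
    """Mirror of the RoutingChoice state: returns the name of the next state."""
    value = claim["claim_value"]
    fraud = claim["fraud_indicators"]
    confidence = claim["confidence"]

    # Low value, few fraud indicators, confident extraction -> pipeline.
    if value <= 50000 and fraud <= 2 and confidence >= 0.85:
        return "PipelineTier"
    # High value or many fraud indicators -> debate, regardless of confidence.
    if value > 50000 or fraud > 2:
        return "DebateTier"
    # Everything else (low value, low confidence) -> supervisor.
    return "SupervisorTier"
```

Boundary cases are the ones worth pinning down: a claim of exactly $50,000 with confidence 0.85 stays in the pipeline, while the same claim at confidence 0.84 falls through to the supervisor.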

This hybrid approach handled 40,000 claims per month across the three tiers: 72% through Bedrock Flows pipeline (8.3 seconds average), 23% through Bedrock Agent supervisor (34 seconds average), and 5% through cross-model debate (2.1 minutes average). The debate tier's $380,000 in prevented overpayments justified the processing overhead.
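A quick calculation from the tier shares and average latencies quoted above shows the blended per-claim latency and how much of the total processing time each tier consumes. Only the variable names are new here; the inputs are the figures from the paragraph above.

```python
MONTHLY_CLAIMS = 40_000

# tier -> (share of claims, average latency in seconds)
TIERS = {
    "pipeline": (0.72, 8.3),
    "supervisor": (0.23, 34.0),
    "debate": (0.05, 2.1 * 60),
}

# Each tier's contribution to the average latency of a randomly chosen claim.
contribution = {name: share * latency for name, (share, latency) in TIERS.items()}
blended_latency_s = sum(contribution.values())

# Share of total processing time spent in the debate tier.
debate_time_share = contribution["debate"] / blended_latency_s

claims_per_tier = {name: round(MONTHLY_CLAIMS * share) for name, (share, _) in TIERS.items()}
```

The blended average works out to about 20 seconds per claim, and the debate tier accounts for roughly 31% of total processing time while handling only 5% of claims, which is why its routing rule deserves the closest monitoring.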

[Figure: Multi-Agent Pattern Performance Metrics]

CloudWatch Monitoring: Three Metrics That Matter

Track these three signals in CloudWatch to detect when your orchestration is drifting:

Tier distribution. If pipeline percentage drops below 65%, upstream data quality has degraded — extraction agents are returning lower confidence scores, pushing more traffic to supervisor. This is the earliest warning signal for data pipeline problems before they become user-visible failures.

Handoff failure rate. Claims timing out in the supervisor tier indicate that the routing threshold (currently confidence ≥ 0.85 for the pipeline tier) needs adjustment. When the supervisor tier grows beyond 30% of traffic, re-examine whether the confidence threshold is set too conservatively.

Debate agreement rate. Agreement above 90% in the debate tier means the tier is overutilized — you're paying 3x token cost for cases where a single agent would have reached the same conclusion. Recalibrate routing to push more agreement-prone cases back to supervisor.

python
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

def emit_routing_metrics(tier: str, latency_ms: float, claim_value: float):
    cloudwatch.put_metric_data(
        Namespace='ClaimsOrchestration',
        MetricData=[
            {
                'MetricName': 'TierDistribution',
                'Dimensions': [{'Name': 'Tier', 'Value': tier}],
                'Value': 1,
                'Unit': 'Count'
            },
            {
                'MetricName': 'ProcessingLatencyMs',
                'Dimensions': [{'Name': 'Tier', 'Value': tier}],
                'Value': latency_ms,
                'Unit': 'Milliseconds'
            }
        ]
    )
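
The three thresholds described above (pipeline share below 65%, supervisor share above 30%, debate agreement above 90%) can be checked with a small function over aggregated counts. This is a sketch under the assumption that you already collect per-tier counts and debate agreement counts for a time window; `drift_alerts` and the alert names are hypothetical.

```python
from typing import Dict, List


def drift_alerts(tier_counts: Dict[str, int],
                 debate_agreements: int,
                 debate_total: int) -> List[str]:
    """Return the names of any orchestration drift signals that fired."""
    alerts: List[str] = []
    total = sum(tier_counts.values())
    if total == 0:
        return alerts

    if tier_counts.get("pipeline", 0) / total < 0.65:
        alerts.append("pipeline_share_low")      # upstream data quality degrading
    if tier_counts.get("supervisor", 0) / total > 0.30:
        alerts.append("supervisor_share_high")   # confidence threshold too strict
    if debate_total > 0 and debate_agreements / debate_total > 0.90:
        alerts.append("debate_overused")         # paying for easy cases
    return alerts
```

In practice the same thresholds could be expressed as CloudWatch alarms; the function form is mainly useful for testing the logic and for weekly reports.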

Pattern Selection Decision Tree

Start here when designing a new multi-agent workflow:

1. Are 60%+ of tasks deterministic sequences? Yes → Bedrock Flows pipeline. Add supervisor fallback for exceptions.

2. Does error cost justify 3× token spend and 60–80% latency increase? Yes → Debate tier for those cases. Use different model families (Claude + Nova).

3. Does the task require simultaneous parallel research across known domains? Yes → Broadcast with Lambda fanout. Plan the conflict resolution logic before writing the fanout code.

4. Are tasks too variable to define upfront steps? Yes → Bedrock Agent supervisor. Cap maximum delegations at five.
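The four questions above can be read as a checklist that yields a set of patterns rather than a single answer. The sketch below is one interpretation of that reading, not a definitive encoding: `select_patterns` and its parameter names are hypothetical, and it assumes the questions are evaluated independently in the order listed.

```python
from typing import List


def select_patterns(deterministic_share: float,
                    error_cost_justifies_debate: bool,
                    parallel_known_domains: bool,
                    too_variable_for_fixed_steps: bool) -> List[str]:
    """Map answers to the four design questions onto orchestration patterns."""
    patterns: List[str] = []
    if deterministic_share >= 0.60:
        # Question 1: pipeline first, with a supervisor fallback for exceptions.
        patterns += ["pipeline", "supervisor_fallback"]
    if error_cost_justifies_debate:
        # Question 2: debate tier for the high-stakes subset.
        patterns.append("debate")
    if parallel_known_domains:
        # Question 3: broadcast with a fanout layer.
        patterns.append("broadcast")
    if too_variable_for_fixed_steps and "supervisor_fallback" not in patterns:
        # Question 4: supervisor as the primary pattern, capped at five delegations.
        patterns.append("supervisor")
    return patterns
```

For a claims-style workload, `select_patterns(0.72, True, False, True)` returns the pipeline, supervisor fallback, and debate combination used in the hybrid design.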

Build incrementally: pipeline first (week 1), add supervisor fallback (weeks 2–3), then debate tier for edge cases (weeks 4–6). Monitor tier distribution weekly — shifts in the distribution are your primary diagnostic signal.

The 41% accuracy improvement and 2.8x throughput gain we measured in the hybrid system versus a pure supervisor approach came from routing the right tasks to the right pattern, not from better models or better prompts. Pattern selection is the infrastructure decision that compounds.


