Agentic Application Architecture Patterns on AWS Bedrock

When I design agentic applications, I start with the business workflow, not a diagram full of agents. The moment an agent can change data, spend money, send a customer message, or call an internal API, I treat it like production software.

The use case is an ecommerce support assistant. A customer asks, "Can I return this damaged item and get a refund?" The system needs policy, order history, fraud signals, a draft answer, and sometimes a refund case. Policy answers can be synchronous. Larger refunds stop for review.

My rule is simple: let the agent reason, but keep authority in explicit system boundaries. A model can recommend a refund. The system decides whether it is safe.

The AWS Bedrock architecture I would use

[Figure: Numbered AWS Bedrock architecture for an agentic ecommerce support assistant]

The numbered flow with the support use case

I use the numbered flow to force myself to explain the runtime behavior, not only the services in the box diagram.

1. API Gateway receives the request. The app sends the support question with user identity, channel, and an idempotency key.
2. Lambda routes the request. I keep validation, duplicate detection, tenant lookup, and risk flags outside the model.
3. Amazon Bedrock owns the agent work. A single agent answers policy questions; a supervisor routes wider questions to order, payment, policy, or fraud specialists.
4. EventBridge records longer work. Refunds, replacements, and account changes become events instead of long HTTP calls.
5. SQS absorbs background pressure. Evidence collection and retries should survive one request ending.
6. Step Functions controls the workflow. The workflow can call agents, retry, branch, and pause.
7. The human reviewer sits in the middle. High-value refunds, suspected abuse, and risky outbound messages need an explicit approve or reject decision.
8. DynamoDB, S3, and IAM hold state, facts, and permissions. DynamoDB tracks runs, S3 stores policy documents, and IAM narrows tool access.
9. CloudWatch and X-Ray make it inspectable. I want trace ids across API Gateway, Lambda, Bedrock, workflow steps, retries, and approval outcomes.
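
To make the "retry, branch, and pause" part of the flow concrete, here is a minimal sketch of a Step Functions definition built as a Python dict. The ARNs, queue URL, state names, and the shape of the reviewer's reply (`{"decision": "APPROVE"}`) are all assumptions for illustration, not values from a real deployment. The pause uses the `.waitForTaskToken` service integration, so the workflow resumes only when a reviewer tool sends the task token back.

```python
import json

# Hypothetical resources; replace with the real ARNs and queue URL.
EVIDENCE_FUNCTION_ARN = "arn:aws:lambda:us-east-1:111122223333:function:collect-evidence"
REFUND_FUNCTION_ARN = "arn:aws:lambda:us-east-1:111122223333:function:issue-refund"
REVIEW_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/refund-review"


def build_refund_workflow():
    """Return an Amazon States Language definition for the refund path."""
    return {
        "Comment": "Refund workflow sketch: collect evidence, pause for review, refund",
        "StartAt": "CollectEvidence",
        "States": {
            "CollectEvidence": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": EVIDENCE_FUNCTION_ARN, "Payload.$": "$"},
                "ResultPath": "$.evidence",
                # Retry transient failures with exponential backoff.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                "Next": "NeedsReview",
            },
            "NeedsReview": {
                "Type": "Choice",
                "Choices": [{
                    "Variable": "$.requiresHumanReview",
                    "BooleanEquals": True,
                    "Next": "WaitForReviewer",
                }],
                "Default": "IssueRefund",
            },
            "WaitForReviewer": {
                "Type": "Task",
                # The execution pauses here until the task token is returned.
                "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
                "Parameters": {
                    "QueueUrl": REVIEW_QUEUE_URL,
                    "MessageBody": {"runId.$": "$.runId", "taskToken.$": "$$.Task.Token"},
                },
                "TimeoutSeconds": 86400,  # an unanswered review expires after 24 hours
                "ResultPath": "$.review",
                "Catch": [{
                    "ErrorEquals": ["States.Timeout"],
                    "ResultPath": "$.error",
                    "Next": "Rejected",
                }],
                "Next": "ReviewOutcome",
            },
            "ReviewOutcome": {
                "Type": "Choice",
                "Choices": [{
                    "Variable": "$.review.decision",
                    "StringEquals": "APPROVE",
                    "Next": "IssueRefund",
                }],
                "Default": "Rejected",
            },
            "IssueRefund": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                # No Retry here: only add one if the refund call is idempotent.
                "Parameters": {"FunctionName": REFUND_FUNCTION_ARN, "Payload.$": "$"},
                "End": True,
            },
            "Rejected": {"Type": "Succeed"},
        },
    }


def undefined_targets(definition):
    """Return transition targets that do not name a state in the definition."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        targets.update(t for t in (state.get("Next"), state.get("Default")) if t)
        for rule in state.get("Choices", []) + state.get("Catch", []):
            targets.add(rule["Next"])
    return targets - set(states)


if __name__ == "__main__":
    workflow = build_refund_workflow()
    assert not undefined_targets(workflow)
    print(json.dumps(workflow, indent=2))
```

The `undefined_targets` helper is a cheap local check I would run before deploying, since a misspelled state name is easy to miss in a large definition.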

Single agent is where I start

What I learnt is that a single agent is easier to evaluate, easier to trace, and easier to explain to a product owner.

For this assistant, I would start with one agent that retrieves policy from a Bedrock Knowledge Base, summarizes order facts, and drafts a response. For new builds, I would also look at Amazon Bedrock AgentCore for runtime, memory, gateway, identity, and observability.

I split only when prompts fight each other, tool permissions differ, or one role needs a tighter contract. A policy agent should not have the same tools as a payment adjustment agent.
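
One way to picture the "different tools per role" rule is an explicit allowlist checked before any tool runs. This is a minimal sketch under my own assumptions: the role names, tool names, and the `authorize_tool_call` helper are hypothetical, and in a real deployment IAM policies on each agent's execution role would be the enforcing layer, with a check like this as a second guard.

```python
# Hypothetical roles and tools; each agent role gets its own narrow set.
TOOLS_BY_ROLE = {
    "policy_agent": frozenset({"search_return_policy"}),
    "order_agent": frozenset({"get_order", "get_shipment_status"}),
    "payment_agent": frozenset({"get_payment", "create_refund_case"}),
}


def authorize_tool_call(role, tool_name):
    """Return the tool name if the role may call it, else raise PermissionError."""
    allowed = TOOLS_BY_ROLE.get(role, frozenset())  # unknown roles get no tools
    if tool_name not in allowed:
        raise PermissionError(f"{role} is not allowed to call {tool_name}")
    return tool_name


if __name__ == "__main__":
    print(authorize_tool_call("payment_agent", "create_refund_case"))
    try:
        authorize_tool_call("policy_agent", "create_refund_case")
    except PermissionError as error:
        print(f"blocked: {error}")
```

If two roles end up with identical tool sets, that is usually a hint they do not need to be separate agents yet.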

Multi agent is for ownership, not decoration

When I add more agents, I want the split to reduce confusion in production instead of spreading the same confusion across more prompts.

Amazon Bedrock supports a supervisor and collaborator model where the supervisor routes work to specialist agents. I would use it for separate domains: order lookup, payment rules, return policy, fraud review, and customer communication. The supervisor owns the final answer.

The mistake I try to avoid is making every backend call an agent. A tool does one narrow thing. An agent decides which tool to use and how to combine the result with the conversation.

Synchronous versus asynchronous is the real production split

I use latency and risk to decide the execution pattern. If the user is waiting, I keep the path short. If the action has risk, I move it into a workflow.

A return policy answer can be synchronous. A refund decision may need fraud checks, approval, payment provider calls, and audit history. That belongs in an asynchronous workflow with a job id and status page.

import json
import os
import time
import uuid

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-agent-runtime")
events = boto3.client("events")
dynamodb = boto3.resource("dynamodb")

run_table = dynamodb.Table(os.environ["RUN_TABLE"])

SYNC_INTENTS = {"ASK_POLICY", "CHECK_ORDER_STATUS"}
HUMAN_REVIEW_INTENTS = {"REQUEST_REFUND", "CHANGE_ADDRESS", "SEND_APOLOGY_CREDIT"}
REFUND_REVIEW_LIMIT_CENTS = 2500


def route_decision(body):
    intent = body.get("intent", "ASK_POLICY")
    refund_cents = int(body.get("refundAmountCents") or 0)

    if intent in HUMAN_REVIEW_INTENTS or refund_cents > REFUND_REVIEW_LIMIT_CENTS:
        return {"mode": "async", "requiresHumanReview": True}

    if intent in SYNC_INTENTS:
        return {"mode": "sync", "requiresHumanReview": False}

    return {"mode": "async", "requiresHumanReview": True}


def lambda_handler(event, context):
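    # Reject malformed requests with a 400 instead of letting them surface as 500s.
    try:
        body = json.loads(event.get("body") or "{}")
        user_id = body["userId"]
        message = body["message"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"statusCode": 400, "body": json.dumps({"error": "userId and message are required"})}
    # runId doubles as the idempotency key; a generated id cannot deduplicate client retries.
    run_id = body.get("runId") or str(uuid.uuid4())
    decision = route_decision(body)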

    try:
        run_table.put_item(
            Item={
                "pk": f"RUN#{run_id}",
                "userId": user_id,
                "intent": body.get("intent", "ASK_POLICY"),
                "status": "RECEIVED",
                "requiresHumanReview": decision["requiresHumanReview"],
                "createdAt": int(time.time()),
            },
            ConditionExpression="attribute_not_exists(pk)",
        )
    except ClientError as error:
        if error.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        return {"statusCode": 409, "body": json.dumps({"runId": run_id})}

    if decision["mode"] == "sync":
        response = bedrock.invoke_agent(
            agentId=os.environ["SUPPORT_AGENT_ID"],
            agentAliasId=os.environ["SUPPORT_AGENT_ALIAS_ID"],
            sessionId=f"user-{user_id}",
            inputText=message,
        )
        answer = collect_text(response["completion"])
        return {"statusCode": 200, "body": json.dumps({"answer": answer})}

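    # put_events can succeed at the HTTP level while individual entries fail,
    # so check FailedEntryCount before telling the caller the work is queued.
    result = events.put_events(
        Entries=[{
            "Source": "support.agent",
            "DetailType": "SupportAgentWorkRequested",
            "EventBusName": os.environ["EVENT_BUS_NAME"],
            "Detail": json.dumps({
                "runId": run_id,
                "userId": user_id,
                "message": message,
                "requiresHumanReview": decision["requiresHumanReview"],
            }),
        }]
    )
    if result.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"EventBridge rejected the event for run {run_id}")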

    return {
        "statusCode": 202,
        "body": json.dumps({"runId": run_id, "status": "PENDING_REVIEW"}),
    }


def collect_text(completion_stream):
    chunks = []
    for event in completion_stream:
        if "chunk" in event:
            chunks.append(event["chunk"]["bytes"].decode("utf-8"))
    return "".join(chunks)

The router makes the production decision before the agent acts. Low-risk intents go through Bedrock synchronously. Refunds, address changes, credits, and unknown intents move to the workflow path so Step Functions and a reviewer control the risk. A duplicate run id fails the conditional write and gets a 409, so a retried request never starts the same work twice.
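
The asynchronous path hands back a run id, so something has to answer "what happened to my request?". Here is a minimal sketch of a status lookup that reads the same run record. The `get_run_status` function and the `InMemoryRunTable` stand-in are my own illustrative names; the table argument is assumed to behave like a boto3 DynamoDB `Table` resource, which returns numbers as `Decimal` and omits the `Item` key when nothing is found.

```python
import json
from decimal import Decimal


def get_run_status(table, run_id):
    """Build an API Gateway style response describing one run."""
    result = table.get_item(Key={"pk": f"RUN#{run_id}"})
    item = result.get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"runId": run_id, "error": "not_found"})}
    body = {
        "runId": run_id,
        "status": item["status"],
        "requiresHumanReview": bool(item.get("requiresHumanReview", False)),
        # DynamoDB numbers arrive as Decimal, which json.dumps cannot serialize.
        "createdAt": int(item["createdAt"]),
    }
    return {"statusCode": 200, "body": json.dumps(body)}


class InMemoryRunTable:
    """Tiny stand-in for a DynamoDB Table, only for local experiments."""

    def __init__(self):
        self._items = {}

    def put_item(self, Item):
        self._items[Item["pk"]] = Item

    def get_item(self, Key):
        item = self._items.get(Key["pk"])
        return {"Item": item} if item is not None else {}


if __name__ == "__main__":
    table = InMemoryRunTable()
    table.put_item(Item={
        "pk": "RUN#demo",
        "status": "RECEIVED",
        "requiresHumanReview": True,
        "createdAt": Decimal(1700000000),
    })
    print(get_run_status(table, "demo"))
    print(get_run_status(table, "unknown"))
```

In production I would pass the real `run_table` instead of the stand-in and expose this behind a separate GET route, so polling never touches the agent.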

What I would measure before calling it production ready

The measurement I care about is not only model quality. I want to know whether the system is safe, explainable, and recoverable when the model or a tool behaves badly.

Tool call volume, failure rate, and retry count.
Sync versus async request split.
Human approval rate, rejection rate, and wait time.
Duplicate handling by run id or idempotency key.
Trace coverage across Lambda, Bedrock, Step Functions, queues, and tools.
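
As one possible way to capture these numbers from Lambda, the sketch below builds a CloudWatch embedded metric format (EMF) record. Printing such a JSON line from a Lambda function lets CloudWatch extract it as a metric without a separate API call. The namespace, metric names, and dimension names are assumptions I picked for this example.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit="Count", dimensions=None):
    """Return one EMF log line (a JSON string) carrying a single metric value."""
    dimensions = dimensions or {}
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set made of every dimension key supplied.
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
    }
    # Dimension values live at the top level next to the metric value.
    record.update(dimensions)
    return json.dumps(record)


if __name__ == "__main__":
    # Hypothetical namespace and metric names.
    print(emf_record("SupportAgent", "ToolCallFailure", 1,
                     dimensions={"Tool": "create_refund_case"}))
    print(emf_record("SupportAgent", "HumanApprovalWaitMs", 5400,
                     unit="Milliseconds", dimensions={"Intent": "REQUEST_REFUND"}))
```

Keeping dimensions low in cardinality (intent or tool name, never run id) matters here, because each distinct dimension combination becomes its own metric.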

My closing recommendation is to design around responsibility. A single agent is fine when one boundary owns the answer. Multi agent systems help when roles and tools must be separated. Synchronous calls are for short answers. Asynchronous workflows are for long or risky work. The human in the middle keeps the automation honest.

What I learnt from building around agents is that autonomy helps only when state, identity, approval, tracing, and rollback all have owners.

#AgenticAI #AWSBedrock #Architecture #MultiAgent #StepFunctions #SoftwareEngineering