20 Agentic Design Patterns (Part 3)

A comprehensive guide to 20 agentic design patterns that separate pros from beginners, based on a Google engineer's 400-page book. Practical patterns you can use today with plain English explanations.

Navigation

This guide is split into 4 parts for better performance:

  • Part 1: Chapters 1-5 - Prompt Chaining, Routing, Parallelization, Reflection, Tool Use
  • Part 2: Chapters 6-10 - Planning, Multi-Agent Collaboration, Memory Management, Learning and Adaptation, Goal Setting and Monitoring
  • Part 3: Chapters 11-15 - Exception Handling and Recovery, Human in the Loop, Knowledge Retrieval (RAG), Inter-Agent Communication, Resource-Aware Optimization
  • Part 4: Chapters 16-20 - Reasoning Techniques, Evaluation and Monitoring, Guardrails and Safety Patterns, Prioritization, Exploration and Discovery

Introduction

This guide originated as a video about agentic systems.

You can help out the author, who broke down the 400-page book published by the Google engineer, here (link not affiliated).


Chapter 11: Exception Handling and Recovery

TLDR: This is just the way you catch errors in your agentic workflows, an agentic pattern that helps catch issues in your other agentic patterns. Essentially, you do something, you add safety checks, and you make the call to those services or tools (or both). Then you assess whether it worked. If it didn't, you catch the error, then assess and classify what kind of error it is.
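
A minimal Python sketch of that classify step; the category names and the exception-to-category mapping are illustrative assumptions, not from the book:

def classify_error(exc: Exception) -> str:
    """Map a caught exception to a category the recovery logic can branch on."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "temporary"   # transient: worth retrying with backoff
    if isinstance(exc, (ValueError, KeyError)):
        return "permanent"   # retrying won't help: use a backup plan
    return "critical"        # unknown failure: save work and alert a human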

When to Use

  • Production environments: Any system requiring high reliability
  • External dependencies: When relying on APIs or services
  • Critical operations: Tasks that must not fail completely
  • Unpredictable inputs: Handling edge cases and anomalies
  • Network operations: Managing connectivity issues
  • Resource constraints: Dealing with limits and quotas

Where It Fits

  • API integrations: Handling service outages and rate limits
  • Data pipelines: Managing corrupt data and processing failures
  • User-facing systems: Maintaining service availability
  • Financial transactions: Ensuring transaction integrity
  • IoT systems: Handling device failures and connectivity issues

How It Works

graph TD
    Start[Try to Do Something] --> Wrap[Add Safety Checks]
    
    Wrap --> Call[Make the Call]
    Call --> External[Call External Service]
    External --> Tool[Use a Tool]
    External --> Service[Use a Service]
    
    Tool --> Result{Did It Work?}
    Service --> Result
    
    Result -->|Success| Process[Use the Result]
    Result -->|Error| Catch[Catch the Error]
    
    Catch --> WhatKind{What Kind of Error?}
    
    WhatKind -->|Temporary| Retry[Try Again]
    WhatKind -->|Permanent| Backup[Use Backup Plan]
    WhatKind -->|Critical| Emergency[Emergency Response]
    
    Retry --> Wait[Wait a Bit]
    Wait --> AddTime[Wait Longer Each Time]
    AddTime --> Count{How Many Tries?}
    
    Count -->|Less Than Max| Call
    Count -->|Too Many| Backup
    
    Backup --> Options{Backup Options}
    
    Options --> Simple[Use Simpler Method]
    Options --> Saved[Use Saved Data]
    Options --> Default[Use Default Answer]
    Options --> Human[Get Human Help]
    
    Simple --> Recover[Start Recovery]
    Saved --> Recover
    Default --> Recover
    Human --> Recover
    
    Emergency --> SaveWork[Save Current Work]
    SaveWork --> Alert[Alert the Team]
    
    Alert --> Safety{Is It Safe to Continue?}
    
    Safety -->|Over Limit| Stop[Emergency Stop]
    Safety -->|OK| Resume[Pick Up Where We Left Off]
    
    Recover --> Record[Record What Happened]
    Resume --> Record
    Stop --> Record
    
    Record --> Track[Track Error Patterns]
    Track --> Learn[Learn From Errors]
    
    Learn --> Improve[Improve for Next Time]
    Process --> Success[Task Completed]
    Improve --> End[Continue Working]
    Success --> End

    style Start fill:#6366f1
    style WhatKind fill:#3E92CC
    style Options fill:#3E92CC
    style Safety fill:#D8315B
    style End fill:#10b981
    style Emergency fill:#D8315B

How does exception handling work?
Add safety checks, make the call to services or tools. Catch errors, classify what kind of error it is
What kinds of errors?
Temporary (retry with exponential backoff), permanent (use backup plan), critical (emergency response)
What's exponential backoff?
Wait 1 minute, then 2 minutes, then 4 minutes. Cap it so you don't retry forever if it's actually permanent
What backup options?
Use simpler method, saved data, default answer, or get human help
What about critical errors?
Save current work, alert team, check if safe to continue. If not safe, emergency stop. Otherwise resume
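
Putting the retry logic together, here is a minimal Python sketch of retry-with-backoff plus a fallback; call_service and fallback_answer are hypothetical callables and the timings are illustrative:

import time

def call_with_recovery(call_service, fallback_answer, max_tries=4):
    delay = 60  # start at 1 minute, as above
    for attempt in range(1, max_tries + 1):
        try:
            return call_service()                # make the call
        except (TimeoutError, ConnectionError):  # temporary error
            if attempt == max_tries:
                break                            # cap reached: stop retrying
            time.sleep(delay)
            delay *= 2                           # backoff: 1, 2, 4 minutes...
        except ValueError:                       # permanent error
            break                                # retrying won't help
    return fallback_answer()  # backup plan: simpler method, saved data, or default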

Pros

  • Reliability: System continues operating despite failures
  • Graceful degradation: Provides partial functionality when full service unavailable
  • Self-healing: Automatic recovery from transient issues
  • User experience: Minimizes disruption to users
  • Debugging support: Comprehensive error logging
  • Learning capability: Improves handling over time
  • State preservation: Can resume after interruptions

Cons

  • Complexity increase: Error handling adds code complexity
  • Performance overhead: Try/catch and retries add latency
  • False positives: May retry when unnecessary
  • Resource consumption: Retries and fallbacks use resources
  • Cascading failures: Poor handling can worsen problems
  • Testing difficulty: Hard to test all failure scenarios
  • Maintenance burden: Error handling code needs updates

Real-World Examples

Payment Processing System

  • Retry failed transactions with backoff
  • Fallback to alternative payment gateways
  • Save transaction state for manual review
  • Notify finance team of critical failures
  • Automatic refund on persistent failures

Data Integration Pipeline

  • Handle malformed data gracefully
  • Retry failed API calls with jitter
  • Use cached data when services unavailable
  • Checkpoint progress for resume capability
  • Alert on data quality issues

Chatbot Customer Service

  • Fallback to simpler responses on errors
  • Escalate to human agents when stuck
  • Save conversation state for handoff
  • Retry knowledge base queries
  • Default to FAQ responses

Content Delivery Network

  • Retry failed origin fetches
  • Serve stale content when origin down
  • Route to backup servers
  • Implement circuit breakers
  • Geographic failover strategies

Machine Learning Pipeline

  • Handle model loading failures
  • Fallback to simpler models
  • Retry failed predictions
  • Cache frequent predictions
  • Graceful degradation of features

IoT Device Management

  • Retry failed device commands
  • Queue commands for offline devices
  • Use last known state as fallback
  • Implement watchdog timers
  • Automatic device reboot protocols

Chapter 12: Human in the Loop

TLDR: Adding a human in the loop where the risk ranges from low to high depending on the situation, and most importantly for edge cases. You have some form of agent processing and a decision point, and one of those decisions could be that a review is needed or that you need to step in and intervene. A good tactical example: imagine you're using an agentic browser or agent mode in ChatGPT. At some point it will realize it needs you to step in and add your credentials to log in to your email, to Upwork, or to whatever service it is.
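
A toy sketch of that "step in and log in" moment; the handle_login_wall helper and its behavior are hypothetical, just to show the pause-and-resume shape:

from getpass import getpass

def handle_login_wall(service: str) -> dict:
    """Pause the agent, collect credentials from the human, then resume."""
    print(f"Agent paused: please log in to {service} to continue.")
    user = input(f"{service} username: ")
    password = getpass(f"{service} password: ")  # not echoed to the terminal
    return {"user": user, "password": password}  # agent resumes with the session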

When to Use

  • High-stakes decisions: When errors have significant consequences
  • Regulatory compliance: Required human oversight for legal reasons
  • Quality assurance: Ensuring output meets standards
  • Edge cases: Handling unusual or ambiguous situations
  • Training data generation: Using human feedback to improve
  • Trust building: Gradual automation with human validation

Where It Fits

  • Content moderation: Reviewing sensitive or borderline content
  • Medical diagnosis: Physician verification of AI recommendations
  • Financial approvals: Human authorization for large transactions
  • Legal document review: Attorney oversight of contracts
  • Hiring decisions: Human review of AI-screened candidates

How It Works

graph TD
    Start[Agent Processing] --> Identify[Identify Decision Points]
    
    Identify --> Gates{Decision Gates}
    
    Gates --> Approve[Approval Required]
    Gates --> Review[Review Needed]
    Gates --> Edit[Editing Checkpoint]
    Gates --> Complex[Complex Case]
    
    Approve --> Queue[Add to Review Queue]
    Review --> Queue
    Edit --> Queue
    Complex --> Queue
    
    Queue --> Batch[Batch Similar Items]
    Batch --> Priority[Prioritize by Urgency]
    
    Priority --> UI[Present in UI]
    UI --> Context[Show Full Context]
    Context --> Diff[Display Differences]
    Diff --> SLA[Show SLA Timer]
    
    SLA --> Human{Human Decision}
    
    Human -->|Approve| Accept[Accept Agent Output]
    Human -->|Deny| Reject[Reject with Reason]
    Human -->|Edit| Modify[Human Edits Content]
    Human -->|Takeover| Manual[Full Manual Control]
    
    Accept --> Continue[Continue Workflow]
    Reject --> Learn1[Capture Rejection Pattern]
    Modify --> Learn2[Record Edit Changes]
    Manual --> Learn3[Log Takeover Reason]
    
    Learn1 --> Update[Update Agent Training]
    Learn2 --> Update
    Learn3 --> Update
    
    Update --> Improve[Improve Future Decisions]
    
    Continue --> Track[Track Decision Metrics]
    Improve --> Track
    
    Track --> Fatigue{Monitor Fatigue}
    
    Fatigue -->|High| Reduce[Reduce Human Load]
    Fatigue -->|Normal| Maintain[Maintain Current Flow]
    
    Reduce --> Automate[Increase Automation]
    Maintain --> Report[Generate Reports]
    Automate --> Report
    
    Report --> End[Process Complete]

    style Start fill:#6366f1
    style Gates fill:#3E92CC
    style Human fill:#a855f7
    style Fatigue fill:#3E92CC
    style End fill:#10b981
    style Reject fill:#D8315B

How does human in the loop work?
Agent identifies decision points - approval needed, review needed, editing checkpoint, complex case
Then what?
Add to review queue, batch similar items, prioritize by urgency. Present in UI with full context, show differences, display SLA timer
What can humans do?
Approve, deny with reason, edit content, or take full manual control
How does it learn?
Capture rejection patterns, record edit changes, log takeover reasons. Update agent training to improve future decisions
What about human fatigue?
Monitor fatigue levels. If high, reduce human load by increasing automation. Track decision metrics and generate reports
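
A rough Python sketch of the decision gate and review queue; the ReviewItem fields and the auto-approve threshold are illustrative assumptions:

import itertools
import queue
from dataclasses import dataclass

@dataclass
class ReviewItem:
    content: str
    risk: float   # 0.0 = routine, 1.0 = high stakes
    reason: str   # why the agent flagged it

_order = itertools.count()  # tiebreaker so equal-risk items stay comparable
review_queue: queue.PriorityQueue = queue.PriorityQueue()

def gate(item: ReviewItem, auto_approve_below: float = 0.2) -> str:
    """Pass low-risk output through; queue everything else for a human."""
    if item.risk < auto_approve_below:
        return "auto-approved"  # agent continues without a human
    # Negate risk so the riskiest items surface first (PriorityQueue is min-first).
    review_queue.put((-item.risk, next(_order), item))
    return "queued for human review"

def next_for_review() -> ReviewItem:
    _, _, item = review_queue.get_nowait()  # raises queue.Empty if nothing waits
    return item  # present with full context; record approve/deny/edit for training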

Pros

  • Quality assurance: Human judgment catches AI errors
  • Compliance: Meets regulatory requirements
  • Learning source: Human feedback improves system
  • Trust: Users confident in human oversight
  • Flexibility: Humans handle edge cases well
  • Accountability: Clear responsibility chain
  • Risk mitigation: Prevents costly mistakes

Cons

  • Scalability limits: Human bandwidth constrains throughput
  • Cost increase: Human reviewers are expensive
  • Latency addition: Waiting for human response delays process
  • Inconsistency: Different humans make different decisions
  • Fatigue effects: Quality degrades with reviewer tiredness
  • Training requirements: Reviewers need domain expertise
  • Availability issues: 24/7 coverage is challenging

Real-World Examples

Content Moderation Platform

  • AI flags potentially problematic content
  • Human reviewers make final decisions
  • Complex cases escalated to senior moderators
  • Reviewer feedback trains AI models
  • Fatigue monitoring and rotation schedules

Loan Approval System

  • AI assesses credit risk
  • Human reviews borderline applications
  • Large loans require manual approval
  • Explanations provided for denials
  • Audit trail for compliance

Medical Imaging Analysis

  • AI detects potential abnormalities
  • Radiologist confirms diagnoses
  • Critical findings prioritized for review
  • Second opinions for complex cases
  • Continuous learning from corrections

Resume Screening

  • AI filters initial applications
  • HR reviews shortlisted candidates
  • Diversity checks by humans
  • Feedback improves screening algorithms
  • Final interviews always human-led

Translation Quality Control

  • AI performs initial translation
  • Human linguists review and edit
  • Cultural sensitivity checks
  • Technical terminology verification
  • Style consistency enforcement

Autonomous Vehicle Monitoring

  • AI handles normal driving
  • Remote operators handle edge cases
  • Safety driver takeover capability
  • Incident review and analysis
  • Continuous improvement from interventions

Chapter 13: Knowledge Retrieval (RAG)

TLDR: Indexing documents by parsing, chunking, and creating searchable embeddings. Literally RAG. It's like having a librarian who categorizes and indexes a collection of information and systems. This one is pretty straightforward: you have a user query and some sources that you've ingested. You've parsed those documents, categorized them, and embedded them, which in plain English means you take words, turn them into vectors, and store those vectors in a library. When you ask a question, you match the vector of the question against the vectors in your library and take the closest match.
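
A toy sketch of that closest-match idea; a real system gets its vectors from an embedding model, whereas these three-number vectors are made up:

import math

library = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
}

def cosine(a, b):
    """Similarity between two vectors: 1.0 means pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

question_vec = [0.85, 0.15, 0.05]  # pretend this came from an embedding model
best = max(library, key=lambda doc: cosine(question_vec, library[doc]))
print(best)  # -> "refund policy", the closest match in the library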

When to Use

  • Dynamic knowledge needs: Accessing up-to-date information
  • Large document collections: Querying extensive knowledge bases
  • Domain-specific applications: Specialized knowledge integration
  • Factual accuracy requirements: Grounding responses in sources
  • Citation requirements: Providing verifiable references
  • Reducing hallucinations: Ensuring factual responses

Where It Fits

  • Enterprise search: Internal document retrieval systems
  • Customer support: Knowledge base querying
  • Research assistants: Academic paper retrieval
  • Legal research: Case law and statute searching
  • Technical documentation: API and product documentation access

How It Works

graph TD
    Start[Documents to Search] --> Read[Read Documents]
    
    Read --> Parse[Extract the Text]
    Parse --> GetInfo[Get Document Info]
    GetInfo --> AddTags[Add Tags and Labels]
    
    AddTags --> Split{How to Split Text?}
    
    Split --> Fixed[Equal Size Chunks]
    Split --> Smart[Natural Breaks]
    Split --> Context[Keep Related Parts Together]
    
    Fixed --> Process[Process Each Chunk]
    Smart --> Process
    Context --> Process
    
    Process --> Convert[Convert to Searchable Format]
    Convert --> Store[Store in Search Database]
    
    Store --> Ready[System Ready to Search]
    
    Ready --> Question[User Asks Question]
    Question --> Improve[Make Question Better]
    
    Improve --> Expand[Add Related Words]
    Expand --> Search[Search Database]
    
    Search --> Find[Find Matching Chunks]
    Find --> Filter[Remove Irrelevant Ones]
    
    Filter --> Rank{Rank by Relevance}
    
    Rank --> Score[Give Each a Score]
    Score --> Sort[Sort Best to Worst]
    Sort --> Pick[Pick Top Matches]
    
    Pick --> Verify[Check Sources are Good]
    Verify --> Use[Use Sources for Answer]
    
    Use --> Generate[Create Answer]
    Generate --> Cite[Add Source References]
    
    Cite --> Quality{Is Answer Good?}
    
    Quality -->|Yes| Deliver[Give Answer to User]
    Quality -->|No| Redo[Try Different Search]
    
    Redo --> Adjust[Change Search Settings]
    Adjust --> Search
    
    Deliver --> Track[Track How Well It Worked]
    Track --> Measure[Measure Success]
    
    Measure --> Accuracy[How Accurate?]
    Measure --> Coverage[How Complete?]
    
    Accuracy --> Improve_System[Make System Better]
    Coverage --> Improve_System
    
    Improve_System --> End[Search Complete]

    style Start fill:#6366f1
    style Split fill:#3E92CC
    style Rank fill:#3E92CC
    style Quality fill:#a855f7
    style End fill:#10b981
    style Redo fill:#D8315B

How does RAG work?
Parse documents, chunk them (fixed size, natural breaks, or context-aware), convert to embeddings (vectors), store in search database
What happens when user asks question?
Improve question, expand with related words, search database. Find matching chunks, filter irrelevant ones, rank by relevance
How do you rank?
Score each match, sort best to worst, pick top K matches (usually 5-10). More matches = more context but also more risk of hallucination
Then what?
Check if sources are good, use them to generate answer, add citations. If answer quality is bad, try different search with adjusted settings
How do you improve?
Track accuracy and coverage, measure success, optimize the system based on results
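
A minimal retrieval sketch along those lines; embed and similarity are hypothetical callables (an embedding API and something like the cosine function above), and the chunk size and k are illustrative defaults:

def chunk(text: str, size: int = 500) -> list[str]:
    """Fixed-size chunks; a 'natural breaks' strategy would split on paragraphs."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def index(docs: list[str], embed) -> list[tuple[str, list[float]]]:
    """Parse and embed every chunk once, up front."""
    return [(c, embed(c)) for doc in docs for c in chunk(doc)]

def retrieve(question: str, store, embed, similarity, k: int = 5):
    """Rank stored chunks against the question vector and keep the top k."""
    q = embed(question)
    ranked = sorted(store, key=lambda pair: similarity(q, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]  # top-k chunks go into the prompt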

Pros

  • Accuracy: Responses grounded in real sources
  • Verifiability: Citations enable fact-checking
  • Scalability: Handle vast document collections
  • Currency: Access to latest information
  • Domain expertise: Specialized knowledge integration
  • Reduced hallucination: Less fabrication of facts
  • Flexibility: Easy to update knowledge base

Cons

  • Infrastructure needs: Requires vector databases and storage
  • Processing overhead: Embedding and indexing costs
  • Retrieval quality: Dependent on chunking and matching
  • Context limitations: Retrieved chunks may lack context
  • Latency: Additional retrieval step adds delay
  • Maintenance: Knowledge base needs regular updates
  • Relevance challenges: May retrieve irrelevant information

Real-World Examples

Enterprise Knowledge Management

  • Index company policies and procedures
  • Retrieve relevant HR guidelines
  • Search technical documentation
  • Access historical project data
  • Provide sourced answers to employees

Medical Information System

  • Index medical literature
  • Retrieve treatment guidelines
  • Search drug interactions
  • Access clinical trials data
  • Provide evidence-based recommendations

Academic Research Assistant

  • Index research papers
  • Retrieve relevant studies
  • Search across disciplines
  • Find citation networks
  • Generate literature reviews

Technical Support System

  • Index product documentation
  • Retrieve troubleshooting guides
  • Search error code databases
  • Access configuration examples
  • Provide solution steps with references

News Aggregation Service

  • Index news articles in real-time
  • Retrieve relevant coverage
  • Search historical archives
  • Find related stories
  • Generate summaries with sources

Chapter 14: Inter-Agent Communication

TLDR: Agents communicate through a structured messaging system with defined protocols. Messages include IDs for tracking, expiration times, and security checks. It's like an office email system with read receipts, security clearances, and spam filters that prevent reply-all disasters. This is where you have language models talking to other language models. From a system perspective, you can have multiple AI agents speak to one another, and you have to decide how they should communicate. One option is that they have one boss, a single agent that manages all the others, which is sometimes really helpful because you have a single point that everything reports to (and a single point of failure). The other is that everyone is equal and has a say at the table: a pure agentic democracy, which sounds great but in practice is really hard to dial in, because you're always dealing with the risk of hallucination and misfiring.
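
A toy contrast of the two topologies; the agent callables here are hypothetical stand-ins for model calls:

def orchestrate(boss, workers, task):
    """One boss splits the task and assigns each piece to a worker."""
    plan = boss(task)  # e.g. returns one sub-task per worker
    return [worker(step) for worker, step in zip(workers, plan)]

def democracy(agents, task):
    """Everyone answers; the majority wins (and can still misfire together)."""
    votes = [agent(task) for agent in agents]
    return max(set(votes), key=votes.count)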

When to Use

  • Complex workflows: Tasks requiring multiple specialized agents
  • Modular systems: Building composable agent architectures
  • Distributed processing: Agents running in different locations
  • Scalable architectures: Systems that need to grow
  • Collaborative tasks: Agents working together on problems
  • Service-oriented design: Agents as microservices

Where It Fits

  • Enterprise automation: Coordinating business process agents
  • Research systems: Agents collaborating on analysis
  • Content production: Pipeline of content creation agents
  • Trading systems: Agents coordinating financial decisions
  • Smart city systems: IoT and service agents communicating

How It Works

graph TD
    Start[Multiple AI Agents Need to Talk] --> Choose{How Should They Communicate?}
    
    Choose -->|One Boss| Manager[One Agent Manages Others]
    Choose -->|Everyone Equal| Direct[Agents Talk Directly]
    Choose -->|Post Office| Mailbox[Central Message System]
    
    Manager --> Setup[Set Up Communication Rules]
    Direct --> Setup
    Mailbox --> Setup
    
    Setup --> Rules[Message Rules]
    Rules --> Track[Tracking Number for Each Message]
    Rules --> Expire[Messages Expire After Time Limit]
    Rules --> Important[Mark Important Messages]
    
    Track --> Check{Check Who Can Talk}
    Expire --> Check
    Important --> Check
    
    Check --> Verify[Verify Agent Identity]
    Verify --> Permission[Check What They Can Do]
    Permission --> Allow[Allow Communication]
    
    Allow --> Send[Send Message]
    Send --> Deliver[Deliver to Right Agent]
    
    Deliver --> Receive[Agent Gets Message]
    Receive --> Process[Process Message]
    
    Process --> Reply{Need to Reply?}
    
    Reply -->|Yes| Answer[Send Answer Back]
    Reply -->|No| Log[Record Message Received]
    
    Answer --> Watch[Monitor Conversation]
    Log --> Watch
    
    Watch --> Problems{Any Problems?}
    
    Problems -->|Endless Loop| Stop[Stop the Loop]
    Problems -->|Stuck| Fix[Unstick the Agents]
    Problems -->|Too Long| Timeout[Cancel Old Messages]
    Problems -->|All Good| Continue[Keep Going]
    
    Stop --> Alert[Alert Human]
    Fix --> Alert
    Timeout --> Alert
    Continue --> Record[Save Conversation History]
    
    Alert --> Recover[Fix the Problem]
    Record --> Report[Create Activity Report]
    
    Recover --> End[Communication Complete]
    Report --> End

    style Start fill:#6366f1
    style Choose fill:#3E92CC
    style Check fill:#3E92CC
    style Problems fill:#a855f7
    style End fill:#10b981
    style Stop fill:#D8315B
    style Fix fill:#D8315B

How do agents communicate?
Choose communication pattern - one boss managing others, everyone equal (direct), or central message system (mailbox)
What are message rules?
Tracking number for each message, expiration times, mark important messages. Like email with read receipts
How do you control who can talk?
Verify agent identity, check permissions, allow communication. Send message, deliver to right agent
What if something goes wrong?
Monitor for endless loops, stuck agents, messages too long. Stop loops, unstick agents, cancel old messages, alert human if needed
Is this practical?
It makes a good YouTube video but not a really good production system. It's very complex with lots of debugging; at the enterprise level it makes sense if you have tons of resources and engineers
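
A minimal sketch of the message rules above (tracking ID, expiry, permission check, loop guard); the field names and limits are illustrative assumptions:

import time
import uuid
from dataclasses import dataclass, field

MAX_HOPS = 8  # cut off endless agent-to-agent loops

@dataclass
class Message:
    sender: str
    recipient: str
    body: str
    msg_id: str = field(default_factory=lambda: uuid.uuid4().hex)         # tracking number
    expires_at: float = field(default_factory=lambda: time.time() + 300)  # 5-minute TTL
    hops: int = 0

def deliver(msg: Message, allowed: set[tuple[str, str]]) -> bool:
    """Apply the rules before handing the message to the recipient's inbox."""
    if time.time() > msg.expires_at:
        return False  # cancel old messages
    if msg.hops >= MAX_HOPS:
        return False  # stop the loop and alert a human
    if (msg.sender, msg.recipient) not in allowed:
        return False  # permission check: who is allowed to talk to whom
    msg.hops += 1
    return True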

Pros

  • Modularity: Clear separation of agent responsibilities
  • Scalability: Easy to add new agents to the system
  • Flexibility: Different communication patterns available
  • Fault isolation: Agent failures don't crash system
  • Reusability: Agents can be reused in different workflows
  • Debugging support: Message tracing aids troubleshooting
  • Parallel processing: Agents can work simultaneously

Cons

  • Complexity overhead: Communication protocols add complexity
  • Latency accumulation: Message passing adds delays
  • Coordination challenges: Managing agent interactions
  • Debugging difficulty: Tracing distributed conversations
  • State management: Maintaining consistency across agents
  • Network dependencies: Vulnerable to communication failures
  • Security concerns: Inter-agent authentication needed

Real-World Examples

E-commerce Order Processing

  • Inventory Agent checks stock availability
  • Pricing Agent calculates total costs
  • Payment Agent processes transactions
  • Shipping Agent arranges delivery
  • Notification Agent updates customer
  • Orchestrator coordinates entire flow

News Production Pipeline

  • Crawler Agent gathers news sources
  • Fact-Check Agent verifies information
  • Writer Agent creates articles
  • Editor Agent reviews content
  • Publisher Agent posts to CMS
  • Analytics Agent tracks performance

Financial Analysis Platform

  • Data Agent collects market information
  • Technical Agent performs chart analysis
  • Fundamental Agent analyzes financials
  • Risk Agent assesses portfolio exposure
  • Report Agent generates recommendations
  • Compliance Agent ensures regulatory compliance

Smart Manufacturing System

  • Sensor Agents monitor equipment
  • Quality Agents check production
  • Maintenance Agents schedule repairs
  • Inventory Agents manage supplies
  • Planning Agents optimize schedules
  • Control Agent coordinates operations

Healthcare Coordination

  • Triage Agent assesses symptoms
  • Diagnostic Agent suggests tests
  • Specialist Agents provide expertise
  • Treatment Agent recommends therapy
  • Pharmacy Agent manages medications
  • Scheduler Agent books appointments

Research Collaboration Platform

  • Literature Agent searches papers
  • Data Agent manages datasets
  • Analysis Agent runs experiments
  • Visualization Agent creates charts
  • Writing Agent drafts reports
  • Review Agent checks quality

Chapter 15: Resource-Aware Optimization

TLDR: Analyzing a task's complexity and then routing it to appropriate resources. Simple tasks use cheap, fast models; complex tasks use powerful but expensive models. Think of GPT-5, where there was a huge uproar because we lost all of our models: instead we got quick thinking, standard thinking, hard thinking, or professional thinking, and each of those would route your request in ChatGPT to the model it thought was best suited for that particular outcome. The analogy here is a playful one: choosing between walking, a bus, or a taxi depending on the distance, the urgency, and the budget.

When to Use

  • Cost-sensitive operations: When managing API or compute costs
  • High-volume processing: Optimizing large-scale operations
  • Variable workloads: Different tasks need different resources
  • Budget constraints: Operating within financial limits
  • Performance requirements: Balancing speed vs cost
  • Multi-tenant systems: Fair resource allocation across users

Where It Fits

  • SaaS platforms: Managing per-customer resource usage
  • Batch processing: Optimizing large data processing jobs
  • Real-time systems: Balancing latency and cost
  • Development environments: Using cheaper models for testing
  • Production systems: Optimizing operational costs

How It Works

graph TD
    Start[Task Request] --> Analyze[Analyze Complexity]
    
    Analyze --> Budget{Set Budgets}
    
    Budget --> Token[Token Limits]
    Budget --> Time[Time Constraints]
    Budget --> Cost[Money Budget]
    
    Token --> Router[Router Agent]
    Time --> Router
    Cost --> Router
    
    Router --> Classify{Classify Complexity}
    
    Classify -->|Simple| Cheap[Use Small Model]
    Classify -->|Medium| Standard[Use Standard Model]
    Classify -->|Complex| Premium[Use Advanced Model]
    Classify -->|Unknown| Test[Run Quick Test]
    
    Test --> Confidence{Check Confidence}
    
    Confidence -->|Low| Escalate[Escalate to Better Model]
    Confidence -->|High| Proceed[Continue with Current]
    
    Cheap --> Execute[Execute Task]
    Standard --> Execute
    Premium --> Execute
    Escalate --> Execute
    Proceed --> Execute
    
    Execute --> Monitor[Monitor Resources]
    
    Monitor --> Track{Track Usage}
    
    Track --> Tokens[Token Count]
    Track --> Latency[Response Time]
    Track --> Costs[API Costs]
    
    Tokens --> Check{Within Limits?}
    Latency --> Check
    Costs --> Check
    
    Check -->|Yes| Continue[Continue Processing]
    Check -->|No| Optimize[Optimization Needed]
    
    Optimize --> Prune[Prune Context]
    Optimize --> Cache[Use Cached Results]
    Optimize --> Downgrade[Switch to Cheaper Model]
    
    Prune --> Retry[Retry Operation]
    Cache --> Retry
    Downgrade --> Retry
    
    Continue --> Complete[Task Complete]
    Retry --> Monitor
    
    Complete --> Measure[Measure Quality/Cost]
    Measure --> Delta[Calculate Delta]
    
    Delta --> Tune[Tune Thresholds]
    Tune --> Learn[Update Router Logic]
    
    Learn --> Report[Generate Report]
    Report --> End[Optimized Execution]

    style Start fill:#6366f1
    style Classify fill:#3E92CC
    style Check fill:#3E92CC
    style Delta fill:#a855f7
    style End fill:#10b981
    style Optimize fill:#D8315B

How does resource-aware optimization work?
Analyze task complexity, set budgets (token limits, time, cost). Router agent classifies complexity
How does it route?
Simple → small model, medium → standard model, complex → advanced model. Unknown → run quick test, check confidence
What if you go over budget?
Monitor token count, response time, API costs. If over limits, optimize: prune context, use cached results, or switch to cheaper model
How do you improve?
Measure quality vs cost, calculate the delta, tune thresholds, update router logic. That's what all the uproar around GPT-5 was about: routing as many requests as possible to the cheapest model while still charging you $20/month
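
A toy router sketch along these lines; the word-count heuristic, the cost numbers, and the model names are placeholders, not real model IDs:

MODELS = {"simple": "small-fast", "medium": "standard", "complex": "large-expensive"}

def classify(task: str) -> str:
    """Real routers use a classifier model; word count is a stand-in heuristic."""
    words = len(task.split())
    if words < 20:
        return "simple"
    if words < 200:
        return "medium"
    return "complex"

def route(task: str, budget_left: float, premium_cost: float = 0.10) -> str:
    """Pick a model tier, downgrading when the budget can't cover premium."""
    tier = classify(task)
    if tier == "complex" and budget_left < premium_cost:
        tier = "medium"  # over budget: switch to a cheaper model
    return MODELS[tier]

print(route("Fix this typo in my sentence.", budget_left=0.05))  # -> small-fast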

Pros

  • Cost reduction: Significant savings on API and compute costs
  • Performance optimization: Right-sized resources for each task
  • Scalability: Efficient resource use enables growth
  • Flexibility: Dynamic adjustment to workload changes
  • Budget control: Predictable operational costs
  • Quality preservation: Maintains output quality where needed
  • Automatic optimization: Self-tuning based on patterns

Cons

  • Complexity increase: Resource management adds overhead
  • Quality variations: Different models produce different results
  • Routing overhead: Classification step adds latency
  • Monitoring requirements: Need comprehensive tracking
  • Tuning challenges: Finding optimal thresholds takes time
  • Cache management: Maintaining cache coherency
  • User experience: Inconsistent response times

Real-World Examples

Customer Support Platform

  • Simple FAQs use lightweight models
  • Complex issues use advanced models
  • Cache common question responses
  • Prioritize premium customers
  • Track cost per ticket resolution

Content Generation Service

  • Short social posts use fast models
  • Long articles use quality models
  • Reuse templates for common requests
  • Batch similar requests together
  • Monitor cost per content piece

Code Assistant Tool

  • Syntax fixes use simple models
  • Architecture design uses advanced models
  • Cache common code patterns
  • Prioritize based on project importance
  • Track cost per developer action

Translation Platform

  • Common languages use basic models
  • Rare languages use specialized models
  • Cache frequent translations
  • Batch document processing
  • Optimize cost per word translated

Data Analysis System

  • Simple aggregations use basic compute
  • Complex ML uses premium resources
  • Cache intermediate results
  • Schedule heavy jobs off-peak
  • Monitor cost per analysis

Educational Platform

  • Basic Q&A uses lightweight models
  • Complex tutoring uses advanced models
  • Cache common explanations
  • Allocate resources by subscription tier
  • Track cost per student interaction