The cheapest token isn't the one you negotiate down — it's the one you route to a smaller model that does the step just as well.
From price-per-token to cost-per-task: the metric that matters for agents
When teams ask me how to reduce Claude Opus cost, they almost always open with the per-token rate. That's the wrong unit. The biggest cost lever in model routing isn't the sticker price; it's which model handles which step of the task. A single agentic task isn't one call; it's a loop of classification, retrieval, tool selection, execution, and (sometimes) hard reasoning. Each of those steps has a very different intelligence requirement, and pricing them all at the top-tier rate is how budgets blow up.
I think in cost-per-task: the fully-loaded token spend to take one unit of work from input to accepted output, across every model call the agent makes to get there. That framing changes the question from "what's the cheapest model?" to "what's the cheapest model mix that still ships an acceptable result?" The answer is almost never "Opus for everything," and it's almost never "Haiku for everything" either.
This matters more for agents than for one-shot calls because agent loops re-send an accumulating context on every turn. When each turn adds a similar amount of context, cumulative input spend grows roughly quadratically with the number of turns, so a step you misroute to Opus gets paid for again and again as the conversation grows. Routing, meaning deciding the model per step rather than per app, is the single highest-leverage decision in an agentic system's economics.
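To make the growth concrete, here is a minimal sketch of the arithmetic. It assumes a simplified loop in which every turn re-sends the full prior context and adds a fixed number of new tokens, with no caching or compaction. The function name and the token figures are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(base_context: int, added_per_turn: int, turns: int) -> int:
    """Total input tokens billed across a loop that re-sends its whole
    context on every turn (simplified: no caching, no compaction)."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context            # this turn pays for everything sent so far
        context += added_per_turn   # the next turn also carries this turn's additions
    return total


# Illustrative numbers: a 2,000-token prefix that grows by 500 tokens per turn.
for turns in (5, 10, 20):
    print(turns, cumulative_input_tokens(2_000, 500, turns))
```

Under these assumptions, 10 turns bill 42,500 input tokens and 20 turns bill 135,000, so doubling the loop length roughly triples the input you pay for.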
The Claude family price ladder and the 5x output multiplier
Here's the current list-price ladder. A clean Claude Opus, Sonnet, and Haiku cost comparison starts with one structural fact: across every current model, output is priced at exactly 5x input [1]. That ratio is the hinge of every routing decision.
| Model | Input ($/MTok) | Output ($/MTok) | Output ÷ Input |
|---|---|---|---|
| Claude Haiku 4.5 | $1 | $5 | 5x |
| Claude Sonnet 4.6 | $3 | $15 | 5x |
| Claude Opus 4.8 | $5 | $25 | 5x |
| Claude Fable 5 | $10 | $50 | 5x |
Source: [1]
Two things jump out. First, Opus output ($25) costs 5x Opus input ($5), 5x Haiku output ($5), and 25x Haiku input ($1), so a chatty Opus step that emits lots of tokens is one of the most expensive things in your system per unit of work. Second, the ladder spans an order of magnitude: a token generated by Fable 5 costs 10x the same token from Haiku.
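A small worked example shows how much the output side dominates. The rates are the list prices from the table above; the dictionary keys are hypothetical labels for this sketch, not API model identifiers, and the token counts are made up for illustration.

```python
# List prices in USD per million tokens (input, output), from the table above.
# The keys are illustrative labels, not real API model IDs.
PRICES = {
    "haiku-4.5": (1.00, 5.00),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.8": (5.00, 25.00),
    "fable-5": (10.00, 50.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call at list price, ignoring caching and batch."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Same 2,000-token prompt, two different output lengths on Opus.
terse = call_cost("opus-4.8", 2_000, 200)     # short answer
chatty = call_cost("opus-4.8", 2_000, 2_000)  # long answer
print(f"terse=${terse:.3f} chatty=${chatty:.3f}")
```

With an identical prompt, the long answer costs $0.060 against $0.015 for the short one: output length alone quadruples the price of the step.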
There's a historical detail worth internalizing. The prior-generation Opus 4.1 was priced at $15/$75 [1] — three times today's Opus. Current Opus 4.8 lists at $5/$25. If your cost model or your team's mental anchor still assumes "Opus is the expensive one at $75 output," it's stale by 3x, and that stale anchor pushes people to over-route away from Opus when they no longer need to.
Where each model belongs in an agent
The routing heuristic I default to maps the price ladder onto the cognitive load of each step.
- Classify and triage with Haiku. Intent detection, "is this a refund or a complaint," routing a ticket to a queue, deciding which tool to call, extracting a few structured fields — these are high-volume, low-ambiguity steps. They're short on both input and output, and Haiku handles them at quality that holds up against more expensive models on the metrics that matter. This is where most of your call volume lives.
- Execute with Sonnet. Drafting a response, summarizing a retrieved document, transforming data, generating a SQL query from a resolved intent, writing a tool-call payload: this is the workhorse middle. Sonnet 4.6 sits at $3/$15, 40% below Opus 4.8's $5/$25, and is the right default for steps that need real competence but not frontier reasoning. This is where most of your token volume lives.
- Reason with Opus. Multi-step planning, ambiguous root-cause analysis, a hard refactor, synthesizing conflicting sources, anything where a wrong answer is expensive and the path isn't obvious. Reserve Opus 4.8 (and Fable 5 above it) for the genuinely hard reasoning that justifies the rate. This is where most of your value lives — and it should be a minority of your calls.
| Step type | Model | Rate ($/MTok in/out) | Why it fits |
|---|---|---|---|
| Classify / route / extract | Haiku 4.5 | $1 / $5 | High volume, low ambiguity, short I/O |
| Execute / draft / summarize / transform | Sonnet 4.6 | $3 / $15 | Competent workhorse, 40% cheaper than Opus 4.8 |
| Plan / reason / synthesize hard cases | Opus 4.8 | $5 / $25 | Frontier reasoning where errors are costly |
| Exceptional difficulty | Fable 5 | $10 / $50 | Ceiling — use sparingly, by exception |
Source: [1]
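As a rough illustration of the mapping, the sketch below prices one hypothetical task under the step-type routing and under a unified Opus agent. The task shape, the token counts, and the model labels are assumptions chosen for the example; the rates are the list prices cited above.

```python
# USD per million tokens (input, output). Labels are illustrative, not API IDs.
PRICES = {
    "haiku-4.5": (1.00, 5.00),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.8": (5.00, 25.00),
}

# Step type -> model, following the heuristic in this section.
ROUTE = {"classify": "haiku-4.5", "execute": "sonnet-4.6", "reason": "opus-4.8"}

# One hypothetical task: (step_type, input_tokens, output_tokens) per call.
TASK = [
    ("classify", 800, 20),
    ("classify", 800, 20),
    ("execute", 3_000, 600),
    ("execute", 3_000, 600),
    ("reason", 6_000, 1_200),
]


def task_cost(steps, pick_model) -> float:
    """Sum the list-price cost of every call in a task, in USD."""
    total = 0.0
    for step_type, tokens_in, tokens_out in steps:
        in_rate, out_rate = PRICES[pick_model(step_type)]
        total += (tokens_in * in_rate + tokens_out * out_rate) / 1_000_000
    return total


routed = task_cost(TASK, lambda step: ROUTE[step])
all_opus = task_cost(TASK, lambda _step: "opus-4.8")
print(f"routed=${routed:.4f} all_opus=${all_opus:.4f}")
```

For this particular mix the routed task costs about $0.098 against $0.129 on Opus alone. The gap widens as the share of classification calls grows, which is why the step mix matters more than any single rate.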
The mistake I see most is using one model for the whole agent because it's simpler to configure. A unified Opus agent is paying $25 output rates to classify tickets. A unified Haiku agent is failing silently on the small share of tasks that needed real reasoning, then eating retry costs. The point of multi-agent orchestration patterns is that you can let a cheap model handle the loop and escalate the hard sub-tasks — spawning a Haiku subagent for exploration while keeping the expensive model for the work that needs it.
Designing a routing policy: complexity signals, escalation, fallbacks
A routing policy is a decision function: given a step, pick the model. It needs three pieces.
Complexity signals. What does the router look at to decide? Cheap, deterministic signals first: step type (a classification node always starts at Haiku), input length, whether tools are involved, whether prior steps already failed. A common pattern is a Haiku classifier whose only job is to score the incoming task's difficulty, then route accordingly — a few hundred Haiku tokens to avoid a few thousand Opus tokens is a trade that pays for itself immediately.
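A deterministic router over those signals can be very small. The following is a sketch, assuming three tiers and made-up thresholds; the `Step` fields, the tier names, and the bump rules are illustrative choices, not a recommended policy.

```python
from dataclasses import dataclass

TIERS = ["haiku", "sonnet", "opus"]                      # cheapest first
BASE_TIER = {"classify": 0, "execute": 1, "reason": 2}   # starting tier per step type


@dataclass
class Step:
    step_type: str           # "classify", "execute", or "reason"
    input_tokens: int
    prior_failures: int = 0  # how many earlier attempts at this step failed


def choose_tier(step: Step, long_input_threshold: int = 20_000) -> str:
    """Pick a tier from cheap, deterministic signals only."""
    tier = BASE_TIER[step.step_type]
    if step.input_tokens > long_input_threshold:
        tier += 1            # assumption: very long inputs deserve one tier up
    if step.prior_failures > 0:
        tier += 1            # evidence that the cheaper tier already struggled
    return TIERS[min(tier, len(TIERS) - 1)]  # never go past the top tier


print(choose_tier(Step("classify", 500)))                     # haiku
print(choose_tier(Step("classify", 500, prior_failures=1)))   # sonnet
```

A model-scored difficulty signal, such as the Haiku classifier described above, would slot in as one more input to `choose_tier` rather than replacing the deterministic checks.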
Escalation. Start cheap, escalate on evidence. If Haiku's classification confidence is low, or Sonnet's draft fails a validation check, promote that specific step to the next tier — don't promote the whole agent. Escalation should be the exception path, instrumented and counted, not a silent default.
Fallbacks. Distinguish escalation (this step is harder than expected, go up a tier) from fallback (this model errored or was overloaded, retry or reroute). A 529 overloaded response is a fallback case — Haiku is often less loaded than Opus, so a fallback down the ladder can keep you serving when the top tier is saturated [2]. Bound both with a max-escalation count so a pathological task can't ratchet itself to Fable 5 in an infinite loop.
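The distinction between escalation and fallback is easier to see in code. This is a minimal sketch, assuming the caller injects `call_model` (the actual API call) and `is_valid` (the validation check); both names, the `Overloaded` exception, and the tier list are hypothetical stand-ins.

```python
class Overloaded(Exception):
    """Stands in for a 529 overloaded response from the provider."""


TIERS = ["haiku", "sonnet", "opus"]  # cheapest first


def run_step(prompt, call_model, is_valid, start_tier=0, max_escalations=2):
    """Run one step. Returns (output, tier_used, escalation_count).

    Escalation: output failed validation, so move up one tier (bounded).
    Fallback:   the model was overloaded, so reroute one tier down.
    """
    tier = start_tier
    escalations = 0
    while True:
        try:
            output = call_model(TIERS[tier], prompt)
        except Overloaded:
            if tier == 0:
                raise            # nothing cheaper to fall back to
            tier -= 1            # fallback is not counted as an escalation
            continue
        if is_valid(output):
            return output, TIERS[tier], escalations
        if escalations >= max_escalations or tier == len(TIERS) - 1:
            raise RuntimeError("step failed validation at the escalation limit")
        tier += 1
        escalations += 1
```

Because every trip back up the ladder increments `escalations`, a step that keeps bouncing between an overloaded tier and a failing one still terminates once the cap is reached.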
The governance angle here matters: a routing policy is only real if it's enforced in one place — a router module, a gateway, an agent-design rule — rather than scattered across call sites where any engineer can hardcode opus because it "just works." Validate the policy the way you'd validate any other behavior, with the same rigor you'd apply to AI agent evaluation and testing: route a labeled set through it and confirm the cheap tiers hold quality before you trust the savings.
Stacking routing with caching and batch for compounding savings
Routing is the first lever, not the only one. The big wins come from stacking it with the two discounts that apply on top of whichever model you routed to.
Prompt caching. A cache read costs 0.1x the base input rate — roughly 90% off the cached portion [3]. For agent loops that re-send a large stable prefix (system prompt, tool definitions, retrieved context) on every turn, this is enormous: the second turn onward pays a tenth for everything it already sent. The catch is the cache write premium — 1.25x base input for the 5-minute TTL, 2x for the 1-hour TTL [3] — so caching pays off when a prefix is reused enough times within the TTL, which is exactly the shape of an agent loop. For the placement mechanics, LLM token optimization covers where to put breakpoints so a single byte change doesn't silently invalidate the whole prefix.
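The break-even point follows directly from those multipliers. A minimal sketch, assuming the first request writes the cache and every later request within the TTL is a read (the function names are mine):

```python
def cached_cost_ratio(uses: int, write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Cost of sending a prefix `uses` times with caching, relative to
    sending it the same number of times at the plain input rate."""
    cached = write_multiplier + read_multiplier * (uses - 1)  # one write, then reads
    return cached / uses


def break_even_uses(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of uses at which caching is cheaper than not caching."""
    uses = 1
    while write_multiplier + read_multiplier * (uses - 1) >= uses:
        uses += 1
    return uses


print(break_even_uses(1.25))  # 5-minute TTL -> 2
print(break_even_uses(2.0))   # 1-hour TTL   -> 3
```

So the 5-minute cache pays for itself on the second use and the 1-hour cache on the third. By ten uses with the 5-minute TTL, the prefix costs about 21.5% of the uncached price.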
Batch. The Batch API takes 50% off both input and output [4], stacking with routing. A batched Opus 4.8 call runs $2.50/$12.50; batched Sonnet $1.50/$7.50; batched Haiku $0.50/$2.50. For any work that isn't latency-sensitive — overnight enrichment, bulk classification, eval runs — route to the right tier and batch it.
| Lever | Effect | Stacks with routing? |
|---|---|---|
| Routing | Pick the cheapest model per step | — (this is the base lever) |
| Prompt cache read | 0.1x base input (~90% off cached prefix) | Yes |
| Prompt cache write | 1.25x (5-min) / 2x (1-hr) base input | Yes (cost incurred on first write) |
| Batch API | 50% off input and output | Yes |
The compounding is the point. Route a bulk classification job to Haiku (vs. Opus), batch it (50% off), and cache the shared instruction prefix (90% off the repeated input) — each lever multiplies against the others, and the gap versus a naive "Opus, no discounts" baseline becomes very large.
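Here is a rough sketch of how the three levers multiply on a bulk classification job. It assumes the discounts combine multiplicatively as described above and, optimistically, that every request after the first hits the cache. The job shape and function name are illustrative.

```python
def job_cost(requests, prefix_tokens, unique_tokens, output_tokens,
             in_rate, out_rate, batch=False, cache=False,
             write_mult=1.25, read_mult=0.1) -> float:
    """USD cost of a bulk job whose requests share one prompt prefix.

    Rates are USD per million tokens. With cache=True the prefix is
    written once and read on every later request (optimistic assumption).
    """
    if cache:
        prefix_units = write_mult + read_mult * (requests - 1)
    else:
        prefix_units = requests
    input_cost = (prefix_units * prefix_tokens + requests * unique_tokens) * in_rate
    output_cost = requests * output_tokens * out_rate
    discount = 0.5 if batch else 1.0   # Batch API: 50% off input and output
    return discount * (input_cost + output_cost) / 1_000_000


# 10,000 tickets, 2,000-token shared instructions, 200-token ticket, 20-token label.
naive = job_cost(10_000, 2_000, 200, 20, in_rate=5.0, out_rate=25.0)
stacked = job_cost(10_000, 2_000, 200, 20, in_rate=1.0, out_rate=5.0,
                   batch=True, cache=True)
print(f"opus_no_discounts=${naive:.2f} haiku_batch_cache=${stacked:.2f}")
```

For this hypothetical job the naive Opus baseline comes to $115.00 and the routed, batched, cached Haiku run to about $2.50. Real cache hit rates will be lower than the sketch assumes, so treat the second figure as a floor.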
The Opus 4.7+ tokenizer effect: same task, higher token count after an upgrade
Here's a trap that quietly raises cost-per-task without anyone touching the price sheet. Opus 4.7 and later count tokens differently from earlier Opus: the same input text produces a higher token count [5]. The per-token rate is unchanged — but the same prompt counts as more tokens, so the effective cost of an identical task rises after you upgrade.
I've watched a team migrate from an older Opus to 4.7, see their per-task spend jump, and assume they'd been hit with a price increase. They hadn't. The rate held; the token count moved. This is real and specific to the 4.7+ generation, so plan for it:
- Re-baseline token counts on a representative sample after any upgrade into the 4.7+ family — re-run the token-counting endpoint against the new model rather than applying a blanket multiplier from the old tokenizer [5].
- Re-check anything keyed to measured token counts: cost dashboards, budget alerts, rate-limit retry thresholds, and compaction triggers.
- Factor the heavier count into your routing math. The tokenizer change makes the case for routing stronger — every token you move off Opus onto Sonnet or Haiku now avoids a slightly heavier count too.
The flip side: the 1M-token context window on Opus 4.8/4.7 and Sonnet 4.6 is billed at standard per-token rates with no long-context premium [5], so the tokenizer change isn't compounded by a context surcharge.
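Re-baselining can be as simple as counting the same sample twice and comparing. In this sketch `count_old` and `count_new` are hypothetical callables that you would back with the token-counting endpoint for the old and new model respectively; the code itself makes no API calls and assumes nothing about the size of the change.

```python
from statistics import mean
from typing import Callable, Iterable


def rebaseline(samples: Iterable[str],
               count_old: Callable[[str], int],
               count_new: Callable[[str], int]) -> dict:
    """Compare token counts for the same texts under two models.

    Returns the mean and worst-case ratio of new count to old count,
    measured on your own prompts rather than assumed up front.
    """
    ratios = []
    for text in samples:
        old = count_old(text)
        if old > 0:                      # skip empty prompts to avoid dividing by zero
            ratios.append(count_new(text) / old)
    if not ratios:
        raise ValueError("no non-empty samples to compare")
    return {
        "samples": len(ratios),
        "mean_ratio": mean(ratios),
        "max_ratio": max(ratios),
    }
```

Feeding `max_ratio` rather than `mean_ratio` into budget alerts and compaction triggers is the more conservative choice, since those thresholds tend to fail on the worst prompt, not the average one.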
Making the policy stick with model-access control
A routing policy that any engineer can bypass with a one-line model override is not a policy — it's a suggestion. The hardest part of Claude model access control across Opus, Sonnet, and Haiku is organizational, not technical: people default to the smartest model because it's the safe choice for them, and the cost lands on a shared budget.
The fix is to make model access a governed resource rather than a free parameter.
- Gate model selection behind the router. Application code calls a routing function, not `messages.create(model="opus")` directly. The model string is chosen by policy, in one place, reviewable in code review.
- Default down, escalate by exception. New code paths start at Haiku or Sonnet. Reaching for Opus or Fable 5 should require an explicit, logged decision, ideally a config flag that shows up in review, not a quiet default.
- Per-team and per-project allowances. Set which teams and projects can call the top tiers at all, and surface that in your access layer.
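One way these rules could look in a single policy module is sketched below. The team names, the allowance table, and the `select_model` signature are all hypothetical; in a real deployment the allowances would more likely live in your access layer than in a Python dict.

```python
# Hypothetical per-team allowances: which tiers each team may call at all.
ALLOWED_TIERS = {
    "support": {"haiku", "sonnet"},
    "research": {"haiku", "sonnet", "opus"},
}
DEFAULT_TIER = "haiku"        # default down
GOVERNED_TIERS = {"opus"}     # tiers that need an explicit, logged reason


class ModelNotAllowed(Exception):
    """Raised when a team requests a tier outside its allowance."""


def select_model(team, requested_tier=None, *, justification=None, audit_log=None):
    """Single entry point for model choice; call sites never hardcode a model."""
    tier = requested_tier or DEFAULT_TIER
    if tier not in ALLOWED_TIERS.get(team, {DEFAULT_TIER}):
        raise ModelNotAllowed(f"team {team!r} may not use tier {tier!r}")
    if tier in GOVERNED_TIERS:
        if not justification:
            raise ValueError(f"tier {tier!r} requires a justification")
        if audit_log is not None:
            audit_log.append({"team": team, "tier": tier, "why": justification})
    return tier


log = []
print(select_model("support"))                                        # haiku
print(select_model("research", "opus", justification="root-cause analysis",
                   audit_log=log))                                    # opus
```

The function returns a tier name rather than a model string on purpose: mapping tiers to concrete model IDs in one separate place means an upgrade touches one line instead of every call site.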
One channel myth to kill while you're at it: the per-token rate is identical across the direct Anthropic API and the Claude Platform on AWS — there's parity [6]. "Switch to Bedrock to save money" is not a thing; the per-token Claude rates are the same there too [7]. Savings come from routing and the discount levers, not from the billing channel. Governance and attribution are good reasons to run through AWS; a lower per-token rate is not one. This connects to broader AI governance practice: access control, attribution, and policy enforcement are the same muscles whether you're governing cost or risk.
Measuring it: per-task cost telemetry and attribution
You can't govern what you can't see, and the unit you need to see is cost-per-task — not cost-per-token, not the monthly bill. Build telemetry around the task as the atom.
Every model response carries a usage block with input tokens, output tokens, cache-creation tokens, and cache-read tokens broken out separately [3]. Capture all four per call, tag each call with the task ID, the step type, the model used, and the team/project, then aggregate up to the task. That gives you:
- Cost-per-task by route, so you can see whether your Haiku-classify / Sonnet-execute split is actually holding, or whether escalations are quietly dragging everything to Opus.
- Cache effectiveness — if cache-read tokens are near zero across a loop that should be reusing a prefix, a silent invalidator is at work and you're paying full input price on every turn [3].
- Attribution by team and project, so the shared budget becomes a set of owned budgets. When a team sees its own cost-per-task and its own escalation rate, the incentive to default-to-Opus largely disappears on its own.
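A sketch of the aggregation follows. The four usage field names match the usage block described above; the record shape (`task_id`, `model`, `usage`), the model labels, and the choice of the 5-minute write multiplier are assumptions for the example.

```python
from collections import defaultdict

# USD per million tokens (input, output). Labels are illustrative, not API IDs.
PRICES = {
    "haiku-4.5": (1.00, 5.00),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.8": (5.00, 25.00),
}
CACHE_WRITE_MULT = 1.25   # assumes the 5-minute TTL
CACHE_READ_MULT = 0.1


def call_cost(model: str, usage: dict) -> float:
    """USD cost of one call from its four usage counters."""
    in_rate, out_rate = PRICES[model]
    cost = (
        usage.get("input_tokens", 0) * in_rate
        + usage.get("cache_creation_input_tokens", 0) * in_rate * CACHE_WRITE_MULT
        + usage.get("cache_read_input_tokens", 0) * in_rate * CACHE_READ_MULT
        + usage.get("output_tokens", 0) * out_rate
    )
    return cost / 1_000_000


def cost_per_task(records) -> dict:
    """Roll per-call records up to one USD total per task ID."""
    totals = defaultdict(float)
    for record in records:
        totals[record["task_id"]] += call_cost(record["model"], record["usage"])
    return dict(totals)


records = [
    {"task_id": "t1", "model": "haiku-4.5",
     "usage": {"input_tokens": 1_000, "output_tokens": 100}},
    {"task_id": "t1", "model": "sonnet-4.6",
     "usage": {"input_tokens": 500, "cache_read_input_tokens": 10_000,
               "output_tokens": 200}},
]
print(cost_per_task(records))
```

Grouping the same records by step type, model, or team instead of `task_id` gives the other views in the list above without changing the cost function.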
For the full picture of how this maps onto AWS billing and what attribution looks like on a governed account, Claude on Bedrock cost for agentic workloads goes deeper than I can here.
Anti-patterns: over-routing, quality regressions, hidden retry costs
Routing done badly is worse than no routing. The failure modes I watch for:
- Over-routing to the cheap tier. Pushing reasoning-heavy steps onto Haiku to save money, then paying for it in wrong answers. If a step genuinely needs Opus, routing it to Haiku doesn't save money — it defers the cost into retries, escalations, and human cleanup.
- Hidden retry costs. A Haiku step that fails validation and re-runs on Sonnet, which fails and re-runs on Opus, has now paid for three calls to do one step's work — and if it's an agent loop, each retry re-sends the accumulated context. Count retries in your cost-per-task, not just first-attempt calls. A high retry rate means your initial routing is too aggressive and you'd be cheaper starting one tier up.
- Quality regressions that hide until they don't. A model swap that "works in testing" can degrade on the long tail. Gate routing changes behind an eval set, and monitor acceptance and rework rates per route in production — a rising rework rate on a cheap route is the cost leak finding a new place to hide.
- Caching that silently stops working. Interpolating a timestamp or unsorted JSON into a cached prefix invalidates it on every request; you keep paying the write premium and never get the read discount. Verify cache-read tokens are non-zero — don't assume [3].
- Stale price anchors. Routing rules written when Opus listed at $15/$75 over-avoid an Opus that's now $5/$25 [1]. Re-baseline the policy against current rates.
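For the caching failure mode in particular, a cheap guard is to serialize the prefix deterministically and log a fingerprint of it. This is a sketch under the assumption that you assemble the cached prefix from an instruction string plus a JSON-serializable context dict; `build_prefix` and `prefix_fingerprint` are hypothetical helpers, not part of any SDK.

```python
import hashlib
import json


def build_prefix(instructions: str, context: dict) -> str:
    """Assemble a cacheable prefix with a byte-stable serialization.

    sort_keys and fixed separators keep the output identical regardless
    of dict insertion order. Volatile values such as timestamps should
    go after the cached prefix, not inside it.
    """
    context_json = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return instructions + "\n" + context_json


def prefix_fingerprint(prefix: str) -> str:
    """Short hash to log per request; a changing value means the cache is being invalidated."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]


a = build_prefix("Classify the ticket.", {"queues": ["billing", "tech"], "locale": "en"})
b = build_prefix("Classify the ticket.", {"locale": "en", "queues": ["billing", "tech"]})
print(prefix_fingerprint(a) == prefix_fingerprint(b))  # True
```

If the fingerprint varies between requests that should share a prefix, something volatile has leaked into it, which is the same condition that shows up as near-zero cache-read tokens in the usage data.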
The bottom line
The cost of an agentic system is set by its routing policy, not its per-token price. Think in cost-per-task; map the Claude ladder onto the cognitive load of each step — Haiku to classify, Sonnet to execute, Opus to reason, Fable 5 by exception; stack caching and batch on top of whichever model you routed to; account for the Opus 4.7+ tokenizer change when you upgrade; and enforce the policy with real model-access control and per-task telemetry so it survives contact with a team under deadline pressure. Do that and you can optimize Claude spend without losing quality — because you're not making the model dumber, you're stopping the smart model from doing the dumb work.
For teams that want this enforced through their existing AWS controls — IAM-based access policies, per-team attribution, and consolidated billing — ASCENDING, an AWS Premier Consulting Partner and Anthropic partner, documents the pattern in its guide to governing Claude spend on Amazon Bedrock.
Disclosure: Explore Agentic is published by ASCENDING, which builds Jarvis AI on Claude and Amazon Bedrock; we have a commercial interest in the partner-led path described here. The pricing facts above come from Anthropic's and AWS's published documentation regardless.
FAQ
How much can model routing actually save versus using Opus everywhere?
It depends on your step mix, but the structural gap is large. Opus output ($25/MTok) is 5x Haiku output ($5/MTok) and Opus input ($5) is 5x Haiku input ($1), so routing high-volume classification steps from Opus to Haiku cuts those steps' cost by roughly 80%. Most agents spend the bulk of their calls on classification and execution rather than hard reasoning, so moving those tiers down while reserving Opus for the genuinely hard minority typically collapses cost-per-task substantially — and stacking batch and caching widens the gap further.
When should an agent step escalate from Haiku to Sonnet or Opus?
Escalate on evidence, not by default. Good triggers are low classifier confidence, a failed validation or schema check on the output, a tool call that returned an error the model needs to reason about, or a task your difficulty classifier scored as high. Escalate the specific step, not the whole agent, and cap the number of escalations so one task can't ratchet itself up to the most expensive tier in a loop. If a route's escalation rate is consistently high, that step was misrouted and you should start it one tier up.
Why is Opus output so much more expensive than Haiku?
Two multipliers compound. Across all current Claude models, output tokens are priced at 5x input tokens, and Opus sits higher on the model ladder than Haiku to begin with ($5/$25 versus $1/$5). So an Opus step that generates a lot of text pays the top-tier rate and the 5x output multiplier at the same time. That's why a chatty Opus step is often the single most expensive thing in an agent, and why routing verbose generation to a cheaper tier is high-leverage.
Can I stop teams from defaulting every call to the most expensive model?
Yes, and it's mostly an access-control problem rather than a code problem. Route all model selection through a single policy layer instead of letting call sites hardcode a model string, default new code paths to Haiku or Sonnet, and require an explicit, reviewable decision to reach for Opus or Fable 5. Pair that with per-team and per-project allowances in your access layer and per-task cost attribution, so each team sees and owns its own spend. Once cost is attributed, the incentive to default to the smartest model largely disappears.
Do caching and batch discounts apply on top of routing?
Yes — they stack with whichever model you routed to. A prompt cache read costs about 0.1x the base input rate (roughly 90% off the cached prefix), and the Batch API takes 50% off both input and output. So you can route a bulk job to Haiku, batch it for half off, and cache the shared instruction prefix for another large discount on the repeated input, and all three multiply against each other. The cache write does carry a premium (1.25x for 5-minute, 2x for 1-hour), so caching pays off when the prefix is reused within its TTL — which is the normal shape of an agent loop.
Does Bedrock change the per-token price of each Claude model?
No. The per-token Claude rates are identical across the direct Anthropic API and the Claude Platform on AWS — there's parity. Savings come from routing and the discount levers (caching, batch, negotiated terms), not from the billing channel itself. There are good reasons to run on AWS — IAM-based access control, attribution, and consolidated billing through your AWS account — but a lower per-token rate is not one of them.
References
1. Claude pricing: model list prices and the output-is-5x-input ratio; historical Opus 4.1 at $15/$75. Anthropic platform docs: https://platform.claude.com/docs/en/about-claude/pricing
2. Claude API errors: 529 overloaded handling and routing to a less-loaded model. Anthropic platform docs: https://platform.claude.com/docs/en/api/errors
3. Prompt caching: cache read 0.1x base input, cache write 1.25x (5-min) / 2x (1-hr), and the usage block fields for verification. Anthropic platform docs: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
4. Batch processing: 50% discount on input and output. Anthropic platform docs: https://platform.claude.com/docs/en/build-with-claude/batch-processing
5. Claude model migration: Opus 4.7+ token-counting change (more tokens for the same text; re-baseline rather than apply a multiplier) and the 1M context window at standard pricing with no long-context premium. Anthropic platform docs: https://platform.claude.com/docs/en/about-claude/models/migration-guide
6. Claude Platform on AWS: same-day API parity, including per-token rate parity with the direct Anthropic API. Anthropic platform docs: https://platform.claude.com/docs/en/build-with-claude/claude-platform-on-aws
7. Claude on Amazon Bedrock: same Messages API and per-token Claude rates. Anthropic platform docs: https://platform.claude.com/docs/en/build-with-claude/claude-on-amazon-bedrock