Amazon Bedrock pricing is a per-token rate card with three purchase modes: on-demand, where you pay as you go — Claude Sonnet 4.6 costs $3 per million input tokens and $15 per million output — batch, which runs the same models asynchronously at 50% off, and Provisioned Throughput, dedicated capacity rented by the hour. That is the entire menu. Every explainer on AWS Bedrock pricing walks you through the menu. Almost none of them walks you through the check.
Here is the problem: the rate card prices a request, but you ship a task. An agent task is a loop — plan, call a tool, read the result, call the next tool — and on every turn of that loop the model re-reads the entire accumulating conversation. The first time we reconciled an agent team's Bedrock invoice against their own estimate, the invoice came in at 4.8x the estimate. Every individual rate on it was correct. The estimate had priced tokens. The bill had priced re-reads.
This guide gives you the dated 2026 rate table, then does the arithmetic the rate card skips: what a real 12-turn agent task costs, what caching and batch do to that number, the four adjacent line items agent programs never budget, and the private-offer layer where the actual negotiation happens. If you are budgeting an agentic AI program on Bedrock, this is the page to screenshot.
How AWS Bedrock Pricing Works: The Three Modes
Three purchase modes, three workload shapes. Most agent programs end up using two.
On-demand: bursty, interactive agent work
The default. You call the model, you pay per token, there is no commitment and no idle cost. This is the right mode for the defining agent workload — bursty tool-calling traffic that spikes when users are active and drops to zero at 3am. Serverless pricing fits spiky demand; you never pay for a warm instance waiting for a request that isn't coming.
Batch: offline evals, backfills, classification sweeps
Submit a JSONL file of requests to S3, get results back asynchronously — usually within hours, guaranteed within 24 — at 50% off the on-demand rate. Anything that does not need an answer while a human waits belongs here: nightly evaluation suites, document backfills, bulk classification. Half price for tolerating latency you were going to tolerate anyway is the easiest win on the whole bill.
Provisioned Throughput: steady, very-high-volume serving
You rent model units — fixed slabs of tokens-per-minute capacity — by the hour, with no-commitment, 1-month, or 6-month terms, and the hourly rate drops as the term lengthens. The public page still lists legacy examples (a single model unit of Claude Instant ran $39.60 per hour on a 1-month commitment, roughly $29,500 a month); for current-generation Claude models, AWS quotes model-unit pricing through your account team rather than the public page. Honestly: most agent teams never touch this mode. Its shape is steady, predictable, always-on volume — high-traffic chat with flat demand — and it only beats on-demand when your utilization stays high around the clock. If you are weighing per-token pricing against per-instance-hour economics more broadly, that is the Bedrock vs SageMaker decision, not a mode toggle.
The 2026 AWS Bedrock Pricing Table for Claude
Rates below are per 1 million tokens, on-demand, standard commercial regions, verified against the AWS Bedrock pricing page in July 2026. Standard-region Bedrock rates match Anthropic's first-party API list prices — you are not paying an AWS markup on the token itself.
| Model | Input | Output | Batch input | Batch output | Cache write (5-min) | Cache read |
|---|---|---|---|---|---|---|
| Claude Opus 4.8 | $5.00 | $25.00 | $2.50 | $12.50 | $6.25 | $0.50 |
| Claude Opus 4.6 | $5.00 | $25.00 | $2.50 | $12.50 | $6.25 | $0.50 |
| Claude Sonnet 5 (promo) | $2.00 | $10.00 | $1.00 | $5.00 | $2.50 | $0.20 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $1.50 | $7.50 | $3.75 | $0.30 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.50 | $2.50 | $1.25 | $0.10 |
Five footnotes that matter more than the table:
- Sonnet 5's promo price expires August 31, 2026. The $2/$10 launch rate steps up to $3/$15 on September 1 — identical to Sonnet 4.6. And Sonnet 5 ships a new tokenizer that turns the same text into roughly 30% more billable tokens, so the launch discount is smaller than the sticker suggests. Model your September bill at $3/$15, not $2/$10.
- Cache write at the 1-hour TTL is 2x the base input rate (versus 1.25x for the 5-minute default). Cache reads are 0.1x either way.
- Batch coverage is per-model and per-region. AWS lists batch rows model by model; confirm your model has one in your region before you build a pipeline around the discount.
- AWS GovCloud carries a premium — Claude Opus 4.8 runs $6/$30 there.
- Nova and Llama exist and are cheap. Amazon's Nova line and Meta's Llama models rent for a fraction of Claude rates and are worth a look for classification-grade steps. Which Claude model to run per task — and the worked $/task comparisons — is its own decision; we keep that in Haiku vs Sonnet rather than repeating it here.
Where Claude pricing is published now
One source of confusion worth killing: Anthropic's docs now describe two ways to run Claude on AWS. Amazon Bedrock is the AWS-operated service, and its pricing lives on the AWS Bedrock pricing page. Claude Platform on AWS is a newer, Anthropic-operated offering with AWS IAM auth and Marketplace billing, priced at Anthropic's first-party rates. Since standard-region rates match anyway, the pricing math in this guide holds for both — but if you are quoting numbers in a budget review, cite the AWS page for Bedrock workloads.
Why Agents Break the $/1M Intuition
The rate card assumes a request is priced once. Agents violate that assumption on every turn.
The context snowball
The Bedrock inference APIs are stateless. Each turn of an agent loop re-sends everything: the system prompt, the tool definitions, the user's task, and every prior tool call and tool result. Turn 1 sends 6,500 tokens. Turn 2 sends those 6,500 plus turn 1's output and tool result. By turn 12 the model is re-reading eleven turns of history to decide one next action. Input grows linearly per turn; total input billed grows quadratically with turn count. This is not a Bedrock quirk — every stateless LLM API works this way — but Bedrock is where enterprise agent programs feel it first, because that is where the volume is.
A 12-turn task, priced honestly
Take a realistic mid-size agent task on Sonnet 4.6: a 4,000-token system prompt, 2,000 tokens of tool definitions, a 500-token user request. Each turn the model emits about 300 output tokens and gets back a 1,200-token tool result, so context grows 1,500 tokens per turn.
- Input billed across 12 turns: 6,500 + 8,000 + 9,500 + ... + 23,000 = 177,000 input tokens
- Output billed: 12 x 300 = 3,600 output tokens
- Cost: 177,000 x $3/1M + 3,600 x $15/1M = $0.531 + $0.054 = $0.585 per task
The conversation only ever contained 23,000 unique tokens. The bill shows 177,000 — a 7.7x input multiplier, and 4.8x the total cost of the naive pay-each-token-once estimate. Longer loops and fatter tool results push that multiplier well past 10x; we have watched a 30-turn research agent hit 14x. The single most useful budgeting habit on Bedrock is to stop thinking in $/1M tokens and start thinking in $/task. We go deeper on the Claude-specific version of this arithmetic — turn counts, tool-result sizing, per-workload shapes — in Claude on Bedrock: the cost of agentic workloads, and the levers for shrinking what gets re-sent live in token optimization.
Cache Math and Batch Math: The Two Multipliers
Two mechanisms sit directly on top of the rate card, and together they routinely cut an agent bill by two-thirds. Neither requires a conversation with AWS.
Prompt caching, in one paragraph
Bedrock's prompt caching charges 1.25x the input rate to write a prefix into cache (2x for the 1-hour TTL, which went GA on Bedrock in January 2026) and 0.1x to read it back. For the loop above, that means each unique token is written once and every re-send becomes a 90%-off read: 23,000 written at $3.75/1M plus 154,000 read at $0.30/1M puts input at $0.13, and the task lands just under $0.19 — a 68% cut from $0.585, with better latency as a side effect. The break-even math, TTL selection, breakpoint placement, and the silent ways caches fail to hit are a full topic of their own; read the prompt caching guide before you wire it in, because a cache that never hits still bills the write premium.
Batch, for everything offline
The -50% applies cleanly to eval workloads, and eval workloads are bigger than most teams expect. A nightly regression suite of 5,000 cases at 8,000 input / 1,000 output tokens each on Haiku 4.5 costs $65 a night on-demand (40M input tokens at $1, 5M output at $5). In batch mode: $32.50 a night, about $975 a month instead of $1,950. The only cost is latency you did not need. If an agent workload can be expressed as a file of requests — evals, backfills, enrichment — it should never run at on-demand rates.
The Line Items Nobody Budgets
The model tokens are the headline, but four adjacent services show up on real agent-program invoices. None of them will bankrupt you. All of them will surprise whoever signs off on the forecast.
Guardrails
$0.15 per 1,000 text units for content filters and denied topics, where a text unit is up to 1,000 characters — and you pay per evaluation, so screening both the user input and the model output doubles the units. At agent scale this stays small (tens of dollars a month for most programs), but each additional policy type you enable bills separately.
Knowledge Bases
The retrieval line item is really two: embedding tokens (cheap) and the vector store (not always cheap). An OpenSearch Serverless vector store has a floor around $350 a month — two OpenSearch Compute Units at $0.24 per OCU-hour, billed whether you query or not — and common configurations run double that. S3 Vectors cuts that floor by as much as 90% for most agent-scale corpora and is the default we now recommend for new builds; the full decision is covered in S3 Vectors for enterprise RAG.
AgentCore
If your agents run on Bedrock AgentCore rather than your own compute, the runtime bills $0.0895 per vCPU-hour and $0.00945 per GB-hour at per-second granularity, with CPU charged on active use only — I/O wait while the model thinks is free on the CPU meter. Gateway adds $0.005 per 1,000 tool invocations, and Memory bills per event stored and per retrieval. For a program running tens of thousands of sessions a month this typically lands in the low hundreds of dollars — real, but a rounding error next to the tokens.
Cross-region inference
No separate routing fee: you are billed at the rate of the region you call from. The nuance is profile type — geographic profiles (data kept within a boundary) can price about 10% above global profiles for some models. If you have no residency constraint, the global profile is the cheaper and higher-throughput default.
A Worked Monthly Bill
Here is a composite agent program we would consider realistic for a mid-size deployment: 2,000 interactive agent tasks a day on Sonnet 4.6 (the 12-turn shape above, cached), a nightly 5,000-case eval suite on Haiku 4.5 in batch, Guardrails on interactive traffic, a Knowledge Base, and AgentCore hosting.
| Line item | Volume | Basis | Monthly |
|---|---|---|---|
| Interactive agent tasks (Sonnet 4.6, cached) | 60,000 tasks | ~$0.19/task | $11,400 |
| Nightly evals (Haiku 4.5, batch -50%) | 150,000 cases | $32.50/night | $975 |
| Guardrails (input + output screening) | 300,000 text units | $0.15/1K units | $45 |
| Vector store (OpenSearch Serverless floor) | 2 OCUs | $0.24/OCU-hr | $350 |
| AgentCore runtime + gateway | 60,000 sessions | vCPU-hr + GB-hr | $230 |
| Total | ~$13,000 |
Two things to notice. First, the model tokens are 88% of the bill — the "hidden" line items are real but the loop is the story. Second, the caching decision alone is worth more than everything else combined: run those same 60,000 tasks uncached and the first line becomes $35,100. The gap between a $13,000 month and a $37,000 month is not a discount anyone granted you. It is arithmetic you either did or did not do.
How to Budget a Bedrock Agent Program: Six Steps
- Price one task, not one million tokens. Instrument a representative task end to end and capture per-turn input, output, cache reads, and cache writes from the usage metadata. Multiply from measured $/task, never from the rate card.
- Turn on prompt caching and verify it hits. Check cache read counts in the response usage on real traffic. A configured cache that never hits is a 1.25x surcharge, not a discount.
- Route each step to the cheapest capable model. Classification-grade steps do not need Opus. The routing decision framework is in model routing and cost per task.
- Move everything offline to batch. Evals, backfills, enrichment — anything expressible as a file of requests earns -50% for free.
- Budget the adjacent line items. Guardrails, vector store, AgentCore, cross-region profile choice. Small individually; embarrassing collectively when the forecast missed them.
- Take a 12-month volume forecast into a private-offer conversation. Which is the last section, and it only works if you did steps 1 through 5 first — the agent ROI math you bring is the negotiation.
The Discount Layer: How AWS Bedrock Pricing Actually Bends
The on-demand rate card is public and, in our experience, immovable. Nobody is getting $2.40/$12 Sonnet tokens because they asked nicely. What moves is everything wrapped around the rate card: AWS Marketplace private offers and volume commitments negotiated against a credible forecast of committed spend, with terms that improve as the commitment grows. That negotiation rewards exactly the homework this guide describes — measured $/task, a cache strategy already deployed, batch traffic already separated — because a forecast built on optimized consumption is one AWS can price against with confidence.
This is the point we keep repeating to teams mid-evaluation: there is no cheaper Bedrock token. The discount is the stack — caching, routing, batch, then a private offer on the volume that remains. Teams that skip the first three layers and go straight to negotiation commit to inflated volume and lock in their own waste. How the commitment layer is structured, what terms we have seen move, and how it interacts with plan selection are covered in the enterprise Claude discount stack and Claude enterprise pricing paths.
Frequently Asked Questions
How much does AWS Bedrock cost per million tokens?
As of July 2026, Claude on Bedrock runs $5/$25 per million input/output tokens for Opus 4.8 and Opus 4.6, $3/$15 for Sonnet 4.6, and $1/$5 for Haiku 4.5, with Sonnet 5 at a promotional $2/$10 through August 31, 2026 before stepping up to $3/$15. Batch mode halves those rates, cache reads cost 0.1x the input rate, and standard-region prices match Anthropic's first-party API. Always check the AWS pricing page at decision time — the Sonnet 5 promo alone changes the answer mid-2026.
Is Claude more expensive on Bedrock than on the Anthropic API?
Not on the rate card — standard commercial-region Bedrock prices match Anthropic's list prices for the same models. Differences show up at the edges: AWS GovCloud carries a premium (Opus 4.8 is $6/$30 there), geographic cross-region profiles can run about 10% above global ones, and each platform's discount mechanisms differ. The platform choice is about IAM, billing consolidation, and ecosystem, not per-token price.
Why does my Bedrock bill not match my token estimates?
Almost always because agents re-send accumulating context on every loop turn. A 12-turn task whose conversation holds 23,000 unique tokens bills around 177,000 input tokens, a 7.7x multiplier that quadratic context growth makes worse with every additional turn. Estimate in dollars per task from measured traces, not dollars per million tokens from the rate card.
How much does prompt caching save on Bedrock?
Cache reads bill at 0.1x the input rate against a 1.25x one-time write (2x for the 1-hour TTL), so loop-heavy agent workloads typically save 60-70% on total task cost — our worked 12-turn example drops from $0.585 to under $0.19. Savings depend entirely on the cache actually hitting, which requires stable prompt prefixes and correct breakpoint placement. Verify with cache-read counts in the usage metadata before trusting the projection.
What is Provisioned Throughput and do agent teams need it?
Provisioned Throughput rents dedicated model units — fixed tokens-per-minute capacity — by the hour, with no-commitment, 1-month, and 6-month terms and cheaper hourly rates on longer commitments. It suits steady, always-on, high-volume serving where utilization stays high around the clock. Bursty agent traffic almost never fits that shape; most agent programs are better served by on-demand plus caching plus batch.
Can you negotiate AWS Bedrock pricing?
Not the public on-demand rate card — that is fixed. What is negotiable is the commitment layer: AWS Marketplace private offers and volume-based terms priced against a forecast of committed spend. The negotiation goes better when your consumption is already optimized, because caching, routing, and batch determine the volume you are committing to in the first place.
References
- Amazon Bedrock Pricing — AWS. https://aws.amazon.com/bedrock/pricing/
- AWS What's New — Amazon Bedrock now supports 1-hour duration for prompt caching (January 2026). https://aws.amazon.com/about-aws/whats-new/2026/01/amazon-bedrock-one-hour-duration-prompt-caching/
- Amazon Bedrock User Guide — Increase model invocation capacity with Provisioned Throughput. https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html
- Amazon Bedrock AgentCore Pricing — AWS. https://aws.amazon.com/bedrock/agentcore/pricing/
- Anthropic — Introducing Claude Sonnet 5. https://www.anthropic.com/news/claude-sonnet-5
- Amazon Bedrock User Guide — Increase throughput with cross-Region inference. https://docs.aws.amazon.com/bedrock/latest/userguide/cross-region-inference.html