S3 Vectors for Enterprise RAG: Cost Math and Latency Limits

Amazon S3 Vectors is object storage with native support for storing and querying vector embeddings: $0.06 per logical GB-month to store, $0.20 per GB to write, and queries metered per call instead of billed per cluster-hour. It went generally available on December 2, 2025, with the limit raised to two billion vectors per index, and AWS's headline claim is a total cost reduction of "up to 90%" versus specialized vector databases. The claim is real. It is also scoped, and the scope is the whole story: an S3 vector index is the right store when your corpus is large, cold, and queried politely, and the wrong one when an interactive agent is waiting on the other end of every call.

Here is the number that reframes the decision. AWS's own pricing example puts a 10-million-vector RAG corpus with one million queries a month at $11.38. The OpenSearch Serverless production floor, two compute units running before you store a single vector, is about $350 a month. That is a 30x gap before the collection is even sized to hold the corpus, and it is not a rounding artifact. It is a different billing philosophy: durability-priced storage with a meter on reads, instead of an always-on cluster you rent whether anyone queries it or not.

The first time we priced a client's 40-million-chunk document archive three ways, the S3 Vectors column looked like a typo. This article shows that math at three corpus scales, then draws the line the pricing page will not draw for you: where the cheap store breaks agent UX, and how the hybrid pattern — hot vectors in OpenSearch, cold corpus in S3 — gets you both numbers at once.

What an S3 Vector Bucket Actually Is (and Is Not)

S3 Vectors introduces a new bucket type, the vector bucket, which holds vector indexes instead of objects. You create an index with a fixed dimension count (1 to 4,096), write vectors with a key and optional metadata, and run similarity queries against it through a dedicated API. No cluster sits underneath any of it. At GA, one index holds up to two billion vectors — a 40x jump from the 50-million preview limit — and one bucket holds up to 10,000 indexes.
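A minimal sketch of that create, write, and query flow, expressed as request builders rather than live calls so it runs without AWS credentials. The parameter names follow the boto3 `s3vectors` client as we understand it and should be checked against the current SDK reference; the bucket name, index name, metadata keys, and the choice of cosine distance are hypothetical.

```python
from typing import Any, Optional


def create_index_request(bucket: str, index: str, dimension: int,
                         non_filterable_keys: Optional[list[str]] = None) -> dict[str, Any]:
    """Keyword arguments for create_index; the dimension is fixed at creation."""
    if not 1 <= dimension <= 4096:
        raise ValueError("dimension must be between 1 and 4,096")
    request: dict[str, Any] = {
        "vectorBucketName": bucket,
        "indexName": index,
        "dataType": "float32",
        "dimension": dimension,
        "distanceMetric": "cosine",  # assumption: pick what your embedding model expects
    }
    if non_filterable_keys:
        request["metadataConfiguration"] = {
            "nonFilterableMetadataKeys": list(non_filterable_keys)
        }
    return request


def put_vectors_request(bucket: str, index: str, key: str,
                        embedding: list[float], metadata: dict[str, Any]) -> dict[str, Any]:
    """Keyword arguments for put_vectors with a single keyed vector."""
    return {
        "vectorBucketName": bucket,
        "indexName": index,
        "vectors": [{"key": key, "data": {"float32": embedding}, "metadata": metadata}],
    }


def query_vectors_request(bucket: str, index: str, embedding: list[float],
                          top_k: int = 5) -> dict[str, Any]:
    """Keyword arguments for a similarity query against one index."""
    return {
        "vectorBucketName": bucket,
        "indexName": index,
        "queryVector": {"float32": embedding},
        "topK": top_k,
        "returnMetadata": True,
        "returnDistance": True,
    }


# Hypothetical usage against a real account:
#   import boto3
#   client = boto3.client("s3vectors")
#   client.create_index(**create_index_request("rag-archive", "contracts", 1024, ["chunk_text"]))
#   client.query_vectors(**query_vectors_request("rag-archive", "contracts", query_embedding))
```

Keeping the request shapes in plain functions makes the fixed-dimension constraint testable before anything touches the service.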

The deliberate trade

Every vector database before this one made the same bet: vectors belong in RAM next to a running process, and you pay for that process around the clock. S3 Vectors bets the other way — vectors live in object storage at object-storage prices, and the query path is metered per call. AWS states the consequence plainly: infrequent queries return in under one second, while frequently accessed indexes serve queries at around 100 milliseconds or less. Warm is fast-ish. Cold is sub-second, not sub-100ms. You are trading tail latency for a bill that scales with use instead of with time.

What it is not

It is not a search engine. No BM25, no hybrid lexical-plus-vector scoring, no aggregations, no faceting — the features that make OpenSearch a product rather than an index. It is also not the retrieval strategy itself; whether agents should query a vector store at all, versus calling tools over live systems, is the MCP versus RAG question, and it comes first. S3 Vectors answers exactly one question: where do embeddings live, and what does it cost to ask them things.

S3 Vectors Economics: One Corpus, Three Price Tags

Cost tables in vendor posts tend to hide their assumptions, so here are ours: 1,024-dimension float32 vectors — 4 KB of vector data plus roughly 2 KB of metadata and key, the same 6.17 KB composition AWS uses in its own pricing examples. Query volumes are 250K, 1M, and 10M per month across the three scales. Prices are us-east-1, verified against the AWS pricing pages in July 2026.

Corpus	S3 Vectors	OpenSearch Serverless	pgvector on RDS
1M vectors (~6 GB)	~$4.50/mo (our estimate at published rates)	~$350/mo — the 2-OCU production floor	~$330/mo (db.r6g.xlarge, single-AZ) plus storage
10M vectors (~60 GB)	$11.38/mo (AWS's worked example)	~$700-1,400/mo — 4-8 OCUs once the index outgrows the ~6 GB RAM per OCU	~$660-1,300/mo (db.r6g.2xlarge, single vs Multi-AZ)
500M vectors (~2.9 TB)	$1,320.47/mo (AWS's worked example, 10M queries)	Five figures — OCU counts climb into the dozens	Not a line item — a sharding project
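The corpus sizes in the table follow from the per-vector assumption, and the arithmetic is short enough to check. This sketch uses only the figures stated above (1,024 float32 dimensions, 6.17 KB per vector, $0.06 per GB-month); treating a GB as 1024² KB is our assumption, chosen because it lands close to AWS's published storage figure for the 500-million-vector example.

```python
BYTES_PER_FLOAT32 = 4
KB_PER_VECTOR = 6.17             # vector data + metadata + key, per the article
STORAGE_USD_PER_GB_MONTH = 0.06  # S3 Vectors storage rate quoted in the article


def vector_payload_kb(dimensions: int) -> float:
    """Raw embedding size in KB for float32 components."""
    return dimensions * BYTES_PER_FLOAT32 / 1024


def corpus_gb(vector_count: int) -> float:
    """Logical corpus size in GB, assuming binary units (1 GB = 1024**2 KB)."""
    return vector_count * KB_PER_VECTOR / 1024**2


def storage_usd_per_month(vector_count: int) -> float:
    """Monthly storage charge at the quoted per-GB rate."""
    return corpus_gb(vector_count) * STORAGE_USD_PER_GB_MONTH


if __name__ == "__main__":
    print(f"payload per vector: {vector_payload_kb(1024):.0f} KB")
    for count in (1_000_000, 10_000_000, 500_000_000):
        print(f"{count:>11,} vectors: {corpus_gb(count):8.1f} GB, "
              f"${storage_usd_per_month(count):.2f}/mo storage")
```

The storage line for 500 million vectors comes out within a cent of the storage component in AWS's worked example, which suggests the size assumption matches theirs.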

Three things in that table deserve a second look.

Storage is nearly free; reads are not

Decompose AWS's 500-million-vector example: $176.52 for storage, $98.07 for PUTs, $1,045.88 for queries. Queries are 79% of the bill. The economics have not made retrieval free — they have moved the meter from time to usage. At 10 million queries a month the meter is still a bargain. Push sustained load toward tens of queries per second and it crosses what a dedicated cluster costs — exactly the point where you should not be on the metered store anyway.
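The decomposition is easy to reproduce. A small sketch using only the three component figures quoted above and the $0.20 per GB write rate; the implied write volume is backed out from those numbers, not taken from AWS's example.

```python
PUT_USD_PER_GB = 0.20  # write rate quoted in the article

# Components of AWS's 500-million-vector worked example, as quoted above.
bill = {"storage": 176.52, "puts": 98.07, "queries": 1045.88}

total = round(sum(bill.values()), 2)
shares = {item: round(100 * cost / total) for item, cost in bill.items()}

# Derived, not quoted: how many GB of writes the PUT line implies.
implied_gb_written = bill["puts"] / PUT_USD_PER_GB

if __name__ == "__main__":
    print(f"total: ${total:,.2f}")
    for item, pct in shares.items():
        print(f"{item:>8}: {pct}%")
    print(f"implied writes: {implied_gb_written:.0f} GB/month")
```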

The floor is the killer at small scale

For the one-million-vector corpus, the interesting number is not what S3 Vectors costs. It is what the alternatives cost for doing nothing. An OpenSearch Serverless collection with high availability bills a minimum of 2 OCUs at $0.24 per OCU-hour — roughly $350 a month of idle floor. A pgvector instance big enough to take seriously runs about the same. S3 Vectors has no floor. A pilot that would have carried a four-figure annual infrastructure line now costs about as much as a sandwich.
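The floor is plain multiplication, sketched here from the OCU rate quoted above. The 730-hour month is our assumption for an average month.

```python
HOURS_PER_MONTH = 730    # assumption: average hours in a month
OCU_USD_PER_HOUR = 0.24  # OpenSearch Serverless rate quoted in the article


def ocu_monthly_usd(ocus: int) -> float:
    """Monthly cost of keeping `ocus` compute units running around the clock."""
    return ocus * OCU_USD_PER_HOUR * HOURS_PER_MONTH


floor = ocu_monthly_usd(2)         # the high-availability minimum
annual_floor = floor * 12
gap_vs_s3_example = floor / 11.38  # AWS's 10M-vector S3 Vectors example

if __name__ == "__main__":
    print(f"idle floor: ${floor:.2f}/mo, ${annual_floor:,.2f}/yr")
    print(f"gap vs the $11.38 example: {gap_vs_s3_example:.1f}x")
    print(f"4 to 8 OCUs: ${ocu_monthly_usd(4):,.0f} to ${ocu_monthly_usd(8):,.0f}/mo")
```

The same function reproduces the 4 to 8 OCU range in the 10-million-vector row of the table.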

Scoping the 90% claim

AWS's wording is "reduce the total cost of storing and querying vectors by up to 90% when compared to specialized vector database solutions." The "up to" is doing honest work. The claim holds best when the baseline is an always-on cluster sized for a corpus that gets queried rarely — the cold-archive case, where 30x gaps show up. It compresses toward parity as query volume rises, and it says nothing about latency, which is the dimension you gave up. Repeat the number; keep the scope attached.

A note on the adjacent decision: whether your inference runs on Bedrock or on SageMaker endpoints is a separate control-plane choice with its own economics, and we priced it in Bedrock versus SageMaker. The token side of a RAG bill (embeddings and generation) lives in the Bedrock pricing guide.

The Latency Budget: Agentic RAG Changes the Math

A human running a search tolerates a 700ms retrieval without noticing. An agent does not run one retrieval. Agentic RAG means the model plans its own retrievals — decompose the question, query, read, re-query. A multi-hop task routinely issues four to eight retrieval calls, sequentially, because each query depends on what the last one returned. Retrieval latency does not add to an agent task; it multiplies through it. This is the piece most coverage of the service misses, and it is the variable that decides the architecture.

[Figure: two timelines comparing retrieval latency budgets. An interactive agent making 6 sequential retrieval calls at roughly 700 ms each accumulates 4.2 seconds of pure retrieval and blows a 5-second response budget; a background research agent making 80 calls over a 20-minute run spends 56 seconds retrieving, about 5 percent of wall clock.]

Where it breaks: interactive agents

Run the arithmetic against AWS's stated numbers. A warm, frequently queried index at ~100ms per call: six hops cost 0.6 seconds of retrieval. Workable inside a five-second response budget. Now the honest case: the reason you chose this store is that most of the corpus is not frequently queried. At the under-one-second cold end, call it 700ms per query: six hops cost 4.2 seconds of pure retrieval before the model generates a single token. Add two model turns at a couple of seconds each and the user is staring at a spinner for eight-plus seconds. That is not a degraded experience. That is an abandoned one.

Where nobody cares: background agents

Flip the workload. A compliance research agent grinds through a 20-minute run and issues 80 retrieval calls along the way. At 700ms each, retrieval consumes 56 seconds — about 5% of wall clock, invisible next to inference time. Batch document triage, overnight report generation, eval pipelines replaying thousands of retrievals: all of them burn latency where no human is watching. Paying an always-on cluster to shave 600ms off calls nobody is waiting for is the least defensible line in the budget.

The question to ask is not "is 100-700ms fast enough?" It is: how many sequential retrievals sit between the user and the answer, and is a human watching the gap?
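That question reduces to a few lines of arithmetic. A sketch using the per-call latencies and hop counts from the two scenarios above; the function names are ours, and the 700 ms cold figure is the article's working assumption, not an AWS number.

```python
def retrieval_seconds(hops: int, per_call_ms: float) -> float:
    """Total retrieval time when each call waits on the previous one."""
    return hops * per_call_ms / 1000


def retrieval_share(hops: int, per_call_ms: float, wall_clock_seconds: float) -> float:
    """Fraction of a run's wall clock spent on sequential retrieval."""
    return retrieval_seconds(hops, per_call_ms) / wall_clock_seconds


if __name__ == "__main__":
    warm = retrieval_seconds(hops=6, per_call_ms=100)  # interactive, warm index
    cold = retrieval_seconds(hops=6, per_call_ms=700)  # interactive, cold index
    batch = retrieval_seconds(hops=80, per_call_ms=700)
    print(f"interactive warm: {warm:.1f}s of a 5s budget")
    print(f"interactive cold: {cold:.1f}s of a 5s budget")
    print(f"background: {batch:.0f}s, "
          f"{retrieval_share(80, 700, 20 * 60):.1%} of a 20-minute run")
```

Plugging in your own hop count and measured per-call latency is usually enough to tell which side of the line a workload falls on.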

The Decision Framework: Three Axes

Every store-selection argument we have refereed reduces to three measurable axes.

Query rate. Below roughly one sustained query per second — a few million calls a month — metered pricing wins by an order of magnitude or more. At sustained tens of QPS, the meter crosses the cluster, and latency pressure usually arrives before the cost crossover does.
Corpus temperature. What fraction of vectors gets touched in a given week? Enterprise corpora are brutally long-tailed — the contracts from 2019 must be searchable and almost never are. Durability-priced storage exists for exactly that cold tail.
Latency tolerance. Sequential agent hops in front of a waiting human demand the warm path or a real cluster. Background and batch consumers do not.

When all three align — big, cold, asynchronous — the cheapest store is simply the correct one, and paying for OCUs would be architectural vanity. When they split, you do not have to choose. That is the tiering pattern.
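The three axes can be written down as a toy classifier. This is a sketch, not a sizing tool: the thresholds (10 QPS for "tens", 10% for "cold") are our assumptions drawn loosely from the figures in this article, and the rule that a workload failing all three axes belongs on a cluster is our reading of the framework.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    sustained_qps: float   # average sustained queries per second
    hot_fraction: float    # share of vectors touched in a typical week
    human_waiting: bool    # is a person watching the response?
    sequential_hops: int   # retrievals between the user and the answer


def recommend_store(w: Workload, qps_limit: float = 10.0, cold_limit: float = 0.10) -> str:
    """Return 's3-vectors', 'hybrid', or 'cluster' from the three axes."""
    low_rate = w.sustained_qps < qps_limit
    cold = w.hot_fraction <= cold_limit
    asynchronous = not (w.human_waiting and w.sequential_hops > 1)
    favourable = sum([low_rate, cold, asynchronous])
    if favourable == 3:
        return "s3-vectors"  # big, cold, asynchronous
    if favourable == 0:
        return "cluster"     # hot, busy, interactive
    return "hybrid"          # the axes split: tier it


if __name__ == "__main__":
    archive = Workload(0.2, 0.03, human_waiting=False, sequential_hops=80)
    assistant = Workload(0.5, 0.05, human_waiting=True, sequential_hops=6)
    search = Workload(50, 0.60, human_waiting=True, sequential_hops=6)
    for name, w in [("archive", archive), ("assistant", assistant), ("search", search)]:
        print(name, "->", recommend_store(w))
```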

Hybrid Tiering: Hot Vectors in OpenSearch, Cold Corpus in S3

The pattern that was a duct-taped workaround during preview is now first-class: AWS ships two GA integration modes between S3 Vectors and OpenSearch, and they map to two different intents.

[Figure: architecture diagram of hybrid vector tiering. A query router sends the hot path to an Amazon OpenSearch tier holding roughly 5-10 percent of vectors with millisecond-class queries, and the long-tail path to an S3 Vectors bucket holding 90-95 percent of the corpus at 6 cents per GB-month. Between the tiers, OpenSearch Ingestion promotes hot subsets up and the S3 vector engine mode offloads cold data down.]

Mode one: export to OpenSearch Serverless

An OpenSearch Ingestion pipeline exports a chosen S3 vector index into an OpenSearch Serverless vector collection. This is promotion: the subset of the corpus taking real interactive traffic gets copied up into RAM-backed infrastructure with full OpenSearch capability — hybrid lexical-plus-vector scoring, aggregations, rich filtering. The cold master copy stays in the bucket at $0.06 per GB.

Mode two: S3 as the vector engine for OpenSearch Service

The inverse direction: OpenSearch Service can use Amazon S3 as the vector engine behind an index, offloading vector data to object storage while keeping sub-second query capability through the familiar OpenSearch API. Same tiering economics, but the application talks to one endpoint and the tier boundary becomes an implementation detail rather than a routing decision in your code.

In practice the split lands at 5-10% hot. Whatever the access logs say is actually warm — last-two-quarters documents, the index behind the customer-facing assistant — lives in OpenSearch. The other 90-95% sits in the vector bucket, fully queryable, at storage prices. The expensive tier stays small because promotion is a pipeline run, not a migration.
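One way to turn access logs into a promotion list is sketched below. It assumes tiering at index granularity (promotion exports a chosen index) and a hypothetical mapping of index name to weekly query count; the 10% cap and the tier labels are illustrative.

```python
def split_hot_cold(access_counts: dict[str, int],
                   hot_fraction: float = 0.10) -> tuple[set[str], set[str]]:
    """Pick the most-queried indexes, up to hot_fraction of the total, as the hot tier."""
    if not 0 < hot_fraction <= 1:
        raise ValueError("hot_fraction must be in (0, 1]")
    # Rank by query count (descending), breaking ties by name for determinism.
    ranked = sorted(access_counts, key=lambda name: (-access_counts[name], name))
    n_hot = int(len(ranked) * hot_fraction)
    # Never promote an index that saw no traffic at all.
    hot = {name for name in ranked[:n_hot] if access_counts[name] > 0}
    cold = set(ranked) - hot
    return hot, cold


def route(index_name: str, hot: set[str]) -> str:
    """Router decision: promoted indexes go to OpenSearch, the rest to the bucket."""
    return "opensearch" if index_name in hot else "s3-vectors"


if __name__ == "__main__":
    weekly_queries = {f"idx-{i:02d}": i * 100 for i in range(20)}  # hypothetical log summary
    hot, cold = split_hot_cold(weekly_queries, hot_fraction=0.10)
    print(sorted(hot), len(cold))
    print(route("idx-19", hot), route("idx-03", hot))
```

Rerunning the split on a schedule and diffing the hot set against what is already promoted gives the input for the next pipeline run.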

Operational Notes for Enterprise RAG

A few things that will bite you if you find them in production instead of here.

Bedrock Knowledge Bases support is GA

When you create a Knowledge Base in Amazon Bedrock or SageMaker Unified Studio, you can point it at an existing S3 vector index or let Quick Create provision one. For teams standardizing on Bedrock for enterprise RAG, this removes the last excuse for defaulting the KB store to an OpenSearch collection nobody sized deliberately — which is how most of the surprise line items we audit got born.

Metadata limits shape your filter design

Each vector carries up to about 40 KB of total metadata, but only 2 KB of it can be filterable, across a maximum of 50 keys, with up to 10 keys per index declared non-filterable. Two consequences. First, filterable space is scarce: spend it on what queries actually constrain — tenant ID, ACL tags, document type, date — and push chunk text into non-filterable metadata. Second, the split is permanent: non-filterable keys are declared at index creation and cannot be flipped later. Schema mistakes here are reindexing events, not config changes.
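Those limits are cheap to enforce before a write ever reaches the service. A sketch of a pre-flight check using the figures above; measuring size as compact UTF-8 JSON is our assumption about how bytes are counted, we read the 50-key cap as applying to a vector's metadata keys overall, and the key names are hypothetical.

```python
import json
from typing import Any

MAX_TOTAL_BYTES = 40 * 1024      # total metadata per vector
MAX_FILTERABLE_BYTES = 2 * 1024  # filterable metadata per vector
MAX_KEYS_PER_VECTOR = 50
MAX_NON_FILTERABLE_KEYS = 10     # declared per index, fixed at creation


def _json_bytes(data: dict[str, Any]) -> int:
    """Approximate stored size as compact UTF-8 JSON (an assumption)."""
    return len(json.dumps(data, separators=(",", ":")).encode("utf-8"))


def check_metadata(metadata: dict[str, Any], non_filterable_keys: set[str]) -> list[str]:
    """Return a list of limit violations; an empty list means the write looks safe."""
    problems = []
    filterable = {k: v for k, v in metadata.items() if k not in non_filterable_keys}
    if len(non_filterable_keys) > MAX_NON_FILTERABLE_KEYS:
        problems.append("too many non-filterable keys declared for the index")
    if len(metadata) > MAX_KEYS_PER_VECTOR:
        problems.append("too many metadata keys on the vector")
    if _json_bytes(metadata) > MAX_TOTAL_BYTES:
        problems.append("total metadata exceeds 40 KB")
    if _json_bytes(filterable) > MAX_FILTERABLE_BYTES:
        problems.append("filterable metadata exceeds 2 KB")
    return problems


if __name__ == "__main__":
    chunk = {"tenant_id": "acme", "doc_type": "contract", "chunk_text": "x" * 3000}
    print(check_metadata(chunk, non_filterable_keys={"chunk_text"}))  # chunk text kept out of filters
    print(check_metadata(chunk, non_filterable_keys=set()))           # chunk text counted as filterable
```

The second call shows the failure mode described above: chunk text left in filterable space blows the 2 KB budget on its own.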

Index layout is a cost and isolation lever

AWS's own examples shard one corpus into 40 per-tenant indexes, and the instinct is right: query charges scale with the size of the index scanned, so smaller targeted indexes cost less per call and give you tenant isolation for free. With 10,000 indexes per bucket, index-per-tenant is a viable enterprise layout, and streaming writes sustain up to 1,000 PUTs per second for near-real-time inserts.
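An index-per-tenant layout needs little more than a deterministic name and a guard on the bucket limit. The sketch below uses a hypothetical naming scheme (lowercase slug with a `tenant-` prefix); it is not an AWS naming rule, so check the service's actual index-name constraints before adopting it.

```python
import re

MAX_INDEXES_PER_BUCKET = 10_000  # per-bucket limit quoted in the article


def tenant_index_name(tenant_id: str, prefix: str = "tenant") -> str:
    """Derive a stable index name from a tenant identifier (hypothetical scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", tenant_id.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive an index name from {tenant_id!r}")
    return f"{prefix}-{slug}"


def plan_indexes(tenant_ids: list[str]) -> dict[str, str]:
    """Map each tenant to its own index, refusing collisions and over-limit plans."""
    names = {tenant: tenant_index_name(tenant) for tenant in tenant_ids}
    if len(set(names.values())) != len(names):
        raise ValueError("two tenants map to the same index name")
    if len(names) > MAX_INDEXES_PER_BUCKET:
        raise ValueError("plan exceeds 10,000 indexes; split across buckets")
    return names


if __name__ == "__main__":
    plan = plan_indexes(["Acme Corp", "Globex", "Initech LLC"])
    for tenant, index in plan.items():
        print(f"{tenant} -> {index}")
```

Because each query names exactly one index, the tenant boundary is enforced by the request itself rather than by a metadata filter.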

Re-run your retrieval evals after the swap

Different stores rank differently. A cheaper ANN index is not guaranteed to return the same top-k, and recall differences that look small on paper move answer quality in ways only measurement catches. Treat a vector-store migration like a model swap: run the harness from our enterprise RAG evaluation guide before and after, and gate the cutover on retrieval metrics, not on the bill.
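A before-and-after comparison can start with two simple measures: how much the new store's top-k overlaps the old one's, and recall against labeled relevant documents. This is a generic sketch, not the harness from the evaluation guide; the 0.02 tolerance and the document IDs are illustrative assumptions.

```python
def overlap_at_k(baseline: list[str], candidate: list[str], k: int) -> float:
    """Share of the top-k slots where both stores returned the same documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(baseline[:k]) & set(candidate[:k])) / k


def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of known-relevant documents found in the top k results."""
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(relevant & set(retrieved[:k])) / len(relevant)


def passes_gate(recall_before: float, recall_after: float, max_drop: float = 0.02) -> bool:
    """Allow the cutover only if recall dropped by no more than max_drop."""
    return recall_after >= recall_before - max_drop


if __name__ == "__main__":
    old_store = ["a", "b", "c", "d"]  # hypothetical ranked IDs from the current store
    new_store = ["a", "c", "e", "b"]  # the same query against the new store
    relevant = {"a", "b"}
    print(overlap_at_k(old_store, new_store, k=3))
    before = recall_at_k(relevant, old_store, k=3)
    after = recall_at_k(relevant, new_store, k=3)
    print(before, after, passes_gate(before, after))
```

Averaged over a representative query set, these numbers make the cutover decision explicit instead of leaving it to the bill.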

When Not to Use S3 Vectors

The honest list, because the pricing is seductive enough to get misapplied:

High-QPS semantic search. Product search, recommendation feeds, anything at sustained tens of queries per second — the query meter and the latency profile both point at a cluster.
Strict interactive latency budgets. If a human waits on sequential retrievals, cold-path sub-second is not good enough. Warm it up in an OpenSearch tier or do not put it here.
Heavy filtered-query patterns. 2 KB of filterable metadata is tight, and complex multi-predicate filtering is a real database's job.
Hybrid lexical-plus-vector retrieval. No BM25 means no true hybrid scoring inside the store; if your relevance depends on it — and for legal and support corpora it usually does — you need OpenSearch in the path.
Anything needing aggregations or analytics over the vector set. It is an index, not an engine.

None of these are flaws. They are the terms of the trade. The store is cheap because it declines to do the expensive things, and the architecture wins when you route around that on purpose instead of discovering it in an incident review.

Frequently Asked Questions

What is Amazon S3 Vectors?

Amazon S3 Vectors is native vector storage and similarity search built into S3, generally available since December 2, 2025. A vector bucket holds up to 10,000 indexes, each storing up to two billion vectors of 1 to 4,096 dimensions, priced at $0.06 per logical GB-month plus metered per-query charges — no clusters or capacity units to provision.

How much does S3 Vectors cost for a typical RAG corpus?

AWS's worked examples price 10 million vectors with one million monthly queries at $11.38 per month, and 500 million vectors with ten million queries at about $1,320. The same 10-million-vector corpus on OpenSearch Serverless starts near the 2-OCU floor of roughly $350 a month and grows with RAM — at low query rates the gap is more than an order of magnitude.

Is S3 Vectors fast enough for agentic RAG?

It depends on who is waiting. AWS states around 100 milliseconds or less for frequently queried indexes and under one second for infrequent ones. A background research agent making dozens of sequential retrievals absorbs that easily, but an interactive agent making six cold retrievals in front of a user accumulates about four seconds of pure retrieval — enough to break the experience. Route interactive hot paths to a warm tier.

Does Amazon Bedrock Knowledge Bases support S3 Vectors?

Yes, the integration is generally available. When creating a Knowledge Base in Amazon Bedrock or SageMaker Unified Studio you can select an existing S3 vector index or provision one through Quick Create, making it a drop-in store for Bedrock-based RAG pipelines.

What is the difference between S3 Vectors and OpenSearch for vector search?

S3 Vectors is durability-priced storage with metered queries: very low cost, sub-second but not consistently sub-100ms, no lexical search or hybrid scoring. OpenSearch is an always-on engine: millisecond-class, full search features, and a floor cost that runs whether or not anyone queries it. AWS ships GA integrations in both directions, so the practical answer for enterprise RAG is usually both, tiered.

When should I avoid S3 Vectors entirely?

Skip it for sustained high-QPS workloads like product search, for interactive experiences with strict latency budgets, for retrieval that depends on heavy metadata filtering beyond the 2 KB filterable limit, and for anything requiring hybrid BM25-plus-vector relevance. Those workloads belong on OpenSearch or a comparable engine, with the vector bucket holding the cold tail behind them.

References

AWS News Blog — Amazon S3 Vectors now generally available with increased scale and performance. https://aws.amazon.com/blogs/aws/amazon-s3-vectors-now-generally-available-with-increased-scale-and-performance/
Amazon S3 pricing — S3 Vectors rates and worked examples. https://aws.amazon.com/s3/pricing/
Amazon S3 Vectors feature page — capabilities and integrations. https://aws.amazon.com/s3/features/vectors/
Amazon S3 documentation — S3 Vectors limitations and restrictions. https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-limitations.html
Amazon S3 documentation — Using S3 Vectors with OpenSearch Service. https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-opensearch.html
Amazon OpenSearch Service pricing — Serverless OCU rates and minimums. https://aws.amazon.com/opensearch-service/pricing/
Amazon RDS for PostgreSQL pricing — db.r6g instance-hour rates used in the pgvector column. https://aws.amazon.com/rds/postgresql/pricing/