Practitioner notes from wiring MCP servers into production chat interfaces. Cost, attention, and tool-set discipline.


When I first wired up MCP servers to talk to a chat interface, I had the notion that having more tools hooked up would lead to better results. Wire up the database, wire up the docs, wire up the APIs, and the AI will just KNOW MORE THINGS. The good thing about that notion is that it is basically true. The bad thing is that the false part is actually really, really expensive.

Here's what I learned from wasting too many tokens on sessions that went downhill slowly and mostly unnoticed until they fell apart completely.

The thing nobody tells you about tool results in context

All the results from your MCP tools (like a fetched document, an API schema, or a database query) are displayed in the context window. However, the distribution of the LLM's attention over a long context is not even. The start and the end of the long context receive very high attention, while the middle parts receive increasingly worse attention. As a session grows, early tool results will drift into the low-attention zone of the long context.

A U-shaped line graph showing 'Attention High' at the start (System Prompt/Early Tools) and end (Current Turn) of a context window, with a deep 'Attention Low' dip in the middle where older tool results are stored.
A U-shaped line graph showing 'Attention High' at the start (System Prompt/Early Tools) and end (Current Turn) of a context window, with a deep 'Attention Low' dip in the middle where older tool results are stored.
A U-shaped graph illustrating the 'Lost in the Middle' phenomenon, where the LLM's attention is highest at the beginning (system prompts) and end (latest turns) of the context window, but significantly lower for tool results buried in the middle. This failure mode is silent. The model does not say "I have lost track of that". Instead, it continues to answer confidently, but from a degraded representation, perhaps with an API signature that is subtly wrong, or with a constraint that it previously knew three turns ago but is now ignoring. The results from a tool have a freshness window. They are most reliable immediately after the model has retrieved them for the first time. They are least reliable after several turns that are unrelated to the tool result, after which it has been buried in the middle of the conversation.

Re-fetching is not waste — it's a quality decision

My original thought was not to re-fetch the data as it was already in scope, why pay for more data that would just return the same results. That was completely the wrong approach.

Results from a newly started session are positioned at the end of a very short context. They are in the highest attention zone. Of course, there is a cost factor here. If your AI platform supports cache reads (which most major LLM providers do), then re-fetching a previously retrieved document will typically result in a cache read instead of a read from the original location. The cost for cache read tokens is typically 90% below the cost for input tokens of the same length (e.g. 90% below $3.00/M for M characters for GPT-4o and Gemini, $0.30/M for Claude at $3.00/M) [2][3][4]. So even though re-fetching a document (previously retrieved in a session) in a new session has the highest quality, it is also cheapest compared to degrading a document over the turns of a conversation and keeping that degraded copy.

A side-by-side comparison: a long session showing high input costs and degraded response quality versus multiple fresh sessions using 'Cache Reads' for tool schemas at 90% lower cost with high response accuracy.
A side-by-side comparison: a long session showing high input costs and degraded response quality versus multiple fresh sessions using 'Cache Reads' for tool schemas at 90% lower cost with high response accuracy.

Instead of fetching data for a particular analysis right at the start of a session for convenience, fetch close to the turn where that data is actually needed. Starting a fresh session for each sub-question in turn is nearly always the correct thing to do — not because you lose something important by not having previous write findings external to tool, but because a very short context is best for every tool result.

Prompt caching is the actual cost lever

Note that prompt caching tiers at greatly reduced input prices (80-90% below standard input price) are available for deeply reduced pricing from all of the major LLM providers, Claude not least.

ProviderStandard InputCache ReadDiscount
Claude Sonnet (Anthropic)$3.00 / 1M$0.30 / 1M90% off
GPT-4o (OpenAI)$2.50 / 1M$1.25 / 1M50% off
Gemini 1.5 Pro (Google)$1.25 / 1M$0.31 / 1M~75% off

Pricing sources: [2] [3] [4]

The following things repeat on every turn in an MCP-integrated chat: system prompt, tool definitions (i.e. the definition for each tool registered with MCP — i.e. the full tool schema for that tool), and static configuration (i.e. that which is not changing for the duration of the chat — and thus could be cached). If you're not explicitly caching out these large, static pieces of information which are sent on every turn, then you are paying full price for the same bytes on every turn.

However, the scaffolding around the conversation actually under investigation (i.e. your messages, tool output, model generated responses) does get cached from turn to turn. So while the key parts of your input (i.e. the conversation actually under investigation) will be re-sent in every turn of the session, the scaffolding around that conversation to support tool investigation will be cached. This can save most of your input bill for a session of 20 turns where the system prompt (plus all the tool schemas) totals 6,000 tokens.

Most platforms provide caching of data via cache-control headers or similar settings exposed via API. It's worth spending an hour to make sure you've got this set up correctly — the default is typically to not cache anything.

Too many tools loaded is its own problem

This one surprised me as well. I had gone through and registered all of our MCP servers to the model thinking that they would be ignored by the model as needed. There were two problems with this.

First, each tool definition takes token budget on each turn, whether it is actually called or not. Tool schemas are added to the model's context on each request, and charged for the same bytes each time. A session with 30 tools has far greater overhead per turn than a session with 5 tools scoped to that.

Second, my assumption was wrong. Loading more tool definitions does not help the model ignore the irrelevant ones — it widens the surface area for picking the wrong one. The model ends up selecting similar tools from completely different domains, or inventing argument patterns by averaging across too many frameworks it has seen.

There are hard limits but they are provider and API surface specific. More relevant though is a practical ceiling which hits long before the hard cap.

ProviderHard LimitBest PracticeNotes
Claude (Anthropic)No hard count limit — context window is the effective ceilingUse <10 tools loaded upfront; use deferred Tool Search Tool for larger setsAnthropic's engineering blog shows 50+ MCP tools = 55K–72K tokens before conversation starts (e.g. GitHub alone: 35 tools, ~26K tokens); official threshold for switching to deferred/on-demand loading is 10+ tools or >10K tokens in definitions
GPT-4o (OpenAI)64 (legacy functions endpoint); 128 on Azure OpenAI AssistantsLimit tools up front; shorten descriptionsConflicting limits across API surfaces — legacy vs. new tools schema, OpenAI-hosted vs. Azure Foundry
Gemini (Google)512 function declarations (returns 400 error)Keep active set to 10–20 (explicitly stated in docs)512 is an empirical hard cap; accuracy degrades well before that
AWS Bedrock (Converse API)No maximum published — ToolConfiguration docs state only "minimum 1 item"; no tool count quota in AWS service limits pageFollow the underlying model's guidanceEffective ceiling is the hosted model's context window and its own limits (e.g. Claude-on-Bedrock inherits Claude's thresholds)

Sources: [1] [3] [4] [5] [6]

All the Gemini's have very blunt guidance but it applies. The practice I have found most useful is load only the tools for the session goals. So if the session is about querying a knowledge base then do not load Jira and GitHub etc. Tools scope to session scope.

I no longer want to manually pick tools from a large pool of potential candidates that could help me with a specific problem only to over select and load "just in case" tools. On the flip side, under selecting and not having a critical tool to perform tasks would be equally terrible. Jarvis Registry [8] provides the much needed functionality of context-based discovery [7] (skill, description, declared capabilities via vector search) that pins the correct set of MCP servers to query based off the natural language of the query. The caller doesn't need to know the servers even exist and only the relevant tools show up while respecting access controls. The rest of the tools that aren't relevant to the problem at hand will remain out of context to avoid confusion.

A workflow diagram showing a user query being processed by a registry that performs vector search to select a specific subset of relevant MCP tools from a large catalog, rather than loading all available tools into the LLM context.
A workflow diagram showing a user query being processed by a registry that performs vector search to select a specific subset of relevant MCP tools from a large catalog, rather than loading all available tools into the LLM context.

What I do now

My real workflow for any more complex task utilizing MCP tools:

  • Define the session scope first. Write one sentence describing the specific question this session will answer. If it spans more than two or three distinct tool domains, split it into separate sessions.
  • Load only the tools this session needs. Don't activate Jira and GitHub if the session is about querying a knowledge base. Scope the tool set to the scope of the task.
  • Fetch close to the turn that uses the data. Don't pull documents and tool results at session start for convenience — fetch immediately before the turn that needs them.
  • Watch context usage at ~40%. Treat it as a signal to wrap up and start fresh, or at minimum re-fetch the most critical sources so they're back in the high-attention zone.
  • Write findings externally, immediately. Any schema detail, confirmed behavior, or design decision goes to a note file right away. Chat history is not a reliable artifact.
  • Start fresh when the model feels off. Repetition, self-contradiction, or subtly wrong answers are signals that context has degraded past the useful threshold. Don't try to correct it in-session — open a new one.

This is all straightforward and requires no complexity. It's all about treating your context window as a work surface of limited dimensions, not a repository to add more and more stuff to.


For building large-scale MCP integrations across multiple teams, where session discipline, tool-level access control, and cost observability translate into organizational issues as opposed to individual habits, we recommend evaluating Jarvis Registry [8] and Jarvis Chat from ASCENDING Inc. The Jarvis Registry provides context-based discovery [7] of the appropriate tools for each session by performing vector search over an enterprise catalog of tools. It also enables centralized governance, per-tool access control for each tool as well as complete audit trails. Jarvis Chat exposes the governed set of tools to the correct users through a multi-LLM chat interface. The patterns described above are encoded into the platform as opposed to relying on individuals to follow certain practices.


References

  1. Anthropic Engineering — "Introducing advanced tool use on the Claude Developer Platform" (tool count thresholds, token cost examples, Tool Search Tool): https://www.anthropic.com/engineering/advanced-tool-use
  2. Anthropic — Claude API pricing (cache read vs. standard input rates): https://www.anthropic.com/pricing
  3. OpenAI — Function calling and tools reference (64-tool legacy limit, 128-tool Azure Assistants limit): https://platform.openai.com/docs/guides/function-calling
  4. Google — Gemini function calling documentation (512 hard cap, 10–20 recommended active set): https://ai.google.dev/gemini-api/docs/function-calling
  5. AWS — Amazon Bedrock ToolConfiguration API reference (minimum 1 item, no maximum documented): https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_ToolConfiguration.html
  6. AWS — Amazon Bedrock service quotas (no tool count quota listed): https://docs.aws.amazon.com/general/latest/gr/bedrock.html
  7. Jarvis Registry — Context-Based Discovery feature: https://jarvisregistry.com/FEATURES/#4-skill-context-based-discovery
  8. ASCENDING Inc — Jarvis Registry product page: https://ascendingdc.com/jarvis-ai/jarvis-registry/