
Imagine you are paying a consultant by the word. Every word you speak to them, every document you hand them, every email you show them — you are billed for it. Now imagine that same consultant is reading an entire filing cabinet's worth of documents every single time you ask them a simple question — even when most of those documents have nothing to do with what you asked.
That is what a poorly built AI agent does. And you are the one paying the bill.
What is a token, in plain English?
Every AI agent runs on a language model — Claude, GPT, Gemini, or similar. These models do not read words the way humans do. They read in chunks called tokens. One token is roughly four characters, or about three-quarters of a word.
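The four-characters-per-token rule is good enough for budgeting. A minimal sketch of that back-of-envelope estimate (real tokenizers vary by model and give exact counts; this heuristic is an approximation for English text):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English text.

    Model providers ship exact tokenizers; this approximation is close
    enough for cost budgeting before you commit to a design.
    """
    return max(1, len(text) // 4)

# A 400-character email body is roughly 100 tokens
print(estimate_tokens("x" * 400))  # → 100
```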
When your agent does anything — reads an email, processes a form, responds to a customer — it sends information to the model and receives a response. Every piece of that exchange is measured in tokens. And tokens cost money.
Here is what that looks like in real numbers:
- GPT-4o charges approximately $2.50 per million input tokens
- Claude Sonnet charges approximately $3 per million input tokens
- A single image sent to a model can consume between 1,500 and 3,000 tokens
- The same information extracted as text might use 100 to 200 tokens
That gap — 1,500 tokens versus 150 tokens for the same information — is where businesses quietly bleed money every month.
Why this does not matter at first — and why it suddenly does
When you are testing an AI agent, you are sending a handful of requests a day. The cost is invisible. Then you go live. The agent starts handling 500 customer enquiries a day, processing 200 invoices a day, sending 1,000 follow-ups a week. Suddenly the token count is in the millions, and the monthly bill looks nothing like the estimate.
This is not a hypothetical. Companies have scrapped AI projects not because the technology failed — but because nobody managed the cost of running it.
Four things a competent developer must do
1. Prefer text over images — always
If your agent processes invoices, forms, purchase orders, or any document that contains text, that text should be extracted first using OCR (optical character recognition) and only the text sent to the model. Sending the raw image instead is like handing your consultant a photograph of a document instead of the document itself — and being charged accordingly.
A scanned invoice image might cost 2,000 tokens to process. The same invoice as extracted text costs 200. Across 500 invoices a month, that is the difference between 1 million tokens and 100,000 tokens — for identical output.
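That arithmetic can be checked directly. A quick sketch using the figures above (2,000 tokens per scanned image, 200 per extracted text, and the GPT-4o-class rate of $2.50 per million input tokens as an example price):

```python
INVOICES_PER_MONTH = 500
IMAGE_TOKENS = 2_000      # scanned invoice sent as a raw image
TEXT_TOKENS = 200         # same invoice as OCR-extracted text
PRICE_PER_MILLION = 2.50  # example input rate in USD

image_tokens_total = INVOICES_PER_MONTH * IMAGE_TOKENS   # 1,000,000 tokens
text_tokens_total = INVOICES_PER_MONTH * TEXT_TOKENS     # 100,000 tokens

image_cost = image_tokens_total / 1_000_000 * PRICE_PER_MILLION
text_cost = text_tokens_total / 1_000_000 * PRICE_PER_MILLION

print(f"As images: {image_tokens_total:,} tokens → ${image_cost:.2f}/month")
print(f"As text:   {text_tokens_total:,} tokens → ${text_cost:.2f}/month")
```

Identical output from the model, at a tenth of the token volume.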
2. Compress images when visuals genuinely matter
Sometimes the image itself is what the agent needs — a product photo, a damaged goods claim, a site inspection image. In those cases, compress the image before sending it. A high-resolution image and a compressed version at half the size often produce the same result from the model. The token cost drops significantly.
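How many tokens an image consumes depends on its dimensions. For GPT-4o-class models, OpenAI publishes a tiling rule for high-detail images: scale to fit a 2048px square, scale the shortest side down to 768px, then charge 170 tokens per 512px tile plus an 85-token base. A sketch of that calculation (the exact constants can change between model versions, so treat this as illustrative):

```python
import math

def gpt4o_image_tokens(width: int, height: int) -> int:
    """Approximate high-detail image token cost per OpenAI's tiling rules."""
    # Step 1: scale the image to fit within a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale so the shortest side is at most 768px
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: 170 tokens per 512px tile, plus an 85-token base charge
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(4032, 3024))  # full-resolution phone photo
print(gpt4o_image_tokens(512, 384))    # aggressively downscaled copy
```

Downscaling the phone photo to 512px before sending cuts the token cost by roughly two-thirds here, and the model often produces the same answer.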
3. Use prompt caching
Every AI agent carries a set of instructions it sends to the model with every request — who it is, what it is supposed to do, what rules it follows, what context it needs. This is called the system prompt. In a naively built agent, this full prompt gets sent — and paid for — with every single request.
Prompt caching changes this. Models like Claude let you cache the system prompt so it is written once (at a small premium over the standard rate) and then read back at a fraction of the cost on every subsequent request. Cache reads on Claude cost $0.30 per million tokens compared to $3.00 per million for standard input — a 90 percent reduction on every cached portion.
For an agent that handles 1,000 tasks a day with a 2,000-token system prompt, that prompt accounts for roughly 2 million tokens a day — and caching cuts the cost of almost all of them by 90 percent. At standard rates, that is real money, every month.
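Using the figures in this section ($3.00 per million uncached, $0.30 per million for cache reads; the small one-time cache-write premium is ignored here for simplicity), the saving works out as follows:

```python
TASKS_PER_DAY = 1_000
SYSTEM_PROMPT_TOKENS = 2_000
UNCACHED_RATE = 3.00    # USD per million standard input tokens
CACHE_READ_RATE = 0.30  # USD per million cache-read tokens

daily_tokens = TASKS_PER_DAY * SYSTEM_PROMPT_TOKENS      # 2,000,000 tokens/day
uncached_cost = daily_tokens / 1_000_000 * UNCACHED_RATE
cached_cost = daily_tokens / 1_000_000 * CACHE_READ_RATE

print(f"Without caching: ${uncached_cost:.2f}/day → ${uncached_cost * 30:.2f}/month")
print(f"With caching:    ${cached_cost:.2f}/day → ${cached_cost * 30:.2f}/month")
```

On the system prompt alone, that is $180 a month versus $18 — before the agent has done anything unique to any task.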
4. Do not use a language model for things that do not need one
This is where the most money is wasted, and it is the least obvious mistake.
Language models are designed for reasoning — understanding messy, ambiguous inputs and producing flexible, intelligent responses. They are not designed for deterministic tasks. Checking whether an order total exceeds a threshold, formatting a date, routing a message because it contains a specific keyword — these are jobs for regular code. Code does them instantly and for free. Sending them to a language model is like hiring a surgeon to change a lightbulb.
The right system uses both: code handles everything mechanical, the language model handles everything that requires understanding.
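A concrete sketch of that split — the keywords, the threshold, and the routing labels below are illustrative placeholders, not a real workflow:

```python
URGENT_KEYWORDS = {"refund", "complaint", "cancel"}  # illustrative
APPROVAL_THRESHOLD = 500.00                          # illustrative, in USD

def handle_order(message: str, order_total: float) -> str:
    """Deterministic checks run first: plain code, instant, zero tokens."""
    if order_total > APPROVAL_THRESHOLD:
        return "escalate-for-approval"
    if any(word in message.lower() for word in URGENT_KEYWORDS):
        return "route-to-support"
    # Only genuinely ambiguous messages reach the (paid) language model
    return "send-to-llm"

print(handle_order("Please cancel my subscription", 49.99))  # → route-to-support
print(handle_order("Thanks, all good!", 49.99))              # → send-to-llm
```

Every message the deterministic layer resolves is a model call — and a token bill — that never happens.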

The number that should concern you
A McKinsey analysis found that among companies piloting AI tools, operational costs frequently exceeded initial estimates once the systems moved to production scale. Token inefficiency is a significant but underreported contributor. The businesses that manage AI costs well treat token optimisation the same way they treat cloud cost management — as an engineering discipline, not an afterthought.
If your developer cannot explain how they are managing token usage in your agent, that cost is coming to you.
FAQ
1. What exactly is a token in an AI system and why does it determine cost?
A token is the smallest unit a language model processes — roughly four characters or three-quarters of a word. Every piece of text your agent sends to the model and every response the model generates is measured in tokens. Model providers charge per token, both for what goes in and what comes out. At low volumes the cost is negligible. At production scale — thousands of tasks per day — it becomes one of the largest ongoing costs in running an AI agent.
2. How much does it actually cost to send an image to an AI model compared to text?
A typical image sent to a model like GPT-4o or Claude consumes between 1,500 and 3,000 tokens depending on resolution. The same information extracted as text would use 100 to 300 tokens. If your agent processes 500 documents a month as images rather than extracted text, you could be paying 10 to 15 times more than necessary for the same output. Over a year, that difference compounds significantly.
3. What is OCR and why should every AI agent that reads documents use it?
OCR stands for optical character recognition — it is the process of reading text from an image and converting it into plain text. If your agent handles invoices, purchase orders, forms, or any scanned documents, OCR should be the first step before anything reaches the language model. The extracted text costs a fraction of the image in tokens and produces the same — often better — results, because models generally reason more accurately on clean text than on image-encoded text.
4. What is prompt caching and how much money can it realistically save?
Prompt caching lets you store the fixed portion of your prompt — usually the system instructions, context, and rules — so the model does not charge full price for it on every request. On Claude, cache reads cost $0.30 per million tokens compared to $3.00 per million for uncached input, a 90 percent reduction. For an agent running 1,000 tasks daily with a 2,000-token system prompt, that cuts the cost of roughly 2 million prompt tokens a day to a tenth of the full rate.
5. When should an AI agent use code instead of a language model?
Any time the task is deterministic — meaning the correct answer is always the same given the same input — code is the right tool. Routing messages by keyword, checking whether a number meets a threshold, formatting a date, sorting a list, counting items — these should all be handled by code. Language models are expensive and relatively slow compared to code for these tasks. Use the language model only where reasoning, understanding, or flexible interpretation is genuinely required.
6. What happens to costs when an AI agent scales from testing to production?
In testing, you might be sending a few dozen requests per day. The cost is trivial and easy to ignore. In production, the same agent might handle 10,000 requests per day. If the system was built without token efficiency in mind, the costs scale linearly — or worse, because production prompts often include more context than test prompts. Businesses regularly discover that their production AI costs are three to five times higher than their testing costs suggested.
7. How do I know if the developer building my AI agent is managing token costs properly?
Ask them directly: how are you handling prompt length, are you using caching, and which tasks in this workflow will be handled by code rather than the model? A developer thinking about efficiency will have specific answers. Someone who has not considered it will give you vague assurances. You should also ask for a projected cost per task at your expected volume — any serious developer should be able to give you this number before the build begins.
8. Is token optimisation only relevant for high-volume businesses?
No. Any business running an agent that processes more than a few hundred tasks per month should be paying attention to this. The threshold is lower than most people think. A small business running a lead follow-up agent that handles 50 interactions per day is already generating over 1.5 million tokens per month if the prompts are unoptimised. At standard rates, that is a meaningful monthly cost for a small operation.
9. Can using cheaper models instead of the best model reduce costs?
Yes — and this is a legitimate optimisation strategy. Not every task in an agent requires the most capable model. Routine classification, simple routing decisions, and templated responses can often be handled by smaller, cheaper models. The expensive, high-capability model should be reserved for tasks that actually require it. A well-architected agent uses the right model for each task, not the same model for everything.
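One way to express that per-task model selection is a simple lookup from task type to model tier. A sketch (the model names and task categories here are placeholders, not recommendations):

```python
# Illustrative model tiering: route each task type to the cheapest
# model that can handle it. Names are hypothetical placeholders.
MODEL_FOR_TASK = {
    "classify": "small-cheap-model",
    "route": "small-cheap-model",
    "template_reply": "small-cheap-model",
    "draft_contract_summary": "large-capable-model",
}

def pick_model(task_type: str) -> str:
    # Default to the capable model for unknown tasks:
    # a cheap wrong answer costs more than an expensive right one.
    return MODEL_FOR_TASK.get(task_type, "large-capable-model")

print(pick_model("classify"))  # → small-cheap-model
```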
10. What is the risk of not addressing token optimisation before building an AI agent?
The agent works but the economics do not. You build something that functions technically, go live, and then discover the monthly operating cost is two or three times what you expected. At that point, fixing it requires rearchitecting parts of the system — which costs time and money that would not have been needed if it were designed correctly from the start. Token optimisation is not a feature to add later. It is an engineering discipline that needs to be built in from the beginning.
11. Are there tools that automatically optimise token usage in AI systems?
Some platforms offer token usage dashboards and built-in caching. LangChain and LlamaIndex — popular frameworks for building AI agents — have some optimisation features. But tools cannot substitute for architectural decisions. Whether to use OCR, whether to cache, whether to use code versus a model — those are design choices that have to be made intentionally by the developer. Tools can help you measure and monitor, but they do not make the right decisions automatically.
12. How should I compare AI agent providers on cost, not just capability?
Ask for a cost-per-task estimate at your expected volume, broken down by what contributes to that cost. Ask specifically about caching strategy, image handling, and how deterministic tasks are managed. Then compare providers not just on what the agent can do but on what it costs per month when it is actually doing that thing at scale. A capable agent that is expensive to run is often worse than a slightly less capable agent that is efficiently built.