LLM API Cost: Prompt Caching, Batching, and Model Routing
LLM API costs are controllable. Prompt caching alone cuts 50-90% off the input cost of repeated prompt prefixes. Here's every lever worth pulling: caching, batching, and model routing.
By Andrii Votiakov
LLM API costs follow a predictable trajectory: low at the prototype stage, then quietly aggressive once you're in production with real users. The usual culprit isn't the price per token — it's that nobody optimised the call patterns. The same system prompt gets sent fresh on every request. The cheapest capable model wasn't selected deliberately. Responses get generated synchronously when they could batch overnight at half price. These are all fixable.
Quick answer
The four highest-impact levers are: prompt caching (50% off with OpenAI, up to 90% off with Anthropic for repeated prefixes), batch API (50% off for non-real-time jobs), model routing (use Haiku/Mini/Flash for classification and triage, save the big models for generation), and context length management (trim conversation history aggressively). Applied together, most production LLM bills drop 50-75%.
Prompt caching: the biggest lever you're probably not using
Anthropic prompt caching
Anthropic's prompt caching is explicit: you mark parts of your prompt with a cache_control block. When the same prefix is sent again within the cache TTL (5 minutes by default, 1 hour for extended cache), you pay 10% of the normal input token cost for those cached tokens.
Cache write cost is 25% more than normal input tokens (a one-time hit on the first call). After that, reads are 90% cheaper. If your system prompt is 2,000 tokens and you send 10,000 requests per day, you write the cache once and read it 9,999 times. At Claude Sonnet pricing (~$3/million input tokens), that's the difference between $60/day and $6/day for the system prompt portion alone.
The implementation is a small change to your API call structure. The cache sticks as long as the prefix is byte-for-byte identical — dynamic injection anywhere before the cached block breaks it.
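A minimal sketch of that change, using the official anthropic Python SDK; the model string and prompt text are placeholders, not from any particular project:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."  # your real ~2,000-token prompt

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=1024,
        # Marking the system block with cache_control caches everything up to
        # this point: the first call pays the 25% write premium, later calls
        # within the TTL read it at 10% of the normal input price.
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Keep dynamic content after the cached prefix so the prefix stays
        # byte-for-byte identical between calls.
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```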
OpenAI prompt caching
OpenAI's caching is automatic for prompts of 1,024 tokens or more. You don't opt in. If the same prefix appears in repeated calls within a short window, cached tokens cost 50% less, and there's no write premium.
The catch: OpenAI's cache is less deterministic than Anthropic's. Entries are evicted after a few minutes of inactivity and hits aren't guaranteed. In practice, high-repetition production workloads typically see 40-60% cache hit rates, which translates to roughly 20-30% off the input token cost for the cacheable portions.
To maximise cache hits on OpenAI: put the system prompt first, put static context before dynamic content, and keep the variable part (user message) at the end.
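In practice the ordering looks like this; a sketch with the OpenAI Python SDK, where SYSTEM_PROMPT and REFERENCE_DOCS stand in for your own static content:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."   # static, identical on every call -- goes first
REFERENCE_DOCS = "..."  # static context shared across requests

def ask(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # Static prefix first: this is the part OpenAI can cache
            # automatically once the prompt exceeds 1,024 tokens.
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": REFERENCE_DOCS},
            # Variable content last, so it never invalidates the prefix.
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```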
Batch API: 50% off for non-real-time work
Both OpenAI and Anthropic offer batch APIs that process requests asynchronously within a 24-hour window, at 50% of the real-time price.
This is genuinely useful for:
- Document processing (PDFs, reports, transcripts)
- Bulk classification or tagging pipelines
- Overnight summarisation jobs
- Evaluation runs on test sets
- Embedding generation at scale
If you have a pipeline that processes 100,000 documents per day and doesn't need real-time results, batch API cuts that bill in half. The integration is a file upload of JSONL requests, then a polling call to retrieve results.
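A sketch of that flow with the OpenAI SDK; the documents list and prompt are illustrative placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()

documents = [("doc-1", "First document text..."), ("doc-2", "Second document text...")]

# One JSONL line per request; custom_id lets you match results back later.
with open("batch_input.jsonl", "w") as f:
    for doc_id, text in documents:
        f.write(json.dumps({
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": "Summarise the document."},
                    {"role": "user", "content": text},
                ],
            },
        }) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 50%-off asynchronous window
)

# Poll later (cron job, worker, etc.) and download the results file when done.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id).text
```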
The main mistake I see: people use synchronous API calls for workloads that don't need to be synchronous, out of habit. Queue the work, batch it, retrieve results. Not every pipeline needs sub-second latency.
Model routing: use the cheapest model that can do the job
This is where teams leave the most money on the table. A single application will often make several different types of LLM calls, and they don't all require the same capability.
Rough routing guide:
| Task | Capable model |
|---|---|
| Classification, routing, yes/no | GPT-4o-mini, Claude Haiku, Gemini Flash |
| Extraction from structured text | GPT-4o-mini, Claude Haiku |
| Summarisation (short) | GPT-4o-mini, Claude Haiku |
| Summarisation (long, complex) | GPT-4o, Claude Sonnet |
| Code generation | GPT-4o, Claude 3.7 Sonnet |
| Reasoning, multi-step | GPT-4o, Claude 3.7 Sonnet |
| Creative or nuanced generation | GPT-4o, Claude 3.7 Sonnet, Opus |
The price gap between the tiers is large. As of early 2026:
- Claude 3.5 Haiku: $0.80/million input, $4/million output
- Claude 3.7 Sonnet: $3/million input, $15/million output
- GPT-4o-mini: $0.15/million input, $0.60/million output
- GPT-4o: $2.50/million input, $10/million output
Running everything through Sonnet/GPT-4o when 70% of your calls are classification tasks is the most common over-spend I find. Route classification to Haiku or Mini. Reserve the expensive models for generation and reasoning.
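One lightweight way to enforce this is a routing table keyed by call type; the task names below are illustrative, not a prescribed taxonomy:

```python
# Hypothetical routing table: map each call type your application makes to the
# cheapest model that handles it reliably.
MODEL_BY_TASK = {
    "classify": "claude-3-5-haiku-latest",
    "extract": "claude-3-5-haiku-latest",
    "summarise_short": "claude-3-5-haiku-latest",
    "summarise_long": "claude-3-7-sonnet-latest",
    "generate": "claude-3-7-sonnet-latest",
}

def model_for(task: str) -> str:
    # Fall back to the expensive model only when the task type is unknown.
    return MODEL_BY_TASK.get(task, "claude-3-7-sonnet-latest")
```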
Context length management
Tokens cost money in both directions. Long conversations accumulate context that gets resent in full on every turn. A chat application with 50-turn conversation history might be sending 15,000 tokens of context on every call — most of it irrelevant to the current question.
Strategies:
- Sliding window: Keep only the last N turns in context (N = 10 is enough for most chat apps)
- Summarise old turns: After 20 turns, summarise the first 10 into a compact summary, drop the full transcripts
- Selective retrieval: For document-heavy tasks, use embeddings + vector search to pull only the relevant passages, not the whole document
- Trim system prompts: Review your system prompt every quarter. Production system prompts tend to accumulate cruft. Every 500 tokens you trim saves money on every single call.
A typical chat application can reduce average context length by 40-60% with a sliding window and turn summarisation, with no measurable quality loss for most use cases.
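The sliding window is a few lines; here's a sketch over OpenAI-style message lists (the summarisation variant would replace the dropped turns with a single compact summary message instead):

```python
def trim_history(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep the system prompt plus only the last `max_turns` exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # One turn = one user message + one assistant reply, i.e. two entries.
    return system + rest[-max_turns * 2:]
```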
Structured outputs and token efficiency
Unstructured prose responses are token-expensive. If you're calling an LLM to extract data and then parsing the result, you can often get a shorter, more reliable response by asking for JSON output.
OpenAI's Structured Outputs feature (enforced JSON schema) and Anthropic's tool use / JSON mode both reduce output token count for extraction tasks. A classification response that was "The sentiment of this text is positive because..." becomes {"sentiment": "positive"}. A 200-token response becomes a 12-token response.
For pipelines doing extraction or classification, structured outputs commonly reduce output token usage by 60-80%.
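As an illustration, here's a sentiment classification call using OpenAI's enforced JSON schema; the schema itself is a made-up example:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the sentiment of the user's text."},
        {"role": "user", "content": "The onboarding flow was painless and quick."},
    ],
    # Enforced JSON schema: the model can only return {"sentiment": "..."},
    # so the response is a dozen tokens instead of a paragraph of prose.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "sentiment",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
                },
                "required": ["sentiment"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"sentiment": "positive"}
```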
Embedding cache
If you're building a retrieval-augmented generation (RAG) pipeline, you're likely generating embeddings repeatedly for the same documents. Embeddings are cheap (OpenAI's text-embedding-3-small is $0.02/million tokens) but at scale they add up, and regenerating the same embeddings is pure waste.
Store embeddings in a vector database (Pinecone, Qdrant, pgvector) or even a simple Redis cache keyed by content hash. Once a document is embedded, never embed it again unless the content changes.
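A sketch of the content-hash pattern, using an in-memory dict as a stand-in for Redis or your vector store:

```python
import hashlib
from openai import OpenAI

client = OpenAI()
_cache: dict[str, list[float]] = {}  # stand-in for Redis or your vector DB

def embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:  # only pay for content we haven't embedded before
        result = client.embeddings.create(model="text-embedding-3-small", input=text)
        _cache[key] = result.data[0].embedding
    return _cache[key]
```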
What to avoid
A few patterns that inflate bills unnecessarily:
- Streaming for non-interactive use cases: Streaming adds latency overhead and is pointless for batch pipelines. Turn it off where you don't need it.
- Huge max_tokens settings: If you're setting max_tokens: 4096 defensively when most responses are 200 tokens, you're not wasting money on the limit itself, but you are inviting verbose responses. Set a realistic limit.
- No timeout on hung requests: A hung API call that waits 5 minutes before failing is 5 minutes of compute cost (if you're running inference yourself) and potentially a billing event. Set tight timeouts and retry with backoff.
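For the last two points, both settings are one-liners in the OpenAI SDK; the specific numbers here are placeholders to tune against your own latency budget:

```python
from openai import OpenAI

# The SDK retries transient failures with exponential backoff up to max_retries.
client = OpenAI(timeout=30.0, max_retries=2)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=300,  # a realistic ceiling, not a defensive 4096
    messages=[{"role": "user", "content": "Classify: 'refund not received'"}],
)
```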
See /blog/self-hosting-open-source-llms for when it makes sense to run your own inference rather than paying per-token API prices at all.
Realistic numbers
A recent client running a document processing pipeline (~4 million API calls/month):
- Implemented Anthropic prompt caching on 8,000-token system prompt: -68% on input tokens for that prompt
- Moved document triage step from Sonnet to Haiku: -72% on that call type (40% of total volume)
- Switched batch classification jobs to Batch API: -50% on those jobs
- Added sliding window (last 8 turns) to conversation endpoints: -44% on context tokens
Before: $8,400/month. After: $2,100/month. The pipeline quality metrics didn't change measurably.
If your LLM API bill is growing faster than your usage and you want to work through the specifics, book a call. Most of these changes take a day or two to implement.