When I look at AI API costs for projects that feel too expensive, the same patterns show up almost every time. One model used for everything. System prompts that grew over six months and nobody cleaned up. Caching available but never enabled. Output length uncapped. None of these are hard to fix — they just require someone to actually look at the bill and trace it back to the code.
I've watched teams go from $800/month to $280/month in a week without changing a single thing about their product experience. These are the changes that actually move the number, ranked by how much impact they tend to have.
This is the highest-leverage change most teams can make, and it's usually underused because it requires classifying requests upfront. The insight is simple: not every request needs your best and most expensive model.
In a typical AI-powered application, requests break down into roughly three tiers:
A team that routes 70% of requests to mini models and 30% to flagship models typically cuts their bill by 50–70% compared to running everything on the flagship. The engineering effort is a routing classifier — often just a small model call or a rule-based system — that pays for itself in the first week.
If you're not using prompt caching and you have a system prompt longer than ~500 tokens, you're paying full price for tokens you're sending identically on every request. Both Anthropic and OpenAI cache repeated prompt prefixes at a discount:
For a 2,000-token system prompt on Claude 3.5 Sonnet across 50,000 requests/month: without caching, that's $300 in input tokens just from the system prompt. With caching at 10%, that's $30. The rest of your input costs are unchanged.
Caching works best when the prefix of your prompt is stable across requests — system instructions, persona descriptions, tool definitions, and document contexts that don't change per-request. Structure your prompts so the stable part comes first, dynamic content (user message, conversation history) comes last.
System prompts bloat over time. You add a new instruction here, a new edge case there, and six months later your system prompt is 3,000 tokens of overlapping, redundant, sometimes contradictory instructions. Every redundant token costs money on every request.
A prompt compression pass typically cuts system prompt length by 30–50% with no quality loss. The process:
On a 3,000-token system prompt compressed to 1,800 tokens, running 80,000 requests/month on GPT-4o, that's 96 million fewer input tokens — $240/month in savings. Just from cleaning up the prompt.
Both OpenAI and Anthropic offer batch APIs that process requests asynchronously (within hours or overnight) at 50% of standard pricing. If any part of your workload doesn't need immediate responses — nightly data processing, content generation pipelines, bulk classification, weekly report generation — batch is an easy switch.
OpenAI Batch API pricing on GPT-4o: $1.25/$5.00 per million tokens (vs $2.50/$10.00 standard). Same model, same quality, half the cost. The only trade-off is latency — requests complete within 24 hours rather than seconds.
Output tokens cost 4–5× more than input tokens. If your model is generating longer responses than necessary, you're paying a significant premium. A few things that help:
Cutting average output from 350 tokens to 220 tokens across 100,000 monthly requests on GPT-4o saves $130/month from output alone. Across a year, that's $1,560.
This is different from prompt caching — it's application-level caching of the actual model response. If the same or very similar questions get asked repeatedly (a common pattern in support chatbots, FAQ tools, and search-augmented applications), caching the response entirely eliminates the API call.
Semantic similarity caching (using embedding similarity to find near-matches) can catch paraphrased versions of the same question. Even a 10–15% cache hit rate meaningfully reduces your token spend.
| Optimization | Typical savings | Effort |
|---|---|---|
| Model routing (70% to mini) | 40–65% | Medium — needs a classifier |
| Prompt caching | 20–40% on input | Low — restructure prompt prefix |
| System prompt compression | 10–25% | Low — one-time cleanup |
| Batch API for async work | 50% on batch requests | Low — switch API endpoint |
| Output length control | 10–30% on output | Low — prompt + max_tokens |
| All combined | 50–75% | 1–2 weeks of engineering |
See how much you'd save by switching models or reducing token counts — plug your numbers into the calculator.
Open the Calculator →