WTF is AI Cost Optimization!?
A simple explanation for humans who don't speak robot (yet)
Hey again! Back from last week’s observability deep-dive where we learned how to actually see what your AI is doing instead of praying it behaves.
You’ve got observability now. You can see every request, every token, every dollar flying out the window. And what you’re seeing is... terrifying.
$3,000/day for a chatbot that answers the same 50 questions on repeat. GPT-5.2 processing “What are your business hours?” like it’s solving the Riemann hypothesis. Your CFO asking why the “AI experiment” line item is bigger than the engineering team’s coffee budget.
Welcome to AI Cost Optimization. The unsexy practice of not lighting money on fire while still delivering quality AI experiences.
You Guys LOVE Horror Stories
The 10 Billion Token Month
When Anthropic launched Claude Code’s “Max Unlimited” plan at $200/month, they thought they’d built in enough margin. They were spectacularly wrong.
Some users consumed 10 billion tokens in a single month—equivalent to processing 12,500 copies of War and Peace. Users discovered they could set Claude on automated tasks: check work, refactor, optimize, repeat until bankruptcy.
Anthropic tried 10x premium pricing, dynamic model scaling, weekly rate limits. Token consumption still went supernova. The evolution from chat to agent happened overnight: per-user token consumption jumped roughly 1000x, a phase transition rather than gradual change.
The $60K Surprise
One company shared their journey publicly: Month 1 was $2,400. Month 2 hit $15,000. Month 3: $35,000. By Month 4 they were touching $60,000, an annual run-rate north of $700K.
Their monitoring before this? Monthly billing statements. That’s it.
The Tier-1 Problem
Tier-1 financial institutions are spending up to $20 million daily on generative AI. Daily. At those numbers, a 10% optimization isn't a nice-to-have; it's $2 million per day back in your pocket.
The Cost (Dec 2025)
OpenAI:
GPT-5.2 (flagship): $1.75 input / $14 output per million tokens
GPT-5: $1.25 input / $10 output per million tokens
GPT-5 mini: $0.25 input / $2 output per million tokens
GPT-5 nano: $0.05 input / $0.40 output per million tokens
Anthropic:
Claude Opus 4.5: $5 input / $25 output per million tokens
Claude Sonnet 4.5: $3 input / $15 output per million tokens
Claude Haiku 4.5: $1 input / $5 output per million tokens
Google:
Gemini 3 Pro (flagship): $2 input / $12 output per million tokens
Gemini 2.5 Pro: $1.25 input / $10 output per million tokens
Gemini 2.5 Flash: $0.30 input / $2.50 output per million tokens
Gemini 2.0 Flash: $0.10 input / $0.40 output per million tokens
The Math That Ruins Your Day
A “What are your business hours?” query (~500 input + ~50 output tokens):
With GPT-5.2: ~$0.0016 per query
With Gemini 2.0 Flash: ~$0.00007 per query
With GPT-5 nano: ~$0.000045 per query
That’s a 35x difference for a question a regex could answer.
At 100K queries/month:
GPT-5.2: $160/month
Gemini 2.0 Flash: $7/month
GPT-5 nano: $4.50/month
Cached response: ~$0/month
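Want to sanity-check these numbers yourself? It's a three-line function. A quick Python sketch, with prices hardcoded from the table above and the same rough token counts:

```python
# Rough per-query and monthly cost math for a short FAQ-style query.
# Prices are dollars per million tokens, copied from the table above.
PRICES = {
    "gpt-5.2":          {"in": 1.75, "out": 14.00},
    "gpt-5-nano":       {"in": 0.05, "out": 0.40},
    "gemini-2.0-flash": {"in": 0.10, "out": 0.40},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

for model in PRICES:
    per_query = query_cost(model, input_tokens=500, output_tokens=50)
    print(f"{model}: ${per_query:.6f}/query, ${per_query * 100_000:.2f} per 100K queries")
```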
The Good, Bad, and Ugly (and Stupid) of Cost Optimization
1. Model Routing: Right Tool for the Job (The Good)
The biggest waste: using your most expensive model for everything.
80% of production queries don’t need frontier models. FAQ answers, simple classification, basic extraction, summarization—all can run on nano/Haiku tier models. Only complex reasoning and multi-step planning need the expensive stuff.
The Economics: A routing call using GPT-5 nano costs ~$0.00001. If routing saves you from using GPT-5.2 on 80% of queries, you get a 120x return on the routing investment.
The Hierarchy That Works:
Rule-based routing first (free) — catches 40-60% of obvious cases
Cheap classifier second — handles ambiguous queries
Expensive model only when needed
Real companies report cascading flows: nano → mini → standard → flagship. Most queries never touch the expensive models.
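The first tier doesn't need to be clever. Here's a minimal sketch of that cascade; the FAQ patterns, model names, and the two stub functions are placeholders for your own routing logic and provider calls:

```python
import re

# Tier 0: free rule-based answers for obvious cases (patterns and answers are made up).
FAQ_PATTERNS = {
    r"business hours|opening hours|when are you open": "We're open 9am-6pm ET, Mon-Fri.",
    r"return policy|refund": "You can return any item within 30 days.",
}

def classify_complexity(query: str) -> str:
    # Stand-in for a nano-tier classifier call (~$0.00001/query).
    return "simple" if len(query.split()) < 25 else "complex"

def call_llm(model: str, query: str) -> str:
    # Placeholder for your actual provider call.
    return f"[{model}] answer to: {query}"

def route(query: str) -> str:
    q = query.lower()
    # 1. Rules first: catches 40-60% of traffic for free.
    for pattern, answer in FAQ_PATTERNS.items():
        if re.search(pattern, q):
            return answer
    # 2. Cheap classifier decides whether the nano tier is enough.
    if classify_complexity(q) == "simple":
        return call_llm("gpt-5-nano", q)
    # 3. Only genuinely hard queries hit the flagship.
    return call_llm("gpt-5.2", q)

print(route("What are your business hours?"))  # answered by a rule, $0
```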
2. Prompt Caching: Stop Reprocessing the Same Stuff (The Bad)
Every major provider now offers prompt caching with massive discounts:
OpenAI GPT-5 family: 90% off cached input tokens
Anthropic: 90% off cache reads
Google Gemini: 90% off cache reads (storage fees apply)
The model stores its internal computation states for static content. Instead of re-reading your 50-page company policy for every question, it “remembers” its understanding.
The Economics: A 40-page document (~30,000 tokens), 10 questions:
Without caching: 300,000 input tokens billed
With caching: ~57,000 effective tokens (81% reduction)
What To Cache: System prompts, RAG context, few-shot examples, static reference material. Structure prompts with static content first.
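In practice, "static content first" mostly means marking the big, stable prefix as cacheable. A rough sketch using Anthropic's cache_control flag (the model id and document are placeholders; OpenAI's GPT-5 family caches long static prefixes automatically, no flag required):

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

POLICY_DOC = open("company_policy.txt").read()  # the big static document (~30K tokens)

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=500,
        system=[
            {
                "type": "text",
                "text": "You answer questions about company policy.\n\n" + POLICY_DOC,
                # Static content goes first and gets the cache marker;
                # later calls read it at roughly 10% of the normal input price.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],  # dynamic content last
    )
    return response.content[0].text
```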
3. Semantic Caching: Stop Paying for the Same Question Twice (The Ugly)
User A: “What is your return policy?”
User B: “Whats ur return policy”
User C: “Can I return items?”
Three API calls. Three charges. Same answer.
Store query meanings as embeddings, use similarity search to find matches. If there’s a close match, return the cached response—no LLM call.
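A minimal sketch of that lookup, assuming OpenAI embeddings and a plain in-memory list; the 0.90 threshold and the embedding model are things you'd tune, and call_llm stands in for whatever routed call you'd otherwise make:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
cache: list[tuple[np.ndarray, str]] = []  # (normalized query embedding, cached answer)
THRESHOLD = 0.90  # similarity cutoff; tune against real traffic

def call_llm(query: str) -> str:
    return f"LLM answer to: {query}"  # stand-in for your routed LLM call

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    vec = np.array(resp.data[0].embedding)
    return vec / np.linalg.norm(vec)

def answer(query: str) -> str:
    q_vec = embed(query)  # ~$0.00001 per query
    # Cosine similarity (dot product of normalized vectors) against past queries.
    for vec, cached_answer in cache:
        if float(q_vec @ vec) >= THRESHOLD:
            return cached_answer  # cache hit: no LLM call, no LLM bill
    result = call_llm(query)
    cache.append((q_vec, result))
    return result
```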
The Stats: Research shows 31-33% of queries are semantically similar to previous ones. For customer service, often higher.
Reported hit rates:
General chatbots: 20-30%
FAQ/support bots: 40-60%
The Economics: Embedding cost is ~$0.00001/query. If 30% of 100K queries are cache hits that would otherwise go to Claude Sonnet 4.5 (~$0.003 per query at ~500 input / ~100 output tokens), you save ~$89/month after embedding costs.
4. Batch Processing: 50% Off Everything (The Stupid)
OpenAI, Anthropic, and Google all offer 50% off for non-urgent requests via Batch API. Results typically return within hours, guaranteed within 24.
When to Use: Daily reports, bulk content creation, document processing, embeddings generation, evaluation runs. Anything that doesn’t need immediate response.
The Economics: 1000 summarization requests with GPT-5:
Real-time: $3.00
Batch: $1.50
A startup spending $5,000/month reported saving $1,500-2,000/month just by moving background jobs to batch.
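Mechanically, batch mode means writing your requests to a JSONL file and submitting the whole thing at once. A sketch against OpenAI's Batch API (the file name, model, and documents are placeholders; Anthropic and Google have equivalent batch endpoints):

```python
import json
from openai import OpenAI

client = OpenAI()
documents = ["doc 1 text...", "doc 2 text..."]  # whatever you're summarizing overnight

# 1. One JSONL line per request, each with a unique custom_id.
with open("summaries.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5",
                "messages": [{"role": "user", "content": f"Summarize:\n\n{doc}"}],
            },
        }) + "\n")

# 2. Upload the file and create the batch (50% off, results within 24h).
batch_file = client.files.create(file=open("summaries.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```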
The Fine Print
Reasoning Tokens (The Invisible Tax)
O-series models and GPT-5.2 “Thinking” mode use internal reasoning tokens that are billed as output but not visible in responses. A query returning 200 visible tokens might consume 2,000 reasoning tokens internally.
Track the full usage field, not just visible output.
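On OpenAI, that reasoning spend shows up in the usage object rather than in the message itself. A quick sketch; the model id is a placeholder and the exact usage field names are worth double-checking against your SDK version:

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5.2",  # placeholder reasoning-model id
    messages=[{"role": "user", "content": "Plan a 3-step migration off our legacy billing system."}],
)

usage = resp.usage
visible_chars = len(resp.choices[0].message.content or "")
reasoning = usage.completion_tokens_details.reasoning_tokens  # billed as output, never shown
print(f"prompt={usage.prompt_tokens} output={usage.completion_tokens} "
      f"(of which reasoning={reasoning}), visible chars={visible_chars}")
```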
Long Context Premium
Claude Sonnet 4.5’s 1M token context:
Under 200K tokens: $3/$15
Over 200K tokens: $6/$22.50 (2x on input, 1.5x on output)
Chunk large documents. Only use long context when truly necessary.
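A bare-bones chunker so a giant document never crosses the 200K pricing cliff; the 150K-token budget and the 4-characters-per-token heuristic are assumptions, and a real version would count tokens with your provider's tokenizer:

```python
def chunk_text(text: str, max_tokens: int = 150_000, chars_per_token: int = 4) -> list[str]:
    """Split a document into chunks that stay safely under the 200K pricing threshold."""
    max_chars = max_tokens * chars_per_token
    chunks, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        if size + len(paragraph) > max_chars and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(paragraph)
        size += len(paragraph) + 2
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```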
Tool Use Overhead
Every tool adds tokens: definitions, call blocks, result blocks. The bash tool's definition alone adds 245 input tokens to every request that includes it. In agentic workflows with dozens of tool calls, overhead compounds fast.
What Teams Actually Achieve
Startup A (Customer Service Bot)
Before: $4,500/month
After: Semantic cache (30% hits), routing (50% to Haiku), prompt caching
Result: $1,625/month (64% reduction)
Startup B (Document Analysis)
Before: $12,000/month
After: Batch API, model routing (70% to mini), semantic caching
Result: $3,000/month (75% reduction)
Pattern: 50-80% reductions are achievable for most applications without sacrificing quality.
The Checklist
Today:
Export usage logs from your provider dashboard [_]
Identify your top 3 most expensive prompts [_]
Move batch-eligible work to Batch API (instant 50%) [_]
Enable prompt caching (restructure prompts if needed) [_]
This Week:
Implement rule-based routing for obvious cases [_]
Add semantic caching layer [_]
Audit prompt length (most are 40% bloated) [_]
Set up cost alerting (see the sketch after this checklist) [_]
This Month:
Build full cascading model hierarchy [_]
Fine-tune cache thresholds based on quality [_]
Track cost-per-quality, not just cost-per-token [_]
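For the cost-alerting item above, the minimum viable version is a scheduled check against your provider's usage export. A sketch with made-up numbers and a hypothetical Slack webhook:

```python
import datetime

import requests

DAILY_BUDGET_USD = 200.0  # assumption: pick your own threshold
SLACK_WEBHOOK = None      # e.g. "https://hooks.slack.com/services/..." (hypothetical)

def check_spend(todays_spend: float) -> None:
    """Run on a schedule, with spend pulled from your provider's usage/cost export."""
    if todays_spend <= DAILY_BUDGET_USD:
        return
    message = (f"LLM spend alert {datetime.date.today()}: "
               f"${todays_spend:.2f} against a ${DAILY_BUDGET_USD:.2f} budget")
    print(message)
    if SLACK_WEBHOOK:
        requests.post(SLACK_WEBHOOK, json={"text": message})

check_spend(todays_spend=412.50)  # example value; in practice read it from the usage API
```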
The TL;DR
LLM costs scale linearly with usage: double the traffic, double the bill. Most teams use expensive models for everything. 80% of queries don't need frontier models.
The Solutions:
Model Routing: 35x savings using nano vs flagship
Prompt Caching: 90% off cached tokens
Semantic Caching: 20-60% of queries skip LLM entirely
Batch API: 50% off for 24-hour turnaround
The Results: 50-80% reductions, quality unchanged, payback within first week.
The first time you see a $5,000 bill become $1,500 without quality impact, you’ll wonder why you waited.
Ship optimization. Not invoices.
We’re taking a break for the holidays! I’ll be back on January 7th with “WTF is Happening in AI!? (2026)”.
Happy holidays! 🎄
See you next Wednesday (in January) 🤞


