Five techniques compound. Each one cuts a different slice of the bill — stacking three of them realistically gets you to 70% reduction without any quality loss; all five gets you closer to 85-90%. This guide is the full menu, with measured savings and runnable code for each technique. Each technique links to a focused deep dive at the bottom.
The five techniques, ranked by ROI
- Prompt caching (60-90% off input on repeat prompts)
- Model tiering (10x cheaper for simple tasks)
- Output caps (caps the dominant cost on creative tasks)
- Parallelism + batching (cuts wall-clock, not per-token, but enables tighter timeouts and fewer retries)
- Retry hygiene (don't pay for runaway retries)
1. Prompt caching — the highest-ROI technique by far
If your system prompt is over 1,000 tokens and you call the same prompt more than twice in a 5-minute window, prompt caching is free money. Anthropic charges cached input at 10% of the input rate; OpenAI does the equivalent automatically at 20%. One field added:
# Prompt caching: 1 extra field = 10% rate on cached input
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=600,
system=[{
"type": "text",
"text": LONG_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"}, # cache this part
}],
messages=[{"role": "user", "content": question}],
)Realistic savings: 60-90% off your input cost line if your system prompt is large and stable. Look at your /app/usage dashboard — if cache_read_input_tokens is 0, you have 30 minutes of work ahead that pays back forever. Deep dive.
2. Model tiering — use the cheapest model that's good enough
Claude Haiku 4.5 costs roughly 10% of Claude Opus 4.7. For classification, formatting, routing decisions, simple summaries — Haiku is more than enough. Reserve Opus for the genuinely hard reasoning calls.
# Pick the cheapest model that's still good enough for the task.
# Cost order (Kunavo prices): Haiku 4.5 < Gemini 3 Flash < Sonnet 4.6 < Opus 4.7
def pick_model(difficulty: int) -> str:
if difficulty <= 2: return "claude-haiku-4-5" # ~10x cheaper than Opus
if difficulty <= 4: return "gemini-3-flash" # very strong, cheap output
if difficulty <= 7: return "claude-sonnet-4-6"
return "claude-opus-4-7"How to build the tiering decision: a small classifier first (Haiku itself can do this in 100 input tokens), then route. Measure output quality with held-out evals. Don't let intuition pick — bench it.
- Tier 0 (Haiku 4.5): classification, extraction, translation, summarization of <2K input — cost-per-call cents
- Tier 1 (Gemini 3 Flash): longer summarization, code completion, simple tool use — also cents but better at long context
- Tier 2 (Sonnet 4.6): real reasoning, multi-step agents, vision tasks — the production workhorse
- Tier 3 (Opus 4.7): only when you've confirmed Sonnet doesn't get it right
3. Output caps — limit what they generate, not just what they read
Output tokens are typically 4-5x more expensive than input. A loose max_tokens=4096 default invites the model to ramble. Cap aggressively, and use stop sequences to cut at the right marker:
# Two-layer caps: per-call max_tokens + per-key wallet cap
client.chat.completions.create(
model="claude-sonnet-4-6",
messages=[...],
max_tokens=800, # cap each call
stop=["</answer>"], # cut at terminator
)
# /app/keys → set "Spending limit" per key:
# Production: $50/day · Experiment: $5/day · CI: $1/dayPair the per-call max_tokens with a per-key wallet spending cap in /app/keys — a runaway loop or bug can't exceed your daily ceiling. Two layers, both important.
4. Parallelism — for batch-shaped workloads
If you're processing 1,000 independent items (classifying feedback, scoring leads, generating descriptions), serial calls take an hour. Async with AsyncOpenAI and a concurrency limit gets you there in under a minute — and tighter timeouts let you fail-fast on slow upstreams. Parallelism doesn't cut per-token cost, but it enables architectural patterns (idempotent processing, tighter retry budgets) that do.
5. Retry hygiene — don't pay for runaway retries
Naive code retries on every error including 400s. That's paying for failures (4xx requests at Kunavo aren't billed, but you still burn rate-limit budget and wall-clock). Correct policy: retry only 408/429/5xx, exponential backoff with jitter, max 5 attempts and 30 seconds total wait. Anything else, fail fast and surface the error.
What the savings look like in practice
Realistic measured savings on a Kunavo-using SaaS with $5,000/month spend:
- Prompt caching added: $2,000 saved (−40%)
- Haiku tiering for classification: $800 saved (−16%)
- Tighter
max_tokens: $400 saved (−8%) - Combined: $3,200/month saved, $1,800 spent (−64% total)
These compound — each next technique works on the already-reduced baseline. Without caching applied first, the others save less. Order matters: start with #1, then #2, then #3.
What does not work as advertised
- "Embedding for everything" — embeddings save on text search, not on generation. They're cheap but they don't reduce LLM-call cost
- "Local fine-tunes for cost" — only justifiable if you're processing 10M+ tokens/day on a narrow task. For most teams, prompt caching + Haiku tiering beats the operational complexity of self-hosting
- "Cheaper aggregators" — Kunavo is already 30% under official upstream pricing. Cheaper aggregators usually win by silently swapping your requested model to a cheaper one
Where to go next
Apply caching first — it's the highest ROI per hour invested. Then instrument your /app/usage dashboard breakdown by model and tier. Then tier. Track monthly cost-per-active-user as your north star metric — if it's flat-to-down while engagement grows, you're winning.