Five techniques compound. Each one cuts a different slice of the bill — stacking three of them realistically gets you to 70% reduction without any quality loss; all five gets you closer to 85-90%. This guide is the full menu, with measured savings and runnable code for each technique. Each technique links to a focused deep dive at the bottom.

The five techniques, ranked by ROI

Prompt caching (60-90% off input on repeat prompts)
Model tiering (10x cheaper for simple tasks)
Output caps (caps the dominant cost on creative tasks)
Parallelism + batching (cuts wall-clock, not per-token, but enables tighter timeouts and fewer retries)
Retry hygiene (don't pay for runaway retries)

1. Prompt caching — the highest-ROI technique by far

If your system prompt is over 1,000 tokens and you call the same prompt more than twice in a 5-minute window, prompt caching is free money. Anthropic charges cached input at 10% of the input rate; OpenAI does the equivalent automatically at 20%. One field added:

cache_control.py

# Prompt caching: 1 extra field = 10% rate on cached input
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=600,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # cache this part
    }],
    messages=[{"role": "user", "content": question}],
)

Realistic savings: 60-90% off your input cost line if your system prompt is large and stable. Look at your /app/usage dashboard — if cache_read_input_tokens is 0, you have 30 minutes of work ahead that pays back forever. Deep dive.

2. Model tiering — use the cheapest model that's good enough

Claude Haiku 4.5 costs roughly 10% of Claude Opus 4.7. For classification, formatting, routing decisions, simple summaries — Haiku is more than enough. Reserve Opus for the genuinely hard reasoning calls.

tiering.py

# Pick the cheapest model that's still good enough for the task.
# Escalate by difficulty: cheap (Haiku 4.5 / Gemini 2.5 Flash) -> Sonnet 4.6 -> Opus 4.7
def pick_model(difficulty: int) -> str:
    if difficulty <= 2: return "claude-haiku-4-5"     # ~10x cheaper than Opus
    if difficulty <= 4: return "gemini-2-5-flash"     # very strong, cheap output
    if difficulty <= 7: return "claude-sonnet-4-6"
    return "claude-opus-4-7"

How to build the tiering decision: a small classifier first (Haiku itself can do this in 100 input tokens), then route. Measure output quality with held-out evals. Don't let intuition pick — bench it.

Tier 0 (Haiku 4.5): classification, extraction, translation, summarization of <2K input — cost-per-call cents
Tier 1 (Gemini 2.5 Flash): longer summarization, code completion, simple tool use — also cents but better at long context
Tier 2 (Sonnet 4.6): real reasoning, multi-step agents, vision tasks — the production workhorse
Tier 3 (Opus 4.7): only when you've confirmed Sonnet doesn't get it right

3. Output caps — limit what they generate, not just what they read

Output tokens are typically 4-5x more expensive than input. A loose max_tokens=4096 default invites the model to ramble. Cap aggressively, and use stop sequences to cut at the right marker:

caps.py

# Two-layer caps: per-call max_tokens + per-key wallet cap
client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[...],
    max_tokens=800,             # cap each call
    stop=["</answer>"],         # cut at terminator
)
# /app/keys → set "Spending limit" per key:
#   Production: $50/day · Experiment: $5/day · CI: $1/day

Pair the per-call max_tokens with a per-key wallet spending cap in /app/keys — a runaway loop or bug can't exceed your daily ceiling. Two layers, both important.

4. Parallelism — for batch-shaped workloads

If you're processing 1,000 independent items (classifying feedback, scoring leads, generating descriptions), serial calls take an hour. Async with AsyncOpenAI and a concurrency limit gets you there in under a minute — and tighter timeouts let you fail-fast on slow upstreams. Parallelism doesn't cut per-token cost, but it enables architectural patterns (idempotent processing, tighter retry budgets) that do.

5. Retry hygiene — don't pay for runaway retries

Naive code retries on every error including 400s. That's paying for failures (4xx requests at Kunavo aren't billed, but you still burn rate-limit budget and wall-clock). Correct policy: retry only 408/429/5xx, exponential backoff with jitter, max 5 attempts and 30 seconds total wait. Anything else, fail fast and surface the error.

What the savings look like in practice

Realistic measured savings on a Kunavo-using SaaS with $5,000/month spend:

Prompt caching added: $2,000 saved (−40%)
Haiku tiering for classification: $800 saved (−16%)
Tighter max_tokens: $400 saved (−8%)
Combined: $3,200/month saved, $1,800 spent (−64% total)

These compound — each next technique works on the already-reduced baseline. Without caching applied first, the others save less. Order matters: start with #1, then #2, then #3.

What does not work as advertised

"Embedding for everything" — embeddings save on text search, not on generation. They're cheap but they don't reduce LLM-call cost
"Local fine-tunes for cost" — only justifiable if you're processing 10M+ tokens/day on a narrow task. For most teams, prompt caching + Haiku tiering beats the operational complexity of self-hosting
"Cheaper aggregators" — Kunavo is already 30% under official upstream pricing. Cheaper aggregators usually win by silently swapping your requested model to a cheaper one

Where to go next

Apply caching first — it's the highest ROI per hour invested. Then instrument your /app/usage dashboard breakdown by model and tier. Then tier. Track monthly cost-per-active-user as your north star metric — if it's flat-to-down while engagement grows, you're winning.

AI cost optimization — the complete guide to cutting 70-90% off your LLM bill