The code AI landscape, briefly
Cursor, Aider, Cline, Continue.dev, Claude Code itself — they all sit on top of the same three or four frontier LLMs. The differentiator isn't the model; it's the agentic loop around it, the file indexing, the diff application, the user experience. If you're building one of these (or a coding co-pilot inside your own developer product), this page is the model menu, the cost reality, and the latency budget.
Model menu for code workflows
- Claude Haiku 4.5 — autocomplete (~$0.0001-0.0005 per completion). The latency winner; ~150ms time-to-first-token
- Claude Sonnet 4.6 — chat-with-codebase, multi-file edits, agentic loops. The Cursor-level workhorse
- Claude Opus 4.7 — deep review, architecture critique, security audit. Slower (P50 ~3s) but the strongest reasoner
- Gemini 3 Pro — best for large codebases (2M context). Drop the whole monorepo into the context
- GPT-5 (when enabled) — comparable quality, different stylistic preferences
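One way to encode the menu above is a small routing table keyed by feature. This is a sketch, not a Kunavo-provided API: the feature names and the `pick_model` helper are hypothetical, and `claude-sonnet-4-6` is an assumed model ID inferred from the naming of the two IDs used elsewhere on this page, so check the model list for the exact strings.

```python
# Hypothetical feature -> model routing table based on the menu above.
MODEL_BY_FEATURE = {
    "autocomplete": "claude-haiku-4-5",      # ID used in the completion example
    "chat": "claude-sonnet-4-6",             # assumed ID, verify before use
    "multi_file_edit": "claude-sonnet-4-6",  # assumed ID, verify before use
    "review": "claude-opus-4-7",             # ID used in the review example
}


def pick_model(feature: str, default: str = "claude-sonnet-4-6") -> str:
    """Return the model ID for a feature, falling back to the workhorse model."""
    return MODEL_BY_FEATURE.get(feature, default)
```

Keeping the mapping in one place means a model swap is a one-line change rather than a search through call sites.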
The two core flows
import json

from openai import OpenAI

client = OpenAI(api_key="sk-kunavo-...", base_url="https://api.kunavo.com/v1")


# Code completion (autocomplete-as-you-type)
def complete(file_content: str, cursor_offset: int, file_path: str) -> str:
    # Split the file at the cursor so the model sees both sides (fill-in-middle).
    before = file_content[:cursor_offset]
    after = file_content[cursor_offset:]
    resp = client.chat.completions.create(
        model="claude-haiku-4-5",  # FAST for IDE latency
        messages=[
            {"role": "system", "content": (
                "Complete the code at the cursor. Output ONLY the inserted "
                "text: no markdown, no explanation."
            )},
            {"role": "user", "content": (
                f"<file path=\"{file_path}\">\n{before}<CURSOR>{after}\n</file>"
            )},
        ],
        max_tokens=200,
        stop=["<CURSOR>", "</file>"],
    )
    # content can be None (e.g. an empty completion), so fall back to "".
    return resp.choices[0].message.content or ""


# Code review (heavier model)
def review(diff: str, conventions: str) -> dict:
    resp = client.chat.completions.create(
        model="claude-opus-4-7",
        messages=[
            {"role": "system", "content": [{
                "type": "text",
                "text": conventions,  # team style guide, security policy
                "cache_control": {"type": "ephemeral"},  # cache the stable prefix
            }]},
            {"role": "user", "content": (
                f"Review this diff. Return JSON: {{ \"comments\": [{{...}}], "
                f"\"verdict\": \"approve|request_changes\"}}\n\n{diff}"
            )},
        ],
        response_format={"type": "json_object"},
        max_tokens=2000,
    )
    # Raises json.JSONDecodeError if the model returns malformed JSON.
    return json.loads(resp.choices[0].message.content)
Latency budget by feature
- Inline autocomplete: TTFT must be <200ms. Haiku or Gemini Flash. Stream the output, abort on cursor move
- Chat-with-codebase: TTFT <500ms acceptable. Sonnet + streaming. Show the response as it generates
- Multi-file edit / refactor: 5-30s is normal. Show progress, run in background, surface the diff for review
- PR review: 10-60s acceptable. Opus does the deep analysis, format as actionable comments
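The "stream the output, abort on cursor move" rule for autocomplete can be sketched as a small generator that stops consuming the stream once the editor signals cancellation. The helper name and the `is_cancelled` callback are hypothetical. The generator takes any iterable of text pieces so it can be tried without a network call.

```python
from typing import Callable, Iterable, Iterator


def stream_until_cancelled(
    pieces: Iterable[str],
    is_cancelled: Callable[[], bool],
) -> Iterator[str]:
    """Yield text pieces as they arrive, stopping once the caller cancels."""
    for piece in pieces:
        if is_cancelled():
            break  # stop consuming; the caller should also close the HTTP stream
        yield piece


# Wiring sketch (not executed here): with the OpenAI SDK, passing stream=True to
# client.chat.completions.create(...) returns an iterable of chunks, and each
# chunk's text is in chunk.choices[0].delta.content (which may be None).
```

In an editor integration, `is_cancelled` would typically check whether the cursor position or the buffer version changed since the request was sent.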
Cost economics for a code product
Realistic per-developer-per-month consumption (heavy user):
- ~500 completions/day × Haiku ~$0.0003 = $0.15/day = $4-5/month
- ~30 chat sessions/day × Sonnet ~$0.05 = $1.50/day = $30-45/month
- ~3 PR reviews/week × Opus ~$0.30 = $1/week = $4/month
- Heavy power user: ~$40-55/month in API costs
- Average user: ~$10-15/month
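The heavy-user arithmetic above can be reproduced in a few lines. The day counts are an assumption: the low end of each range appears to correspond to about 22 working days a month and the high end to 30 calendar days. The function name is hypothetical.

```python
def heavy_user_monthly_cost(days_per_month: int) -> float:
    """Monthly API cost in USD for the heavy-user profile listed above."""
    completions = 500 * 0.0003 * days_per_month  # Haiku autocomplete
    chat = 30 * 0.05 * days_per_month            # Sonnet chat sessions
    reviews = 3 * 0.30 * 4                       # Opus PR reviews, ~4 weeks/month
    return completions + chat + reviews


low = heavy_user_monthly_cost(22)   # working days only (assumption)
high = heavy_user_monthly_cost(30)  # every calendar day (assumption)
print(f"heavy user: ${low:.2f} to ${high:.2f} per month")
```

Chat sessions dominate the total, so the per-session cost of the chat model is the number to watch.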
Most code AI products price at $20/user/month. That leaves margin on the average user (~$10-15/month in API costs), but a heavy power user (~$40-55/month) costs more than they pay. Prompt caching is essential: the workspace context (file structure, conventions, recent files) goes in cache, and only the immediate edit context is fresh on each call. This saves 50-70% on input costs. See How to set up caching.
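The cached/fresh split described above can be sketched as a message builder that mirrors the `cache_control` block used in the review example. The `build_messages` helper and its parameter names are hypothetical. This assumes caching matches on an identical prefix, so the workspace context should be assembled deterministically (stable file order, no timestamps).

```python
def build_messages(workspace_context: str, edit_context: str) -> list[dict]:
    """Put the stable workspace context in a cacheable system block."""
    return [
        {"role": "system", "content": [{
            "type": "text",
            "text": workspace_context,               # file tree, conventions, recent files
            "cache_control": {"type": "ephemeral"},  # same marker as the review example
        }]},
        # Only this part changes from call to call.
        {"role": "user", "content": edit_context},
    ]
```

The returned list can be passed as `messages=` to `client.chat.completions.create(...)`.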
Architectural patterns to copy
- FIM (Fill-in-Middle) prompting: instead of "complete from here", pass before<CURSOR>after so the model uses both sides
- Stop sequences: stop=["<CURSOR>", "</file>", "```\n"] prevents the model from inventing beyond the requested completion
- Tool use for multi-file edits: don't ask the model to output 5 files in one response. Use tools=[{name: 'edit_file', ...}] so it produces structured edits you apply atomically
- Early display via streaming: show the first ~50 tokens as soon as they arrive; users can often accept or dismiss a suggestion before the full response is done
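The tool-use pattern can be sketched as a tool definition plus an applier that validates every edit before writing anything. The schema fields (`path`, `old_text`, `new_text`) and the function name are assumptions for illustration, not a documented Kunavo contract. "Atomic" here means all-or-nothing with respect to validation; it is not crash-safe across files.

```python
from pathlib import Path

# Hypothetical tool definition in the OpenAI function-tool format.
EDIT_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "edit_file",
        "description": "Replace one exact snippet in a file.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old_text": {"type": "string"},
                "new_text": {"type": "string"},
            },
            "required": ["path", "old_text", "new_text"],
        },
    },
}


def apply_edits_all_or_nothing(root: Path, edits: list[dict]) -> None:
    """Stage every edit in memory; write to disk only if all of them validate."""
    staged: dict[Path, str] = {}
    for edit in edits:
        path = root / edit["path"]
        # Reuse the staged text so several edits to one file compose.
        text = staged[path] if path in staged else path.read_text()
        if text.count(edit["old_text"]) != 1:
            raise ValueError(f"old_text must match exactly once in {edit['path']}")
        staged[path] = text.replace(edit["old_text"], edit["new_text"])
    # Nothing has been written until this point.
    for path, text in staged.items():
        path.write_text(text)
```

With the OpenAI SDK, each tool call's arguments arrive as a JSON string in `tool_call.function.arguments`, so they would be parsed with `json.loads` before being passed in as `edits`. A production version would also reject paths that escape `root`.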
Switching providers in seconds
Kunavo is OpenAI-wire-compatible. If you built your assistant on OpenAI's SDK, swap base_url and you have access to Claude and Gemini under the same code. See Calling Claude with the OpenAI SDK for the migration.
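As a rough sketch of the swap, the only client settings that change are the API key and `base_url`. The `client_settings` helper and the environment variable names are hypothetical; the base URL is the one used in the examples on this page.

```python
import os


def client_settings(provider: str) -> dict:
    """Return keyword arguments for OpenAI(...) for the chosen provider."""
    if provider == "kunavo":
        return {
            "api_key": os.environ.get("KUNAVO_API_KEY", ""),  # hypothetical variable name
            "base_url": "https://api.kunavo.com/v1",
        }
    # Default OpenAI endpoint: no base_url override needed.
    return {"api_key": os.environ.get("OPENAI_API_KEY", "")}
```

Usage would look like `client = OpenAI(**client_settings("kunavo"))`, after which only the `model` string differs between calls.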
Start: /app/signup for $2 credit (~6,000 autocomplete calls). Endpoint reference at /docs/chat and Anthropic-native Messages API at /docs/messages.