The code AI landscape, briefly

Cursor, Aider, Cline, Continue.dev, Claude Code itself — they all sit on top of the same three or four frontier LLMs. The differentiator isn't the model; it's the agentic loop around it, the file indexing, the diff application, the user experience. If you're building one of these (or a coding co-pilot inside your own developer product), this page is the model menu, the cost reality, and the latency budget.

Model menu for code workflows

Claude Haiku 4.5 — autocomplete (~$0.0001-0.0005 per completion). The latency winner; ~150ms time-to-first-token
Claude Sonnet 4.6 — chat-with-codebase, multi-file edits, agentic loops. The Cursor-level workhorse
Claude Opus 4.7 — deep review, architecture critique, security audit. Slower (P50 ~3s) but the strongest reasoner
Gemini 2.5 Pro — best for large codebases (1M context). Drop the whole monorepo into the context
GPT-5 (when enabled) — comparable quality, different stylistic preferences
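
The menu above can be expressed as a small routing table. This is only a sketch: the task names, the `max_tokens` caps, the 150,000-token escalation threshold, and the Sonnet and Gemini model IDs are assumptions made for illustration (only the Haiku and Opus IDs appear in the listing on this page), so check them against your provider's model list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str       # model ID sent in the request
    max_tokens: int  # illustrative output cap, not a recommendation from this page


# Assumed IDs for Sonnet and Gemini; verify before use.
ROUTES = {
    "autocomplete": Route("claude-haiku-4-5", 200),
    "chat": Route("claude-sonnet-4-6", 1500),
    "review": Route("claude-opus-4-7", 2000),
    "whole_repo": Route("gemini-2.5-pro", 2000),
}


def pick_route(task: str, prompt_tokens: int = 0,
               long_context_threshold: int = 150_000) -> Route:
    """Return the route for a task, escalating very large prompts to the long-context model."""
    if task not in ROUTES:
        raise ValueError(f"unknown task: {task!r}")
    # Autocomplete never escalates: latency matters more than context size there.
    if task != "autocomplete" and prompt_tokens > long_context_threshold:
        return ROUTES["whole_repo"]
    return ROUTES[task]
```

Keeping the mapping in one place makes it cheap to swap a model for one feature without touching the call sites.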

The two core flows

code_assistant.py

import json  # needed by review() below

from openai import OpenAI

client = OpenAI(api_key="sk-kn-...", base_url="https://api.kunavo.com/v1")

# Code completion (autocomplete-as-you-type)
def complete(file_content: str, cursor_offset: int, file_path: str) -> str:
    before = file_content[:cursor_offset]
    after = file_content[cursor_offset:]
    resp = client.chat.completions.create(
        model="claude-haiku-4-5",   # FAST for IDE latency
        messages=[
            {"role": "system", "content": (
                "Complete the code at the cursor. Output ONLY the inserted "
                "text — no markdown, no explanation."
            )},
            {"role": "user", "content": (
                f"<file path=\"{file_path}\">\n{before}<CURSOR>{after}\n</file>"
            )},
        ],
        max_tokens=200,
        stop=["<CURSOR>", "</file>"],
    )
    # content can be None (e.g. an empty completion), so fall back to ""
    return resp.choices[0].message.content or ""

# Code review (heavier model)
def review(diff: str, conventions: str) -> dict:
    resp = client.chat.completions.create(
        model="claude-opus-4-7",
        messages=[
            {"role": "system", "content": [{
                "type": "text",
                "text": conventions,  # team style guide, security policy
                "cache_control": {"type": "ephemeral"},
            }]},
            {"role": "user", "content": (
                f"Review this diff. Return JSON: {{ \"comments\": [{{...}}], "
                f"\"verdict\": \"approve|request_changes\"}}\n\n{diff}"
            )},
        ],
        response_format={"type": "json_object"},
        max_tokens=2000,
    )
    return json.loads(resp.choices[0].message.content)

Latency budget by feature

Inline autocomplete: TTFT must be <200ms. Haiku or Gemini Flash. Stream the output, abort on cursor move
Chat-with-codebase: TTFT <500ms acceptable. Sonnet + streaming. Show the response as it generates
Multi-file edit / refactor: 5-30s is normal. Show progress, run in background, surface the diff for review
PR review: 10-60s acceptable. Opus does the deep analysis, format as actionable comments
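
For the "stream the output, abort on cursor move" rule, one possible shape is a small generator that relays text deltas and stops as soon as the editor signals cancellation. This is a sketch, not a definitive implementation: `stream_until_cancelled` and the `cancelled` event are hypothetical names, and the chunk shape (`chunk.choices[0].delta.content`) is the one the OpenAI Python SDK yields when a chat completion is created with `stream=True`.

```python
import threading
from typing import Iterable, Iterator


def stream_until_cancelled(chunks: Iterable, cancelled: threading.Event) -> Iterator[str]:
    """Yield text deltas from a chat-completions stream until `cancelled` is set."""
    for chunk in chunks:
        if cancelled.is_set():
            # Close the underlying stream if it supports it, so the request is dropped.
            close = getattr(chunks, "close", None)
            if callable(close):
                close()
            return
        if not chunk.choices:
            continue  # some chunks (e.g. usage-only) carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

In an editor integration, the cursor-move handler would call `cancelled.set()`, and the UI loop would consume `stream_until_cancelled(client.chat.completions.create(..., stream=True), cancelled)`.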

Cost economics for a code product

Realistic per-developer-per-month consumption (heavy user):

~500 completions/day × Haiku ~$0.0003 = $0.15/day = $4-5/month
~30 chat sessions/day × Sonnet ~$0.05 = $1.50/day = $30-45/month
~3 PR reviews/week × Opus ~$0.30 = $1/week = $4/month
Heavy power user: ~$40-55/month in API costs
Average user: ~$10-15/month
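
The arithmetic above can be reproduced with a few lines, which makes it easy to plug in your own usage numbers. The per-call costs are the estimates from the list; the 22 working days per month and 52/12 weeks per month are assumptions of this sketch (the ranges above correspond to roughly 20 to 30 active days).

```python
WEEKS_PER_MONTH = 52 / 12  # assumption used to convert weekly reviews to monthly


def monthly_cost(completions_per_day: float = 500, cost_per_completion: float = 0.0003,
                 chats_per_day: float = 30, cost_per_chat: float = 0.05,
                 reviews_per_week: float = 3, cost_per_review: float = 0.30,
                 days_per_month: float = 22) -> dict:
    """Estimate monthly API spend (USD) for one developer, split by feature."""
    autocomplete = completions_per_day * cost_per_completion * days_per_month
    chat = chats_per_day * cost_per_chat * days_per_month
    review = reviews_per_week * cost_per_review * WEEKS_PER_MONTH
    return {
        "autocomplete": autocomplete,
        "chat": chat,
        "review": review,
        "total": autocomplete + chat + review,
    }


heavy = monthly_cost()                    # 22 active days
heavy_30 = monthly_cost(days_per_month=30)  # every calendar day
```

With these defaults the total lands at the low end of the heavy-user range, and at the high end when every calendar day is counted. Chat sessions dominate in both cases, so that is where model choice and caching matter most.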

Most code AI products price at $20/user/month. At that price an average user ($10-15 in API costs) leaves margin, but a heavy power user ($40-55) costs more than they pay, so prompt caching is essential. Put the workspace context (file structure, conventions, recent files) in the cache so that only the immediate edit context is fresh on each call; this saves 50-70% on input costs. See "How to set up caching" for the details.
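
To see where a saving of that size can come from, here is a back-of-the-envelope model. The price ratios are assumptions for illustration (cached reads billed at 10% of the normal input price, cache writes at 125%); check your provider's current pricing before relying on them.

```python
def input_cost_ratio(cached_fraction: float, calls_per_cache_fill: int,
                     read_ratio: float = 0.10, write_ratio: float = 1.25) -> float:
    """Average input cost per call with caching, relative to 1.0 without caching.

    cached_fraction: share of prompt tokens that belong to the cached prefix.
    calls_per_cache_fill: how many calls reuse the prefix before it must be rewritten.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if calls_per_cache_fill < 1:
        raise ValueError("calls_per_cache_fill must be at least 1")
    fresh = 1.0 - cached_fraction
    # One call pays the write price for the prefix; the remaining calls pay the read price.
    prefix = (write_ratio + (calls_per_cache_fill - 1) * read_ratio) / calls_per_cache_fill
    return fresh + cached_fraction * prefix


ratio = input_cost_ratio(cached_fraction=0.75, calls_per_cache_fill=10)
savings = 1.0 - ratio
```

Under these assumptions, a prompt that is 75% cached prefix and reused across ten calls saves a little under 60% on input spend. A prefix that is written once and never reused costs more than not caching at all, which is why only stable context belongs in the cache.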

Architectural patterns to copy

FIM (Fill-in-Middle) prompting: instead of "complete from here", pass before<CURSOR>after so the model uses both sides
Stop sequences: stop=["<CURSOR>", "</file>", "```\n"] prevents the model from inventing beyond the requested completion
Tool use for multi-file edits: don't ask the model to output 5 files in one response. Use tools=[{name: 'edit_file', ...}] so it produces structured edits you apply atomically
Progressive display via streaming: show the first ~50 tokens as soon as they arrive; users can often accept or reject a suggestion before the full response is done
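
The tool-use pattern can be sketched as follows. The `edit_file` schema (with `path`, `old`, and `new` fields) and the `apply_edits` helper are hypothetical, written in the OpenAI function-calling format; they illustrate one way to stage structured edits and are not a prescribed interface.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical tool definition, passed as tools=[EDIT_FILE_TOOL] in the request.
EDIT_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "edit_file",
        "description": "Replace one exact snippet in a file with new text.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old": {"type": "string"},
                "new": {"type": "string"},
            },
            "required": ["path", "old", "new"],
        },
    },
}


def apply_edits(root: Path, edits: list[dict]) -> list[Path]:
    """Validate every edit in memory first; write nothing if any edit fails."""
    root = root.resolve()
    staged: dict[Path, str] = {}
    for edit in edits:
        target = (root / edit["path"]).resolve()
        if not target.is_relative_to(root):  # requires Python 3.9+
            raise ValueError(f"path escapes workspace: {edit['path']}")
        text = staged.get(target)
        if text is None:
            text = target.read_text(encoding="utf-8")
        if text.count(edit["old"]) != 1:
            raise ValueError(f"'old' must match exactly once in {edit['path']}")
        staged[target] = text.replace(edit["old"], edit["new"])
    for target, text in staged.items():
        # Write to a temp file in the same directory, then rename over the original.
        fd, tmp = tempfile.mkstemp(dir=target.parent)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp, target)
    return list(staged)
```

Each tool call's JSON arguments become one entry in `edits`. Validation is all-or-nothing, while the write phase is atomic per file rather than across files; a stricter version could stage the changes on a scratch branch or copy of the workspace instead.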

Switching providers in seconds

Kunavo is OpenAI-wire-compatible. If you built your assistant on OpenAI's SDK, swap base_url and you have access to Claude and Gemini under the same code. See Calling Claude with the OpenAI SDK for the migration.

Start: /app/signup — pay-as-you-go from a $5 top-up (~15,000 autocomplete calls), balance never expires. Endpoint reference at /docs/chat and Anthropic-native Messages API at /docs/messages.

AI code assistant — building Cursor-like tools with Claude and Gemini
