Chat completions
Kunavo's /v1/chat/completions endpoint is bit-for-bit OpenAI-compatible across the Claude and Gemini families. Streaming, tools, vision, reasoning — all work via the same SDK.
Endpoint: POST /v1/chat/completions. Request and response shapes match OpenAI's chat completions API, including streaming and the optional tool_calls / reasoning_tokens fields. The one addition is the credits_consumed field in the usage object.
Basic call
# Assumes client is an OpenAI SDK client already configured for Kunavo.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": "You are a senior staff engineer."},
        {"role": "user", "content": "Pros and cons of postgres LISTEN/NOTIFY for a job queue?"},
    ],
    temperature=0.4,
    max_tokens=800,
)
print(resp.choices[0].message.content)
Parameters
Every standard OpenAI parameter is accepted. Some only make sense for certain providers — the translator passes through what each upstream supports.
| Param | Type | Notes |
|---|---|---|
| model | string (required) | Any enabled slug from /v1/models. |
| messages | array (required) | Standard OpenAI message format. |
| temperature | 0..2 | Sampling temperature. Default 1. |
| top_p | 0..1 | Nucleus sampling. Mutually exclusive with temperature in some models. |
| max_tokens | int | Output cap. Reasoning tokens count separately. |
| stream | bool | Stream chunks as SSE. See below. |
| tools | array | Function/tool definitions. Claude and Gemini both support tool use. |
| tool_choice | auto, none, or a named tool | Force a specific tool or let the model decide. |
| response_format | object | Set {"type": "json_object"} for guaranteed JSON output. |
| seed | int | Deterministic sampling where supported. |
| stop | string or array | Hard stop sequences. |
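The ranges and types in the table can be checked client-side before a request is sent. The following is a minimal sketch, not part of Kunavo's SDK: `validate_params` is a hypothetical helper, and it covers only the constraints the table states explicitly.

```python
def validate_params(params: dict) -> list[str]:
    """Return a list of problems found in a chat-completions request dict."""
    problems = []
    # model and messages are the two fields marked as required.
    for required in ("model", "messages"):
        if required not in params:
            problems.append(f"missing required field: {required}")
    temperature = params.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("temperature must be within 0..2")
    top_p = params.get("top_p")
    if top_p is not None and not 0 <= top_p <= 1:
        problems.append("top_p must be within 0..1")
    max_tokens = params.get("max_tokens")
    if max_tokens is not None and not isinstance(max_tokens, int):
        problems.append("max_tokens must be an int")
    stop = params.get("stop")
    if stop is not None and not isinstance(stop, (str, list)):
        problems.append("stop must be a string or a list of strings")
    return problems


request = {
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 2.5,  # deliberately out of range
}
print(validate_params(request))
```

A check like this only catches malformed requests early. Whether a given upstream honors a parameter is still decided by the translator.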
Streaming
Set stream=True. Kunavo emits server-sent events in OpenAI's format: each chunk is a chat.completion.chunk object with the new text in choices[0].delta.content. The final chunk carries the usage payload, and the stream then ends with data: [DONE].
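For readers consuming the stream without the SDK, the wire format can be parsed by hand. This is a rough sketch under two assumptions: the usage object rides on the last JSON chunk with an empty choices list (as in OpenAI's format), and the sample events below are illustrative rather than captured from a real response. `iter_deltas` is a hypothetical helper name.

```python
import json


def iter_deltas(sse_lines):
    """Yield (text, usage) pairs from raw SSE lines in OpenAI's chunk format."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        text = (choices[0].get("delta", {}).get("content") if choices else None) or ""
        yield text, chunk.get("usage")


# Illustrative sample events (assumed shapes, not real output).
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "B-trees "}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "stay balanced."}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 4, "total_tokens": 13}}',
    "",
    "data: [DONE]",
]

text_parts, usage = [], None
for text, chunk_usage in iter_deltas(sample):
    text_parts.append(text)
    usage = chunk_usage or usage  # keep the last usage object seen
full_text = "".join(text_parts)
print(full_text)
print(usage)
```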
stream = client.chat.completions.create(
    model="gemini-3-pro",
    messages=[{"role": "user", "content": "Explain B-trees in one paragraph."}],
    stream=True,
)
for chunk in stream:
    # Skip chunks with no choices (in OpenAI's format, a usage-only chunk has none).
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
Tool / function calling
Tool calling works across providers. Define tools as JSON schema; the model returns tool_calls in its message; you execute and feed results back as role: "tool" messages.
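The last step of that loop, executing a requested call and feeding the result back, can be sketched offline. The tool implementation, the `TOOL_REGISTRY` mapping, and the sample call below are all hypothetical. The message shape (role "tool" plus the matching tool_call_id) follows OpenAI's chat format.

```python
import json


def get_current_weather(city: str, unit: str = "c") -> dict:
    """Hypothetical local tool; returns canned data instead of calling a weather API."""
    return {"city": city, "temperature": 18, "unit": unit}


# Maps tool names the model may request to local callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each requested call and build the role="tool" messages to send back."""
    messages = []
    for call in tool_calls:
        func = TOOL_REGISTRY[call["function"]["name"]]
        # Arguments arrive as a JSON-encoded string, so decode before calling.
        args = json.loads(call["function"]["arguments"])
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(func(**args)),
        })
    return messages


# Illustrative tool call, shaped like one entry of message.tool_calls as a dict.
fake_calls = [{
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_current_weather", "arguments": '{"city": "Tokyo"}'},
}]
tool_messages = run_tool_calls(fake_calls)
print(tool_messages)
```

In a real exchange, append the assistant message that contained the tool_calls, then these tool messages, and call the endpoint again so the model can produce its final answer.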
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)
# Inspect tool calls the model wants to make
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
Vision / multimodal input
Models with the vision capability accept image content blocks. Use either an HTTPS URL or a data: base64 URI.
# Pass an image URL or a base64 data URI as part of a multimodal message
resp = client.chat.completions.create(
    model="gemini-3-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/cat.jpg"
            }},
        ],
    }],
)
print(resp.choices[0].message.content)
Vision-capable models in the catalog: Claude Opus / Sonnet / Haiku 4.x, Gemini 3 Pro / 3 Flash, Gemini 2.5 Pro.
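The data: URI option can be built from raw bytes with the standard library. A minimal sketch: `to_image_block` is a hypothetical helper, and the three placeholder bytes stand in for a real image file.

```python
import base64


def to_image_block(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an image_url content block using a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


# Placeholder bytes; in practice read them from disk, e.g. open("cat.jpg", "rb").read()
block = to_image_block(b"\xff\xd8\xff", "image/jpeg")
print(block["image_url"]["url"])  # data:image/jpeg;base64,/9j/
```

The returned block can take the place of the URL-based image_url entry in the content list of a user message.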
Reasoning tokens
Models with extended thinking (Claude Opus 4.7, Sonnet 4.6, Gemini 3 Pro) emit reasoning tokens in addition to visible output. They're billed at the output rate. Inspect them via usage.completion_tokens_details.reasoning_tokens.
# Claude's "extended thinking" mode + Gemini "thinking" mode both surface
# reasoning_tokens in usage. You're billed for them at the output rate.
resp = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "Plan a 3-week MLOps migration."}],
)
print(resp.usage.completion_tokens_details.reasoning_tokens)
Prompt caching
A long system prompt, a reference document, a few-shot block — any stable prefix can be cached upstream and replayed on later calls at a fraction of the input price. Cache hits surface in the usage object as prompt_tokens_details.cached_tokens.
Gemini and GPT models cache automatically — no request change needed. cached_tokens is a subset of prompt_tokens and is billed at a reduced cache-read rate.
Claude caches only the prefix you mark with a cache_control breakpoint. Through this OpenAI-compatible endpoint, attach it to a content block:
# Claude caches prefixes you mark with cache_control. Attach it to a content
# block; later calls reusing that prefix read it at ~10% of the input price.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LONG_DOCUMENT,  # the stable, reused prefix
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": "Summarize the document above."},
        ],
    }],
)
print(resp.usage.prompt_tokens_details.cached_tokens)
For the cache_creation_input_tokens / cache_read_input_tokens usage fields, use the native Messages API.
Usage object
Every response (and the final streaming chunk) includes a usage object:
| Field | Meaning |
|---|---|
| prompt_tokens | Input tokens we billed, cached tokens included. |
| prompt_tokens_details.cached_tokens | Cached input: a subset of prompt_tokens, billed at a reduced cache-read rate. |
| completion_tokens | Visible output tokens. |
| completion_tokens_details.reasoning_tokens | Reasoning tokens (billed at output rate). |
| total_tokens | Sum of input + output + reasoning. |
| credits_consumed | Kunavo addition. Raw cost in kie credits (1 credit = $0.005). |
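The fields above combine into a few numbers worth logging per request. The sketch below reads the table literally: completion_tokens is visible output only, reasoning tokens are added on top, and one credit is $0.005. `summarize_usage` is a hypothetical helper and the sample values are invented for illustration.

```python
CREDIT_USD = 0.005  # 1 credit = $0.005, per the table above


def summarize_usage(usage: dict) -> dict:
    """Derive cache and cost figures from a usage object given as a plain dict."""
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    reasoning = (usage.get("completion_tokens_details") or {}).get("reasoning_tokens", 0)
    prompt = usage["prompt_tokens"]
    return {
        # cached_tokens is a subset of prompt_tokens, so subtract to get the rest.
        "uncached_prompt_tokens": prompt - cached,
        "cache_hit_ratio": cached / prompt if prompt else 0.0,
        # Visible output plus reasoning, both billed at the output rate.
        "output_rate_tokens": usage["completion_tokens"] + reasoning,
        "cost_usd": usage["credits_consumed"] * CREDIT_USD,
    }


# Invented sample values, not taken from a real response.
sample_usage = {
    "prompt_tokens": 1200,
    "prompt_tokens_details": {"cached_tokens": 900},
    "completion_tokens": 300,
    "completion_tokens_details": {"reasoning_tokens": 500},
    "total_tokens": 2000,
    "credits_consumed": 4,
}
summary = summarize_usage(sample_usage)
print(summary)
```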