Knowledge Base

RAG chatbot API — Claude-powered knowledge-base assistant in production

Most internal knowledge bases are dead documentation — nobody finds anything. A Claude-backed RAG chatbot turns them into a real assistant that cites sources and refuses when it doesn't know. Here's the production pattern.

The default modern KB chatbot architecture

Traditional KB search returns links; the user reads them. RAG turns that into a one-shot conversational answer with citations. Done well, users get answers in 2 seconds instead of digging through 5 documents. Done badly, the chatbot hallucinates and your CTO bans the project for a year. This page covers what separates the two.

The minimum viable production stack

  1. Vector store: pgvector if you already have Postgres, Pinecone or Qdrant for managed
  2. Embeddings: text-embedding-3-large via Kunavo ($0.10/1M tokens)
  3. Retrieval: hybrid (vector + BM25 with reciprocal rank fusion)
  4. Generation: Claude Sonnet 4.6 with cache_control on the system prompt
  5. UI: streaming responses, citation rendering, "I don't know" fallback
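rag_chat.py

from openai import OpenAI

# Kunavo is called through the OpenAI SDK by overriding base_url.
# hybrid_retrieve() and parse_citations() are your own helpers; they are not defined in this snippet.
client = OpenAI(api_key="sk-kunavo-...", base_url="https://api.kunavo.com/v1")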

def chat(question: str, history: list[dict]) -> dict:
    chunks = hybrid_retrieve(question, k=5)  # vector + BM25
    context = "\n\n---\n\n".join(
        f"[doc:{c['id']}] {c['text']}" for c in chunks
    )
    resp = client.chat.completions.create(
        model="claude-sonnet-4-6",
        messages=[
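            # Note: Anthropic prompt caching only applies above a model-specific minimum
            # prefix length, so a system prompt this short may not be cached on its own.
            {"role": "system", "content": [{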
                "type": "text",
                "text": (
                    "Answer only from Context. Cite [doc:N] for each claim. "
                    "If Context doesn't answer, say 'I don't have that.' "
                    "Be concise, no throat-clearing."
                ),
                "cache_control": {"type": "ephemeral"},
            }]},
            *history,
            {"role": "user", "content": f"# Context\n{context}\n\n# Q\n{question}"},
        ],
        max_tokens=600,
    )
    answer = resp.choices[0].message.content
    cited_ids = parse_citations(answer)  # extract [doc:N] references
    return {"answer": answer, "sources": [c for c in chunks if c["id"] in cited_ids]}
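
The fusion step behind hybrid retrieval can be sketched as follows. This is a minimal illustration, assuming you already have two ranked lists of doc ids, one from vector search and one from BM25. The name `rrf_fuse`, the integer ids, and the constant k=60 (a conventional default for reciprocal rank fusion, not a tuned value) are assumptions for the example, not part of any Kunavo API.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[int]], k: int = 60, top_n: int = 5) -> list[int]:
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


vector_hits = [1, 2, 3]  # hypothetical ids returned by vector search
bm25_hits = [2, 4, 1]    # hypothetical ids returned by BM25
fused = rrf_fuse([vector_hits, bm25_hits])  # [2, 1, 4, 3]
```

Docs that rank well in both lists rise to the top, which is why doc 2 beats doc 1 here even though doc 1 won the vector search. Because the method uses ranks only, you never have to normalize BM25 scores against cosine similarities.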

Cost at production scale

  • Initial indexing: ~$0.25 for 5,000 docs of 500 tokens each
  • 1,000 queries/day: ~$210/month with caching
  • 10,000 queries/day: ~$2,100/month
  • Haiku 4.5 instead of Sonnet: ~4x cheaper, at roughly 85% of Sonnet's answer quality (confirm on your own eval set)
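
The indexing figure follows directly from the embedding price quoted earlier. The per-query figure below is simply backed out of the monthly estimate, assuming a 30-day month; it is not a published price. A short worked check, with hypothetical function names:

```python
EMBED_PRICE_PER_M_TOKENS = 0.10  # USD per 1M tokens, the embedding price quoted above


def indexing_cost(num_docs: int, tokens_per_doc: int) -> float:
    """One-off embedding cost in USD for the whole corpus."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * EMBED_PRICE_PER_M_TOKENS


def implied_cost_per_query(monthly_usd: float, queries_per_day: int, days: int = 30) -> float:
    """Average USD per query implied by a monthly bill (assumes a 30-day month)."""
    return monthly_usd / (queries_per_day * days)


index_usd = indexing_cost(5_000, 500)                # 2.5M tokens, about 0.25 USD
per_query_usd = implied_cost_per_query(210, 1_000)   # about 0.007 USD per query
```

Swap in your own doc count and average doc length before budgeting; real corpora rarely average exactly 500 tokens per doc.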

Full architectural breakdown in the RAG implementation guide. Language-specific tuning notes in the Japanese RAG deep dive and Spanish RAG guide.

The three patterns that actually prevent hallucination

  • Cite every claim: [doc:42] tags in the model output. If the cited id isn't in the retrieved set, it hallucinated — block and log
  • Explicit refusal in system prompt: "If Context doesn't answer, say 'I don't have that.'" Without this, the model fills in from world knowledge
  • Output cap of 600 tokens: short answers are usually accurate answers. Longer outputs are where extra inventions sneak in
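
The "block and log" rule from the first bullet can be sketched like this. It is a minimal example, assuming doc ids are integers in the [doc:N] format; `check_answer` and the logger name are hypothetical, and what "block" means (retry, fallback message, escalation) is left to your application.

```python
import logging
import re

log = logging.getLogger("rag_chat")  # hypothetical logger name
CITATION_RE = re.compile(r"\[doc:(\d+)\]")


def parse_citations(answer: str) -> set[int]:
    """Return the set of doc ids cited as [doc:N] in the answer."""
    return {int(m) for m in CITATION_RE.findall(answer)}


def check_answer(answer: str, retrieved_ids: set[int]) -> tuple[bool, set[int]]:
    """Return (ok, unknown_ids); ok is False if any cited id was never retrieved."""
    unknown = parse_citations(answer) - retrieved_ids
    if unknown:
        log.warning("blocked answer citing unretrieved docs: %s", sorted(unknown))
    return (not unknown, unknown)


# Doc 7 was never retrieved, so this answer is rejected.
ok, unknown = check_answer("Resets take 24h [doc:42]. See also [doc:7].", {42, 13})
```

This only catches invented ids. An answer with no citations at all passes the check, so you may also want to reject non-refusal answers whose citation set is empty.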

What to ship in week 1 vs week 4

Week  Milestone
1     100 docs, single chunking strategy, basic vector search, 10-question eval set, ~70% recall@5
2     Full corpus, hybrid retrieval, 100-question eval set, citations enforced, internal beta
3     Tune chunking based on failed questions, ship to internal users, measure CSAT
4     Monitoring + cost dashboard, daily spend cap, public beta or production launch
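
The recall@5 target in week 1 needs a definition before you can measure it. A minimal sketch, taking recall@k as hit rate: a question counts as a hit if at least one of its gold docs appears in the top k results. Some teams instead average the fraction of gold docs retrieved, so treat this definition, the function name, and the sample ids as assumptions.

```python
def recall_at_k(retrieved: list[list[int]], relevant: list[set[int]], k: int = 5) -> float:
    """Fraction of questions whose top-k results contain at least one gold doc."""
    if len(retrieved) != len(relevant):
        raise ValueError("need one retrieved list and one gold set per question")
    if not retrieved:
        return 0.0
    hits = sum(
        1 for got, gold in zip(retrieved, relevant) if gold & set(got[:k])
    )
    return hits / len(retrieved)


# Three hypothetical eval questions: ranked retrieved ids and hand-labeled gold ids.
retrieved = [[3, 9, 1], [4, 5, 6], [7, 8, 2]]
relevant = [{1}, {10}, {7, 11}]
score = recall_at_k(retrieved, relevant)  # 2 of 3 questions hit
```

Keep the misses, not just the score: the failed questions are what drive the chunking changes in week 3.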

Start at /app/signup

At the rates quoted above (500-token docs, about $0.007 per query), the $2 free credit covers indexing roughly 5,000 docs (~$0.25) plus about 250 test queries, which is enough for a functional prototype. Then read the complete RAG guide for production patterns.