
RAG Chatbots for Your Knowledge Base: A Practical Guide

Your company has thousands of pages of internal documentation. Wiki articles, PDFs, onboarding guides, SOPs, Slack archives, Google Docs. Nobody can find anything. Your team keeps asking the same five senior people the same ten questions.

You want to build "ChatGPT for our docs." Someone told you the term is RAG. Here's what RAG actually is, what works, what breaks, and what it costs.

What RAG Actually Is

RAG stands for retrieval-augmented generation. It's a pattern where:

  1. You chunk your documents into small pieces.
  2. You turn each chunk into a vector (an "embedding")—a list of numbers that represents the meaning of the text.
  3. You store those vectors in a vector database.
  4. When a user asks a question, you turn the question into a vector too, find the most similar chunks from your database, and pass them to an LLM as context.
  5. The LLM reads the retrieved chunks and writes an answer grounded in them.

That's it. The magic word is "grounded." The LLM doesn't make things up from its training data—it answers using your specific documents.
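The five steps above fit in a few dozen lines. Here's a minimal sketch: the word-count "embedding," the hand-written chunks, and the stubbed-out final step are stand-ins for a real embedding model, a vector database, and an LLM call.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: word counts. A real system would call an
    # embedding model here and get back a dense vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-3: chunk the docs and store each chunk with its vector.
chunks = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Onboarding: new hires get laptop access on day one.",
    "Support hours are 9am to 5pm Eastern, Monday to Friday.",
]
store = [(c, embed(c)) for c in chunks]

def retrieve(question: str, top_n: int = 2) -> list[str]:
    # Step 4: embed the question, rank chunks by similarity.
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:top_n]]

# Step 5 would send the question plus these chunks to an LLM.
print(retrieve("What is the refund policy?")[0])
```

Everything hard about RAG lives in the parts this sketch fakes, which is what the rest of this post is about.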

Why RAG Instead of Just Fine-Tuning a Model

Three reasons:

  1. Your docs change. With RAG, updating an answer means updating the document and re-embedding it, which takes minutes. Fine-tuning means retraining every time the content changes.
  2. Fine-tuning is bad at facts. It shapes a model's style and behavior, but it doesn't reliably teach it to recall specific details from your documents.
  3. RAG gives you citations. Because the answer is built from retrieved chunks, you can show users exactly which document it came from. A fine-tuned model can't tell you where its answer originated.

For "answer questions about our docs" use cases, RAG is almost always the right answer.

What a Real RAG System Looks Like

The architecture in production:

User question
      |
      v
Embed question into vector
      |
      v
Query vector DB for top N similar chunks
      |
      v
(Optional: re-rank chunks for quality)
      |
      v
Send question + chunks to LLM
      |
      v
LLM generates answer grounded in chunks
      |
      v
Return answer with citations

Each arrow is a decision point. Each decision can make or break the system.

Where RAG Systems Break

Bad Chunking

You have a 60-page PDF. If you chunk it in 500-character pieces, you break paragraphs mid-sentence and the retrieval finds fragments without context. If you chunk in 5000-character pieces, the retrieved chunks are so big that the LLM loses track of what matters.

The right chunk size depends on your docs. For most knowledge bases: 800–1500 characters with ~200 characters of overlap between chunks. For highly structured docs (API specs, product manuals), respect section boundaries instead of fixed sizes.
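A sketch of that approach: fixed-size character windows with overlap, preferring to break at a paragraph or sentence boundary. The sizes and the boundary list are illustrative; production systems often chunk by tokens instead of characters.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of roughly `size` characters,
    overlapping by `overlap`, preferring natural boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Back up to the last paragraph break or sentence end in
            # the window so we don't cut a sentence in half.
            for boundary in ("\n\n", ". "):
                cut = text.rfind(boundary, start, end)
                if cut > start:
                    end = cut + len(boundary)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` but always make forward progress.
        start = end - overlap if end - overlap > start else end
    return chunks
```

The overlap means a sentence that falls near a chunk boundary shows up in both neighbors, so retrieval can find it either way.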

Bad Retrieval

The user asks, "What's our refund policy?" The retrieval returns the five chunks with the highest vector similarity. But the chunk that actually contains the refund policy is ranked #8. Now the LLM is answering from chunks about other policies.

Causes: embedding model too generic, chunks missing section headers or titles, queries too short, vector DB not tuned for scale.

Fixes:

  - Prepend each chunk's document title and section header to its text before embedding, so chunks carry their own context.
  - Use hybrid search: combine vector similarity with keyword matching (BM25), so exact terms like "refund" can't be missed.
  - Retrieve more candidates (top 20) and re-rank them before picking the final five.
  - Try a stronger or domain-tuned embedding model; generic models miss specialized vocabulary.

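To illustrate the hybrid-search idea, here's a toy scorer that blends vector similarity with exact keyword overlap. The blend weight and the overlap metric are placeholders; real systems typically use BM25 plus a tuned fusion method.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_score(query: str, chunk: str) -> float:
    # Fraction of query terms that appear verbatim in the chunk.
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

def hybrid_score(vector_sim: float, query: str, chunk: str,
                 alpha: float = 0.7) -> float:
    """Blend vector similarity with keyword overlap. `alpha` weights
    the vector score; 0.7 here is an arbitrary starting point."""
    return alpha * vector_sim + (1 - alpha) * keyword_score(query, chunk)
```

The point: a chunk that literally contains "refund" gets a boost even when its embedding lands at rank #8, which is exactly the failure above.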
Hallucination

The LLM can't find a good answer in the retrieved chunks, so it makes one up. This is the failure mode that kills trust.

Fixes:

  - Instruct the model explicitly: answer only from the provided context, and say so when the context doesn't contain the answer.
  - Require citations. If the model can't point to a specific chunk, don't show the answer.
  - Set a retrieval threshold: if no chunk scores above a minimum similarity, skip the LLM entirely and tell the user the docs don't cover it.

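The first two fixes live in the prompt. Here's one way to assemble a grounded prompt; the exact wording is a starting point to iterate on, not a magic incantation.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt. The explicit permission to say
    'I don't know' is the main defense against hallucination."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources by their [number] after each claim. "
        "If the context does not contain the answer, reply exactly: "
        "\"The docs don't cover this.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks is what makes citations possible downstream: the model's [2] maps back to a real document you can link.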
Stale or Wrong Docs

The AI answers correctly from your docs. But your docs are wrong. The refund policy changed six months ago and nobody updated the wiki.

RAG doesn't fix your documentation. It amplifies whatever's there. If your docs are stale, RAG will confidently tell users the wrong thing.

Fix: audit and update the docs before you build the RAG system. Otherwise the RAG system becomes an expensive version of bad information.

User Questions You Didn't Expect

Users ask things your docs don't cover. "What's our policy on X?" where X was never documented. RAG returns irrelevant chunks and the LLM tries to answer anyway.

Fix: log unanswered questions. Review weekly. Either update the docs, route the question to a human, or both.
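The logging side is trivial to build, which is why there's no excuse to skip it. A sketch (the schema and filename are just one reasonable choice):

```python
import json
from datetime import datetime, timezone

def log_unanswered(question: str, top_score: float,
                   path: str = "unanswered.jsonl") -> None:
    """Append questions the system couldn't answer, for weekly review.
    `top_score` is the best retrieval similarity: consistently low
    scores flag gaps in the docs rather than retrieval bugs."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "top_score": top_score,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

A week of this file tells you more about your documentation gaps than any amount of up-front planning.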

Cost Structure

Running RAG at scale costs three things:

1. Embeddings (One-Time + Incremental)

Embedding a document corpus once costs pennies. OpenAI's text-embedding-3-small is around $0.02 per million tokens. A 1000-document knowledge base might cost $1–5 to embed.
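The arithmetic is simple enough to sanity-check yourself. The price below is OpenAI's published rate for text-embedding-3-small; the corpus size is an assumed figure for 1,000 long documents, so swap in your own numbers.

```python
# Back-of-envelope embedding cost.
price_per_token = 0.02 / 1_000_000   # text-embedding-3-small rate
total_tokens = 50_000_000            # assumed: ~50K tokens per doc
cost = total_tokens * price_per_token
print(f"${cost:.2f}")                # one-time cost for the corpus
```

Even at fifty million tokens, embedding is a rounding error. The ongoing cost is only re-embedding documents as they change.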

2. Vector Database (Monthly)

Managed vector databases (Pinecone and similar) typically run tens to a few hundred dollars a month. Self-hosted options (pgvector on a Postgres instance you already operate, Qdrant, Chroma) cost only the compute they sit on.

For under 100K documents, the self-hosted path is usually fine and cheaper.

3. LLM Inference (Per Query)

Every user question hits an LLM. GPT-4o-mini or Claude Haiku is usually sufficient for RAG.

Typical cost per query: $0.001–$0.02 depending on model and context size.

For a system handling 10K queries/month, expect $10–$200/month in LLM costs.
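Per-query math, using GPT-4o-mini-class pricing (roughly $0.15 per million input tokens and $0.60 per million output tokens at the time of writing; check current rates) and assumed token counts:

```python
# Rough per-query LLM cost for a RAG answer.
input_tokens = 6_000    # assumed: question + five retrieved chunks
output_tokens = 500     # assumed: the generated answer
per_query = input_tokens * 0.15e-6 + output_tokens * 0.60e-6
monthly = per_query * 10_000
print(f"${per_query:.4f}/query, ${monthly:.0f}/month at 10K queries")
```

Note that input tokens dominate: retrieving fewer, better chunks cuts cost and usually improves answers at the same time.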

Total run rate for a typical internal RAG system: $50–$500/month. Occasionally higher for large organizations.

When to Build vs Use Off-the-Shelf

Off-the-shelf options for "chat with your docs":

  - AI built into the tool your docs already live in (Notion AI, Slack AI, Atlassian Intelligence for Confluence)
  - Enterprise search products like Glean
  - Open-source RAG stacks you deploy and configure rather than build from scratch

Build custom if:

  - Your docs span many sources and formats that no single tool ingests well
  - You need control over chunking, retrieval, and answer behavior
  - You're embedding the chatbot in existing internal software

Buy off-the-shelf if:

  - Your docs already live in one tool that offers AI search
  - You don't have engineering capacity to tune and maintain a pipeline
  - Speed to launch matters more than a perfect answer rate

How Long It Actually Takes

For a custom RAG chatbot on a well-organized knowledge base:

  - Week 1: ingest, chunk, and embed the docs; stand up the vector database
  - Week 2: build retrieval and tune it against real questions
  - Week 3: prompt the LLM; add citations and fallbacks
  - Week 4: internal testing, feedback logging, rollout

About a month for a working system. Longer if your docs are a mess, if you need custom UI, or if you're integrating with existing software.

The Honest Bottom Line

RAG chatbots work. They save real time for companies with lots of internal documentation. But most RAG projects I see fail not because of the technology—they fail because:

  1. The docs weren't ready (stale, incomplete, contradictory)
  2. The retrieval wasn't tuned (all defaults, no experimentation)
  3. Users weren't onboarded (nobody knew it existed)
  4. Feedback loops weren't built (no way to improve)

Solve those four and RAG delivers. Skip them and RAG becomes an expensive disappointment.

Building a RAG chatbot for your knowledge base?

I build custom RAG systems on top of messy real-world documentation. Clean architecture, grounded answers, citations, honest fallbacks when the docs don't have the answer.

Schedule a Call →