Your company has thousands of pages of internal documentation. Wiki articles, PDFs, onboarding guides, SOPs, Slack archives, Google Docs. Nobody can find anything. Your team keeps asking the same five senior people the same ten questions.
You want to build "ChatGPT for our docs." Someone told you the term is RAG. Here's what RAG actually is, what works, what breaks, and what it costs.
What RAG Actually Is
RAG stands for retrieval-augmented generation. It's a pattern where:
- You chunk your documents into small pieces.
- You turn each chunk into a vector (an "embedding")—a list of numbers that represents the meaning of the text.
- You store those vectors in a vector database.
- When a user asks a question, you turn the question into a vector too, find the most similar chunks from your database, and pass them to an LLM as context.
- The LLM reads the retrieved chunks and writes an answer grounded in them.
That's it. The magic word is "grounded." Instead of improvising from its training data, the LLM answers using your specific documents.
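The five steps above can be sketched end to end in a few lines. This is a toy illustration: the `embed` function here is a stand-in bag-of-words counter, where a real system would call an embedding model, and the "vector database" is a plain list.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system would
    # call an embedding model instead (step 2 of the pattern).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-3: chunk the docs and "index" their vectors (here, a plain list).
chunks = [
    "Our refund policy: refunds are available within 30 days of purchase.",
    "Employees accrue 20 vacation days per year.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Step 4: embed the question and retrieve the most similar chunk.
question = "What is the refund policy?"
q_vec = embed(question)
best_chunk, _ = max(index, key=lambda pair: cosine(q_vec, pair[1]))

# Step 5: in production you'd send question + retrieved chunks to an LLM;
# here we just build the prompt that would be sent.
prompt = f"Answer from this context only:\n{best_chunk}\n\nQ: {question}"
```

Every production RAG stack is an elaboration of exactly this loop.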
Why RAG Instead of Just Fine-Tuning a Model
Three reasons:
- Fresher data. RAG reads your docs every time. Fine-tuning freezes them at training time.
- Cheaper. Fine-tuning costs thousands per round. RAG costs pennies per query.
- Better traceability. RAG can cite which document an answer came from. A fine-tuned model can't point to a source.
For "answer questions about our docs" use cases, RAG is almost always the right answer.
What a Real RAG System Looks Like
The architecture in production:
User question
|
v
Embed question into vector
|
v
Query vector DB for top N similar chunks
|
v
(Optional: re-rank chunks for quality)
|
v
Send question + chunks to LLM
|
v
LLM generates answer grounded in chunks
|
v
Return answer with citations
Each arrow is a decision point. Each decision can make or break the system.
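One way to see those decision points is as pluggable parameters of a single query function. Everything here (`Hit`, the `search` and `llm` callables, the 0.3 threshold) is a hypothetical stand-in, not any specific library's API:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    source: str
    score: float  # similarity score from the vector DB

def answer_question(question, search, llm, rerank=None, min_score=0.3):
    """Each argument is one decision point from the diagram: `search`
    embeds the question and queries the vector DB, `rerank` optionally
    reorders candidates, `min_score` guards low-confidence retrievals,
    and `llm` generates the grounded answer."""
    hits = search(question, 5)               # top N similar chunks
    if rerank:
        hits = rerank(question, hits)        # optional quality re-ranking
    hits = [h for h in hits if h.score >= min_score]
    if not hits:
        return "I can't find that in the docs.", []
    context = "\n\n".join(h.text for h in hits)
    answer = llm(f"Answer only from this context:\n{context}\n\nQ: {question}")
    return answer, [h.source for h in hits]  # answer plus citations

# Wiring it up with stand-in components:
stub_hits = [Hit("Refunds are available within 30 days.", "policy.md", 0.82)]
answer, sources = answer_question(
    "What's our refund policy?",
    search=lambda q, n: stub_hits,
    llm=lambda p: "Refunds are available within 30 days. [policy.md]",
)
```

Swapping one component (a better embedding model, a re-ranker, a stricter threshold) changes behavior without touching the rest of the pipeline, which is why each arrow deserves its own experiment.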
Where RAG Systems Break
Bad Chunking
You have a 60-page PDF. If you chunk it in 500-character pieces, you break paragraphs mid-sentence and the retrieval finds fragments without context. If you chunk in 5000-character pieces, the retrieved chunks are so big that the LLM loses track of what matters.
The right chunk size depends on your docs. For most knowledge bases: 800–1500 characters with ~200 characters of overlap between chunks. For highly structured docs (API specs, product manuals), respect section boundaries instead of fixed sizes.
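A minimal fixed-size chunker with overlap, along the lines described above. The defaults mirror the suggested ranges; backing up to the last whitespace is one simple way to avoid mid-word splits, not the only one:

```python
def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap, breaking on whitespace so
    chunks don't end mid-word. Tune size/overlap to your corpus."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space  # break at the last whitespace, not mid-word
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # overlap, but always advance
    return chunks
```

For structured docs you'd replace the fixed window with splits on section headings, keeping this only as a fallback for oversized sections.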
Bad Retrieval
The user asks, "What's our refund policy?" The retrieval returns the five chunks with the highest vector similarity. But the chunk that actually contains the refund policy is ranked #8. Now the LLM is answering from chunks about other policies.
Causes:
- Embedding model too generic
- Chunks missing section headers or titles
- Queries too short
- Vector DB not tuned for scale
Fixes:
- Add metadata to chunks so you can filter by document type, date, department
- Use hybrid search (vector + keyword) rather than pure vector similarity
- Add a re-ranker that reorders the top candidates using a more accurate (and slower) model
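Hybrid search usually merges the vector and keyword result lists with reciprocal rank fusion (RRF), a common technique that needs only rank positions, not comparable scores. A minimal sketch, with made-up chunk IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked lists (e.g. one
    from vector search, one from keyword search) into one ranking.
    Items near the top of any list collect the most credit; k=60 is
    the value commonly used in the literature."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_a", "chunk_c", "chunk_b"]   # ranked by similarity
keyword_hits = ["chunk_b", "chunk_a", "chunk_d"]  # ranked by keyword match
fused = rrf([vector_hits, keyword_hits])
```

`chunk_a` wins because it ranks well in both lists, which is exactly the behavior you want: chunks that match on both meaning and wording float to the top.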
Hallucination
The LLM can't find a good answer in the retrieved chunks, so it makes one up. This is the failure mode that kills trust.
Fixes:
- Explicit instructions in the prompt: "Only answer using the provided context. If the answer isn't in the context, say you don't know."
- Require citations: "Every factual claim must cite the source chunk."
- Show citations to the user: transparency lets them verify.
- Reject low-confidence retrievals: if the top chunk's similarity score is below a threshold, return "I can't find that in the docs" rather than guess.
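The last three fixes combine naturally into one guard around the LLM call. The prompt wording and the 0.35 threshold here are illustrative only; thresholds are corpus-specific and have to be tuned against your own retrieval scores:

```python
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the answer is not in the context, reply exactly: "I don't know."
Cite a source for every factual claim, e.g. [policy.md].

Context:
{context}

Question: {question}"""

MIN_SIMILARITY = 0.35  # illustrative; tune against your own score distribution

def guarded_answer(question: str, hits: list[dict], llm) -> str:
    # Reject low-confidence retrievals before the LLM gets a chance to guess.
    confident = [h for h in hits if h["score"] >= MIN_SIMILARITY]
    if not confident:
        return "I can't find that in the docs."
    # Prefix each chunk with its source so the LLM can cite it.
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in confident)
    return llm(GROUNDED_PROMPT.format(context=context, question=question))
```

Note the threshold check happens before the LLM is ever called: a cheap guard on the retrieval side prevents the expensive failure on the generation side.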
Stale or Wrong Docs
The AI answers correctly from your docs. But your docs are wrong. The refund policy changed six months ago and nobody updated the wiki.
RAG doesn't fix your documentation; it amplifies whatever's there. If your docs are stale, RAG will confidently tell users the wrong thing.
Fix: audit and update the docs before you build the RAG system. Otherwise the RAG system becomes an expensive version of bad information.
User Questions You Didn't Expect
Users ask things your docs don't cover. "What's our policy on X?" where X was never documented. RAG returns irrelevant chunks and the LLM tries to answer anyway.
Fix: log unanswered questions. Review weekly. Either update the docs, route the question to a human, or both.
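A sketch of that logging, assuming a simple JSONL file; in production this might be a database table or an analytics event instead:

```python
import datetime
import json

def log_unanswered(question: str, top_score: float,
                   path: str = "unanswered.jsonl") -> None:
    """Append every question the system couldn't answer, for weekly review."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "question": question,
        "top_score": top_score,  # how close retrieval got; spots near-misses
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

The weekly review then becomes a sorted scan of this file: clusters of similar unanswered questions tell you exactly which doc to write next.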
Cost Structure
Running RAG at scale costs three things:
1. Embeddings (One-Time + Incremental)
Embedding a document corpus is cheap. OpenAI's text-embedding-3-small runs around $0.02 per million tokens, so a 1000-document knowledge base might cost $1–5 to embed.
2. Vector Database (Monthly)
- Self-hosted (Postgres + pgvector, Qdrant, Weaviate): free software, pay for hosting. ~$20–$200/month.
- Managed (Pinecone, Weaviate Cloud): $70–$500/month for typical usage.
For under 100K documents, the self-hosted path is usually fine and cheaper.
3. LLM Inference (Per Query)
Every user question hits an LLM. GPT-4o-mini or Claude Haiku is usually sufficient for RAG.
Typical cost per query: $0.001–$0.02 depending on model and context size.
For a system handling 10K queries/month, expect $10–$200/month in LLM costs.
Total run rate for a typical internal RAG system: $50–$500/month. Occasionally higher for large organizations.
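Putting the article's numbers together as a back-of-envelope estimate. The per-document token count and per-query cost are assumptions you'd replace with your own, and prices should be checked against current provider rates:

```python
# One-time embedding cost (assumption: ~50K tokens per document).
docs, tokens_per_doc = 1_000, 50_000
embed_price = 0.02  # $ per million tokens, e.g. text-embedding-3-small
embed_cost = docs * tokens_per_doc / 1_000_000 * embed_price

# Monthly run rate: per-query LLM cost plus vector DB hosting.
queries = 10_000
cost_per_query = 0.005  # mid-range of the $0.001-$0.02 figure
db_hosting = 50         # self-hosted estimate from the range above
monthly = queries * cost_per_query + db_hosting
```

With these assumptions, embedding costs about $1 one-time and the run rate lands around $100/month, squarely inside the $50–$500 range.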
When to Build vs Use Off-the-Shelf
Off-the-shelf options for "chat with your docs":
- ChatGPT with Custom GPTs / Projects — decent for small, static knowledge bases. Limited control.
- Glean, Notion AI, Mem.ai — SaaS RAG on top of your existing tools. Good UX, per-seat pricing, data stays with them.
- LangChain / LlamaIndex + OpenAI — DIY but well-documented.
Build custom if:
- You have specific doc formats the off-the-shelf tools don't handle (engineering specs, custom PDFs, internal databases)
- You need tight integration with existing software (Slack bot, internal dashboard, customer-facing)
- You have enough volume that per-seat SaaS becomes expensive
- You have compliance requirements that keep data in your infrastructure
Buy off-the-shelf if:
- Your docs live in standard SaaS tools (Google Drive, Notion, Confluence)
- You have fewer than ~30 users
- You want it working this week, not next month
How Long It Actually Takes
For a custom RAG chatbot on a well-organized knowledge base:
- Week 1: ingest docs, embed, set up vector DB, wire up baseline retrieval
- Week 2: build LLM prompt, add citations, basic UI
- Week 3: iterate on retrieval quality (chunking, hybrid search, re-ranking)
- Week 4: user testing, edge-case handling, deployment
About a month for a working system. Longer if your docs are a mess, if you need custom UI, or if you're integrating with existing software.
The Honest Bottom Line
RAG chatbots work. They save real time for companies with lots of internal documentation. But most RAG projects I see fail not because of the technology—they fail because:
- The docs weren't ready (stale, incomplete, contradictory)
- The retrieval wasn't tuned (all defaults, no experimentation)
- Users weren't onboarded (nobody knew it existed)
- Feedback loops weren't built (no way to improve)
Solve those four and RAG delivers. Skip them and RAG becomes an expensive disappointment.