I got my first RAG pipeline working in a weekend. Documents in, vectors out, plug into GPT, ask a question, get an answer. The demo was great. Then I pointed it at actual internal docs and it started confidently telling users that our API supported features we’d deprecated two years ago.
Turns out “load documents, chunk, embed, retrieve, generate” is the easy part. The hard part is everything that breaks silently: chunks that lose meaning because they got split mid-thought, embeddings that think “ARR” is pirate speak instead of Annual Recurring Revenue, search results that look relevant but don’t actually answer anything.
Why most tutorials don’t prepare you
Every RAG tutorial follows the same recipe: load documents, split into chunks, embed with OpenAI, store in a vector DB, retrieve top-K, stuff into prompt. Works on demo datasets. Falls apart on real documents.
Chunking destroys context — a fixed 500-token chunk splits paragraphs mid-sentence. Embeddings miss domain terms. Top-K retrieval gives you five slightly different versions of the same info while missing the actual answer. And the prompt is usually an afterthought.
All fixable. But you have to know where the problems are.
Chunking that preserves meaning
Fixed-size splitting is easy to implement and terrible in practice. I use a two-pass approach: split on structural boundaries first (headings, paragraphs), then merge small chunks and split large ones on sentence boundaries.
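import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    content: str
    metadata: dict
    token_count: int


def _estimate_tokens(text: str) -> int:
    # Rough heuristic: about 1.3 tokens per whitespace-separated word.
    return int(len(text.split()) * 1.3)


def _make_chunk(text: str, metadata: Optional[dict] = None) -> Chunk:
    # Helper not shown in the original listing; a minimal assumed version.
    return Chunk(content=text, metadata=metadata or {}, token_count=_estimate_tokens(text))


def semantic_chunk(text: str, max_tokens: int = 512, min_tokens: int = 100) -> list[Chunk]:
    # Pass 1: split on structural boundaries (headings, blank lines).
    sections = re.split(r'\n#{1,3}\s|\n\n', text)
    sections = [s.strip() for s in sections if s.strip()]

    # Pass 2: merge small sections, split oversized ones on sentences.
    chunks = []
    buffer = ""
    for section in sections:
        if _estimate_tokens(section) > max_tokens:
            # Too big: flush the pending buffer first so it isn't lost,
            # then split this section on sentence boundaries.
            if buffer:
                chunks.append(_make_chunk(buffer))
            sentences = re.split(r'(?<=[.!?])\s+', section)
            current = ""
            for sent in sentences:
                if _estimate_tokens(current + " " + sent) > max_tokens:
                    if current:
                        chunks.append(_make_chunk(current))
                    current = sent
                else:
                    current = (current + " " + sent).strip()
            # Carry the tail forward so it can merge with the next section.
            buffer = current
        elif _estimate_tokens(buffer + " " + section) > max_tokens:
            if buffer:
                chunks.append(_make_chunk(buffer))
            buffer = section
        else:
            buffer = (buffer + "\n\n" + section).strip()
    if buffer:
        chunks.append(_make_chunk(buffer))
    # Drop fragments below the minimum size.
    return [c for c in chunks if c.token_count >= min_tokens]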
A 512-token maximum fits within embedding model context windows. A 100-token minimum keeps useless fragments out. I also add a 2-sentence overlap between consecutive chunks (not shown in the code above), so that a chunk mentioning a “40% increase” also includes what actually increased.
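The overlap step can be sketched roughly like this. It's a minimal illustration, not the exact code from my pipeline: `add_sentence_overlap` is a hypothetical helper name, it works on plain strings rather than `Chunk` objects, and it reuses the same naive sentence regex as the chunker.

```python
import re


def add_sentence_overlap(chunks: list[str], n_sentences: int = 2) -> list[str]:
    """Prefix each chunk with the last n sentences of the previous chunk."""
    if n_sentences <= 0:
        return list(chunks)
    overlapped = []
    previous_tail = ""
    for text in chunks:
        overlapped.append(f"{previous_tail} {text}".strip())
        # Naive sentence split; abbreviations like "e.g." will confuse it.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        # The tail comes from the original chunk, so overlap never compounds.
        previous_tail = " ".join(sentences[-n_sentences:])
    return overlapped


chunks = [
    "Q3 was strong. Enterprise signups grew fast. Churn stayed flat.",
    "That is a 40% increase. Pricing did not change.",
]
result = add_sentence_overlap(chunks)
```

Overlap makes each chunk a bit longer, so if you run it after chunking you may need to leave some headroom below the token maximum.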
Attach metadata to every chunk — source file, section title, page number. When the model cites a source, you can trace it back. Without metadata, you have answers with no provenance.
Vector search with pgvector
I use PostgreSQL with the pgvector extension instead of a dedicated vector database. Simple reason: I already run PostgreSQL. Adding Pinecone or Qdrant means another service to deploy and pay for. pgvector handles the volumes I need.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    content_hash TEXT UNIQUE NOT NULL,
    embedding vector(1536),
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX chunks_embedding_idx
    ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
I pull 10 results even though I only use 3-5 in the final prompt. Vector similarity is a rough first pass. The re-ranking step is where real filtering happens.
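For reference, the first-pass retrieval can look something like the sketch below. It only builds the SQL and parameters, so it runs without a database. The function names are hypothetical, and the `%s` placeholders assume a psycopg-style driver. `<=>` is pgvector's cosine distance operator, which matches the `vector_cosine_ops` index.

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a Python list as pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_query(embedding: list[float], limit: int = 10) -> tuple[str, tuple]:
    """Return (sql, params) for a cosine-distance search over the chunks table."""
    sql = (
        "SELECT id, content, metadata, "
        "1 - (embedding <=> %s::vector) AS similarity "
        "FROM chunks "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    literal = to_pgvector_literal(embedding)
    return sql, (literal, literal, limit)


sql, params = build_search_query([0.1, 0.2, 0.3], limit=10)
# With a driver such as psycopg this would run as: cursor.execute(sql, params)
```

With an ivfflat index, pgvector's `ivfflat.probes` setting trades speed for recall, so it's worth checking if the top 10 look thin.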
Re-ranking changes everything
This is probably the single biggest quality improvement most pipelines skip.
Vector search finds chunks that are semantically similar to the query. But “semantically similar” and “actually answers the question” are different things. A chunk describing the problem in similar words might rank higher than the chunk with the actual solution.
A cross-encoder takes the query and each candidate chunk as a pair and scores how well the chunk answers the query:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # ~60MB


def rerank(query: str, results: list, top_n: int = 5) -> list:
    # Score every (query, chunk) pair jointly with the cross-encoder.
    pairs = [(query, r.content) for r in results]
    scores = reranker.predict(pairs)
    for result, score in zip(results, scores):
        result.score = float(score)
    # Highest score first; keep only the top_n candidates.
    return sorted(results, key=lambda r: r.score, reverse=True)[:top_n]
In my testing, re-ranking bumped answer relevance by about 30-40% over raw vector search. Biggest gains on ambiguous queries where user wording doesn’t match document terminology.
Prompt structure matters more than people think
Don’t just concatenate all chunks into one big context block. The model loses track of which chunk says what, and longer context increases hallucination.
def build_prompt(query: str, results: list) -> list[dict]:
    # One numbered, labelled block per retrieved chunk.
    context_blocks = []
    for i, r in enumerate(results, 1):
        source = r.metadata.get("source", "unknown")
        context_blocks.append(f"[Source {i}] ({source})\n{r.content}")
    context = "\n\n---\n\n".join(context_blocks)
    return [
        {"role": "system", "content": (
            "Answer based on the provided context only. "
            "Reference sources by number. "
            "If the context doesn't have enough info, say so."
        )},
        {"role": "user", "content": f"Context:\n{context}\n\n---\nQuestion: {query}"},
    ]
Numbered sources so the model can cite references. Separators to help it keep chunks distinct. And that explicit “say you don’t know” instruction — without it, the model just makes stuff up from partial context.
Where it still breaks
The most frustrating failure: right chunks retrieved, wrong answer anyway. Usually contradictory documents or outdated docs mixed with current ones. I handle this by adding timestamps to chunk metadata and telling the model to prefer recent sources.
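One way to surface recency is to put the date in each source label and add a line to the system prompt. This is a sketch under assumptions: the `updated_at` metadata key, the `Result` class, and the instruction wording are all made up for illustration. Sorting newest-first also overrides the re-ranker's order, which may or may not be what you want.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    # Hypothetical stand-in for whatever object the retriever returns.
    content: str
    metadata: dict = field(default_factory=dict)


def format_dated_context(results: list[Result]) -> str:
    """Label each source with its date, newest first, so recency is visible."""
    # ISO 8601 date strings sort chronologically as plain strings;
    # results with no date sort last.
    ordered = sorted(results, key=lambda r: r.metadata.get("updated_at", ""), reverse=True)
    blocks = []
    for i, r in enumerate(ordered, 1):
        source = r.metadata.get("source", "unknown")
        updated = r.metadata.get("updated_at", "date unknown")
        blocks.append(f"[Source {i}] ({source}, updated {updated})\n{r.content}")
    return "\n\n---\n\n".join(blocks)


# Assumed wording; append it to the system prompt.
RECENCY_INSTRUCTION = (
    "If sources contradict each other, prefer the most recently updated one "
    "and mention the conflict."
)

results = [
    Result("v1 supports XML export.", {"source": "api-v1.md", "updated_at": "2021-03-01"}),
    Result("XML export was removed in v3.", {"source": "api-v3.md", "updated_at": "2023-06-15"}),
]
context = format_dated_context(results)
```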
Questions that span two chunks are hard. The answer is partly in chunk 3, partly in chunk 7. Sentence overlap helps. For complex documents, I’ve experimented with parent-child chunking: store both a large parent chunk and smaller children, search against children but pull the parent into context.
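The parent-child idea, stripped down to its data structure, might look like this. It's an illustrative sketch only: children are fixed word windows rather than semantic chunks, the names are hypothetical, and the search step is left out. You would embed and search the children, then swap in parents before building the prompt.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    content: str
    parent_id: int


def build_parent_child(
    parents: list[str], child_size: int = 40
) -> tuple[dict[int, str], list[ChildChunk]]:
    """Split each parent into word-window children that remember their parent."""
    parent_store = {}
    children = []
    for parent_id, text in enumerate(parents):
        parent_store[parent_id] = text
        words = text.split()
        for start in range(0, len(words), child_size):
            window = " ".join(words[start:start + child_size])
            children.append(ChildChunk(window, parent_id))
    return parent_store, children


def expand_to_parents(hits: list[ChildChunk], parent_store: dict[int, str]) -> list[str]:
    """Map matched children to their parents, de-duplicated, keeping hit order."""
    seen = set()
    expanded = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            expanded.append(parent_store[hit.parent_id])
    return expanded


parents = ["alpha beta gamma delta", "one two three"]
store, children = build_parent_child(parents, child_size=2)
# Pretend the vector search matched these three children, in this order.
context_texts = expand_to_parents([children[1], children[0], children[3]], store)
```

Two children from the same parent collapse into one context entry, which also cuts down on near-duplicate chunks in the prompt.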
And users don’t use the same words as your documents. They say “cancel my account” when the docs say “account deactivation process.” Query expansion — generating alternative phrasings with a fast model before searching — catches a lot of these mismatches.
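Query expansion boils down to two small pieces: collect distinct rephrasings, then merge the results from each. In this sketch the model call is passed in as a plain function, so `fake_paraphrase` is a hard-coded stand-in for a fast LLM, and all names here are hypothetical.

```python
from typing import Callable


def expand_query(
    query: str, paraphrase: Callable[[str], list[str]], max_variants: int = 3
) -> list[str]:
    """Return the original query plus up to max_variants distinct rephrasings."""
    variants = [query]
    seen = {query.strip().lower()}
    for candidate in paraphrase(query):
        key = candidate.strip().lower()
        if key and key not in seen:
            seen.add(key)
            variants.append(candidate.strip())
        if len(variants) > max_variants:
            break
    return variants


def merge_results(result_lists: list[list[str]]) -> list[str]:
    """Union per-variant result ids, keeping first-seen order."""
    merged = []
    seen = set()
    for results in result_lists:
        for chunk_id in results:
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


def fake_paraphrase(query: str) -> list[str]:
    # Stand-in for a call to a fast LLM; hard-coded for illustration.
    return ["account deactivation process", "close my account", "Cancel my account"]


variants = expand_query("cancel my account", fake_paraphrase)
merged = merge_results([["a", "b"], ["b", "c"]])
```

Each variant costs an extra embedding call and vector search, so the merged candidates should go through the re-ranker rather than straight into the prompt.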
How it actually performs
After a few months running this for internal document Q&A: factual questions with clear answers hit about 90% accuracy. Multi-source synthesis is around 75%. Complex reasoning across many documents is closer to 50% and probably needs a knowledge graph, not just RAG.
Latency is P50 ~800ms, P95 ~2.5 seconds including re-ranking. Cost is about $0.02 per query.
If you’re starting from scratch: spend most of your effort on chunking. Bad chunks poison everything downstream. Use pgvector if you already run PostgreSQL. Add re-ranking early — the quality gain is way bigger than the effort. Structure your prompts with numbered sources. And monitor from day one. Log every query and score, review the lowest-scoring ones weekly. They tell you exactly where your pipeline is failing.
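The logging doesn't need to be fancy. Here's a minimal sketch that uses a JSONL file. The field names are made up, and I'm assuming the score you log is the top re-rank score for the query.

```python
import json
import tempfile
from pathlib import Path


def log_query(log_path: Path, query: str, top_score: float, answer: str) -> None:
    """Append one JSON line per answered query."""
    record = {"query": query, "top_score": top_score, "answer": answer}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def lowest_scoring(log_path: Path, n: int = 20) -> list[dict]:
    """Return the n logged queries with the lowest top re-rank score."""
    with log_path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return sorted(records, key=lambda r: r["top_score"])[:n]


# Demo against a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "queries.jsonl"
    log_query(path, "how do I cancel?", 7.2, "See the deactivation guide.")
    log_query(path, "what is ARR?", -3.1, "Not enough context.")
    worst = lowest_scoring(path, n=1)
```

A low top score usually means retrieval found nothing good, which points at chunking or vocabulary mismatch rather than the prompt.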