RAG systems that survive production traffic
Retrieval-augmented generation looks easy in a notebook. Then real users arrive with weird queries, indexes go stale, and corpora grow daily. These are patterns from RAG systems we've shipped to enterprise customers.
The notebook lies
RAG demos hide the hard problems: freshness, scale, evaluation, and the long tail of weird queries. The notebook says 'this works'. Production says 'now do it for ten million docs and a thousand concurrent users with daily index churn'.
Three patterns that travel
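Hybrid retrieval. Pure vector search loses to keyword search on entity-heavy queries (product codes, names, error strings). Pure keyword search loses on semantic ones, where the user's wording differs from the document's. Run both and fuse the results. In our experience, the gain is bigger than what you get from swapping in any single better model.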
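One common way to fuse two ranked lists is reciprocal rank fusion (RRF). The text above does not say which fusion method to use, so treat this as a minimal sketch of one option rather than the method behind these systems. The function name, the document IDs, and the constant k=60 (a conventional default) are all assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranked list.

    Each doc earns 1 / (k + rank) from every list it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from a keyword retriever and a vector retriever.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

RRF only needs ranks, not scores, so it sidesteps the problem of BM25 scores and cosine similarities living on different scales.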
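Re-ranking. Recall is cheap. Precision is expensive. Use a fast first-stage retriever (BM25 + embeddings) to gather candidates, then run a cross-encoder over only the top 20 to lift precision. A cross-encoder scores each query-document pair jointly, which is why it is more accurate and also why it is too slow to run over the whole corpus.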
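The two-stage shape can be sketched as below. To keep the example self-contained, the scorer is a toy token-overlap function standing in for a real cross-encoder; it is an assumption for illustration, not a recommendation. In practice you would pass in a function that wraps your cross-encoder model. The names rerank, score_pair, and toy_overlap_score are hypothetical.

```python
from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 20) -> List[Tuple[str, float]]:
    """Re-score only the first top_n candidates and return them best-first."""
    head = candidates[:top_n]  # the expensive scorer never sees the tail
    scored = [(doc, score_pair(query, doc)) for doc in head]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query tokens found in the doc.

    This is NOT a cross-encoder; it only makes the sketch runnable.
    """
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

# Hypothetical first-stage output, already ordered by the fast retriever.
candidates = [
    "reset your password from the login page",
    "quarterly invoice export settings",
    "how to rotate an api key",
]

ranked = rerank("rotate api key", candidates, toy_overlap_score, top_n=20)
print(ranked[0][0])  # how to rotate an api key
```

Keeping the scorer as a parameter makes it easy to swap models and to measure what the re-ranker adds over the first stage alone.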
Freshness as a first-class metric. Stale indexes silently degrade quality. Track end-to-end freshness (event → index → retrievable) and alert on regressions.
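A minimal sketch of a freshness check, assuming you can log two timestamps per document: when the source event happened and when the document first became retrievable. The p95 percentile and the 15-minute budget are placeholder assumptions, not values from the systems described here, and the function names are hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone

def freshness_lags(records):
    """Seconds between the source event and the doc becoming retrievable."""
    return [(retrievable_at - event_at).total_seconds()
            for event_at, retrievable_at in records]

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def freshness_breached(records, pct=95, budget_seconds=900):
    """True when the chosen percentile of lag exceeds the budget."""
    return percentile(freshness_lags(records), pct) > budget_seconds

# Hypothetical (event_at, retrievable_at) pairs from an indexing pipeline.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    (t0, t0 + timedelta(seconds=60)),
    (t0, t0 + timedelta(seconds=120)),
    (t0, t0 + timedelta(seconds=300)),
    (t0, t0 + timedelta(seconds=1800)),  # one slow document
]

p95 = percentile(freshness_lags(records), 95)
alert = freshness_breached(records)
print(p95, alert)  # 1800.0 True
```

Alerting on a high percentile rather than the mean keeps a handful of very slow documents from hiding behind many fast ones.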
What to measure
Measure retrieval recall@K, answer faithfulness, and citation accuracy. If you can't compute these on a regression set on every PR, you're shipping vibes.
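Of the three, recall@K is the simplest to automate. The sketch below assumes a regression set of queries labeled with the IDs of their relevant documents; the set, the canned retriever, and the function names are hypothetical. Faithfulness and citation accuracy need a grader (human or model) and are left out here.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc IDs that appear in the top-k retrieved."""
    if not relevant:
        raise ValueError("each regression case needs at least one relevant doc")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def mean_recall_at_k(cases, retrieve, k=5):
    """Average recall@k over a regression set, given a retrieve(query) function."""
    total = sum(recall_at_k(retrieve(c["query"]), c["relevant"], k) for c in cases)
    return total / len(cases)

# Hypothetical regression set with labeled relevant docs.
REGRESSION_SET = [
    {"query": "rotate api key", "relevant": ["doc_7"]},
    {"query": "export invoices", "relevant": ["doc_2", "doc_9"]},
]

# Canned results standing in for a real retriever.
CANNED = {
    "rotate api key": ["doc_7", "doc_1", "doc_3"],
    "export invoices": ["doc_2", "doc_4", "doc_5"],
}

def fake_retrieve(query):
    return CANNED[query]

score = mean_recall_at_k(REGRESSION_SET, fake_retrieve, k=3)
print(score)  # 0.75
```

In CI, a check like this would compare the score against the value on the main branch and fail the PR when it drops by more than a tolerance you choose.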
RAG is a real engineering discipline now. Treat it that way.
