RAG works in the demo. Why does it fail in production?

01 Why the demo always works

Demo conditions are flawless: ten clean PDFs, three well-intentioned questions, and an evaluator who already knows the answer. Frameworks let you build this in five minutes, and you should: a fast prototype is the cheapest way to understand the problem.

Production is the opposite: ten thousand documents, conflicting versions, misspelled queries, questions with no answer, and an environment where a wrong answer is measured in money.

RAG projects don't fail at retrieval. They fail at discipline.

02 Retrieval quality: the system's ceiling

However good the model, if the wrong context arrives, a correct answer is a matter of luck. Three practices make the difference in production:

Design chunking around the data. A contract clause, a product table and a support article don't split the same way. "512 tokens, 64 overlap" is a starting point, not a strategy.
Use hybrid search. Vector similarity is great for "how many days for returns?" and bad for "SKU TZ-4810B". The combination of BM25, vectors and a reranker is noticeably better than embeddings alone.
Take metadata seriously. If retrieval runs without filtering on date, version, language and access permissions, the model will produce a "correct" answer from stale or unauthorized content.
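
To make the last two practices concrete, here is a minimal sketch of a metadata filter followed by rank fusion. It rests on several assumptions: the BM25 and vector rankings are plain lists of chunk ids produced elsewhere, `ChunkMeta`, `hybrid_search` and the sample ids are hypothetical names, and the fusion step is reciprocal rank fusion, which is one common way to merge two rankings and not necessarily what any given stack uses. The version number stands in for date filtering, and the reranker is left out; it would reorder the fused list. A production system would usually push the filter into the index query instead of filtering candidates afterwards.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMeta:
    doc: str               # logical document, e.g. "returns-policy"
    version: int           # higher means newer
    language: str
    roles: frozenset[str]  # roles allowed to read this chunk


def latest_versions(index: dict[str, ChunkMeta]) -> dict[str, int]:
    """Newest version number seen for each logical document."""
    latest: dict[str, int] = {}
    for meta in index.values():
        latest[meta.doc] = max(latest.get(meta.doc, meta.version), meta.version)
    return latest


def is_allowed(meta: ChunkMeta, role: str, language: str, latest: dict[str, int]) -> bool:
    """Keep chunks the role may read, in the right language, from the newest version."""
    return (
        role in meta.roles
        and meta.language == language
        and meta.version == latest[meta.doc]
    )


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(id) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_search(
    bm25_ranking: list[str],
    vector_ranking: list[str],
    index: dict[str, ChunkMeta],
    role: str,
    language: str,
) -> list[str]:
    """Filter both candidate lists on metadata, then fuse them."""
    latest = latest_versions(index)

    def keep(ranking: list[str]) -> list[str]:
        return [cid for cid in ranking if is_allowed(index[cid], role, language, latest)]

    return rrf([keep(bm25_ranking), keep(vector_ranking)])


# Hypothetical index and rankings, for illustration only.
everyone = frozenset({"public", "staff"})
index = {
    "returns-v3#1": ChunkMeta("returns-policy", 3, "tr", everyone),
    "returns-v2#1": ChunkMeta("returns-policy", 2, "tr", everyone),
    "sku-TZ-4810B#1": ChunkMeta("catalog", 1, "tr", everyone),
    "pricing-internal#1": ChunkMeta("pricing", 1, "tr", frozenset({"staff"})),
}
bm25_ranking = ["sku-TZ-4810B#1", "returns-v2#1", "returns-v3#1"]
vector_ranking = ["returns-v3#1", "returns-v2#1", "pricing-internal#1"]

results = hybrid_search(bm25_ranking, vector_ranking, index, role="public", language="tr")
print(results)
```

In this toy data the stale `returns-v2#1` chunk and the staff-only pricing chunk never reach the model for a public user, even though both retrievers ranked them.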

03 No eval set, no improvement

"It looks better today" is not a metric. For every RAG system headed to production we build an eval set of real user questions, each labeled with its expected source and answer, plus a score that runs automatically on every change.

eval.yaml (example)

# runs on every PR
# case: return window for an order placed without an account; "gün" means "days"
case: "iade süresi — üyeliksiz sipariş"
expect_source: "iade-politikasi-v3.md"
expect_contains: "14 gün"
reject_if: "30 gün"  # old policy

The set starts small (50 questions is enough), and every production failure comes back into it as a new case. Six months later you hold the system's real memory.
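
A case like the one in eval.yaml could be checked with something like the sketch below. It is only an illustration under stated assumptions: the field names mirror the example, while `EvalCase`, `Answer`, `failures`, `pass_rate`, the English case name and the `iade-politikasi-v2.md` filename are hypothetical, and `answer_fn` stands in for whatever call produces an answer and its cited sources in a real pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case: str
    expect_source: str
    expect_contains: str
    reject_if: str = ""  # optional: text that must NOT appear


@dataclass(frozen=True)
class Answer:
    text: str
    sources: tuple[str, ...]


def failures(case: EvalCase, answer: Answer) -> list[str]:
    """Return a human-readable reason for every check the answer fails."""
    found: list[str] = []
    if case.expect_source not in answer.sources:
        found.append(f"missing source {case.expect_source}")
    if case.expect_contains not in answer.text:
        found.append(f"missing text {case.expect_contains!r}")
    if case.reject_if and case.reject_if in answer.text:
        found.append(f"contains rejected text {case.reject_if!r}")
    return found


def pass_rate(cases: list[EvalCase], answer_fn: Callable[[EvalCase], Answer]) -> float:
    """Fraction of cases whose answer passes every check."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if not failures(case, answer_fn(case)))
    return passed / len(cases)


returns_case = EvalCase(
    case="return window, guest order",
    expect_source="iade-politikasi-v3.md",
    expect_contains="14 gün",
    reject_if="30 gün",
)
good = Answer("İade süresi 14 gün.", ("iade-politikasi-v3.md",))
stale = Answer("İade süresi 30 gün.", ("iade-politikasi-v2.md",))

print(failures(returns_case, good))
print(failures(returns_case, stale))
```

A CI job could call `pass_rate` over the whole set and fail the build when the score drops below an agreed threshold; the threshold itself is a team decision, not something this sketch prescribes.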

04 Guardrails and observability

A model that can say "I don't know" is a feature; rejecting answers that can't cite a source is a rule. Alongside that, we record a trace for every answer: which query, which chunks, which scores, how many milliseconds, how many tokens. When a user says "it gave a nonsense answer", the place to look should be a trace, not a grep through logs.
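
The citation rule and the per-answer trace might be wired together roughly as follows. This is a sketch under assumptions, not a reference implementation: `retrieve` and `generate` are hypothetical callables standing in for the real retriever and model, `generate` is assumed to return the answer text, the chunk ids it cites and a token count, and the trace is simply serialized to JSON where a real system would ship it to a tracing backend.

```python
from __future__ import annotations

import json
import time
from dataclasses import asdict, dataclass
from typing import Callable

REFUSAL = "I don't know: no retrieved source supports an answer."

# Hypothetical interfaces: (chunk_id, score) pairs in, (text, cited ids, tokens) out.
Retriever = Callable[[str], list[tuple[str, float]]]
Generator = Callable[[str, list[str]], tuple[str, list[str], int]]


@dataclass(frozen=True)
class Trace:
    query: str
    chunk_ids: list[str]
    scores: list[float]
    cited: list[str]
    refused: bool
    latency_ms: float
    tokens: int


def answer_with_trace(query: str, retrieve: Retriever, generate: Generator) -> tuple[str, Trace]:
    start = time.perf_counter()
    hits = retrieve(query)
    chunk_ids = [cid for cid, _ in hits]

    # With nothing retrieved there is nothing to cite, so skip the model call.
    text, cited, tokens = generate(query, chunk_ids) if hits else (REFUSAL, [], 0)

    # Rule: reject any answer that cites nothing, or cites a chunk we never retrieved.
    refused = not cited or not set(cited) <= set(chunk_ids)
    if refused:
        text, cited = REFUSAL, []

    trace = Trace(
        query=query,
        chunk_ids=chunk_ids,
        scores=[score for _, score in hits],
        cited=list(cited),
        refused=refused,
        latency_ms=(time.perf_counter() - start) * 1000,
        tokens=tokens,
    )
    return text, trace


def fake_retrieve(query: str) -> list[tuple[str, float]]:
    return [("iade-politikasi-v3.md#2", 0.82)]


def fake_generate(query: str, chunk_ids: list[str]) -> tuple[str, list[str], int]:
    return ("The return window is 14 days.", chunk_ids[:1], 42)


text, trace = answer_with_trace("How many days for returns?", fake_retrieve, fake_generate)
print(text)
print(json.dumps(asdict(trace), ensure_ascii=False))
```

Because the refusal is recorded in the same trace as the chunks and scores, a complaint about a bad or missing answer can be traced back to what retrieval actually returned.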

Short list. Before go-live: hybrid search ✓ · eval set + CI ✓ · mandatory citations ✓ · tracing ✓ · cost/request limits ✓. Without these five, what you have is still a demo.

RAG · LLM · Eval · Production AI
