When should we start RAG evaluation?

Week one of the project, before you tune prompts or swap models. Eval is how you know a change helped.

How large should a golden dataset be?

Start with 50–200 questions covering citations, refusals, and escalations—not just final answers.

What metrics matter most for RAG?

Citation precision, answer faithfulness, refusal rate, and p95 latency—track all four, not a single score.

Eval-driven development for RAG features

The fastest way to ship a RAG feature your customers don't trust is to ship it without an eval harness. We've now run six production RAG projects and the pattern is the same every time:

The first prototype demos beautifully on cherry-picked questions
A real user asks a real question and gets a confidently wrong answer
The team starts tweaking prompts, and starts shipping silent regressions

Why do RAG features fail in production?

Demos optimize for the happy path. Production optimizes for edge cases—ambiguous questions, stale documents, and users who trust confident tone more than citations. Without eval, every "fix" is a guess.

What is the minimum eval setup?

You need three things, and you need them in week one:

A locked golden dataset. 50–200 questions with expected behaviour. Not just expected answers — expected citations, expected refusals, expected escalations.
A scoring pipeline. Run the eval against your full RAG pipeline (retrieval + synthesis), score it, and store the result.
A CI gate. No prompt or model change ships without running the eval. If it regresses, the change doesn't merge.

// example: minimal eval runner
for (const item of dataset) {
  const result = await ragPipeline.answer(item.question);
  const score = grade({ expected: item, got: result });
  await storeRun({ runId, item, result, score });
}

Which metrics should you track?

Don't reduce eval to a single number. Track at minimum:

Citation precision: are the cited sources actually relevant?
Answer faithfulness: does the answer follow from the sources?
Refusal rate: do you refuse when the data isn't there?
Latency p50 and p95: nobody likes a slow copilot

Key takeaways

Treat eval as product infrastructure, not a research side quest.
Score behaviour users feel: citations, refusals, and speed—not BLEU alone.
Gate merges on eval so prompt tweaks cannot silently regress quality.

Eval-driven RAG isn't glamorous. It's the difference between a demo and a system you can put in front of paying customers without losing sleep.