AI · Operations · April 2026 · 8 min read

Five RAG metrics to check before you blame the LLM.

Faisal Al-Anqoodi · Founder & CEO

Before you raise model spend or switch vendors, measure retrieval, chunking, and escalation. Most production hallucinations start in documents and indexes, not in parameter count.

An engineer in Muscat opened a ticket: "the model lies about leave policy." Two hours later, the cause was clear: the retrieved chunk came from an old PDF, and the new handbook had never been indexed.

This article is not a defense of any single model. It is a measurement checklist to run before condemning the LLM: five metrics that tie answer quality to the full RAG path [1][2]. For architecture, see our post on what RAG is; for hybrid retrieval, see our hybrid search article.

Metric 1 — retrieval hit rate.

On a fixed question set, how often does the gold passage appear in the top-k results? Without this number, model tuning is roulette [1].
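As a minimal sketch, hit rate at k can be computed from a labeled eval set. The function name, document IDs, and data shapes below are assumptions for illustration, not part of any specific framework:

```python
def hit_rate_at_k(gold_ids, retrieved_ids_per_query, k=5):
    """Share of queries whose gold passage appears in the top-k results."""
    hits = sum(
        1 for gold, retrieved in zip(gold_ids, retrieved_ids_per_query)
        if gold in retrieved[:k]
    )
    return hits / len(gold_ids)

# Hypothetical eval set: 2 of 3 queries surface their gold passage.
gold = ["doc-a", "doc-b", "doc-c"]
retrieved = [
    ["doc-a", "doc-x"],           # hit
    ["doc-y", "doc-z", "doc-b"],  # hit
    ["doc-x", "doc-y"],           # miss
]
print(hit_rate_at_k(gold, retrieved, k=5))  # 0.666...
```

Track this number per release of the index; a drop after re-ingestion is a retrieval problem, not a model problem.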

Metric 2 — document coverage.

What share of questions is actually answerable from the approved corpus? Low coverage means you are asking the model to compensate for missing knowledge; that is ungrounded output, not random hallucination [2].

Metric 3 — chunk conflict rate.

When two retrieved passages contradict each other, models tend to average them into a wrong middle answer. Count conflicts per hundred queries, and fix them with document governance before touching weights [3].
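Detecting contradictions usually needs a human or an LLM judge; once each query carries a boolean conflict flag, the rate itself is trivial. A sketch under that assumption:

```python
def conflict_rate_per_100(conflict_flags):
    """Conflicts per hundred queries.

    `conflict_flags` holds one boolean per query: True when two retrieved
    passages for that query contradicted each other (flagged by a human
    reviewer or an LLM judge upstream).
    """
    if not conflict_flags:
        return 0.0
    return 100.0 * sum(conflict_flags) / len(conflict_flags)

flags = [False, True, False, False]  # 1 conflicting query out of 4
print(conflict_rate_per_100(flags))  # 25.0
```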

If the retrieved chunk is wrong or stale, the smartest model in the world will lie — confidently.

Metric 4 — tail latency p95.

Quality is worthless if latency blows the SLA. Watch p95 for retrieval and generation measured together; variance usually comes from the index or from server-side batching [4].
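A minimal p95 over end-to-end samples, using the nearest-rank method (the smallest sample value at or above the 95th percentile rank); the latency values are made up:

```python
import math

def p95_ms(samples):
    """Nearest-rank p95: the smallest sample >= 95% of all samples."""
    s = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(s)))
    return s[rank - 1]

# Retrieval + generation measured together, per query, in milliseconds.
latencies = [100 * i for i in range(1, 21)]  # 100, 200, ..., 2000
print(p95_ms(latencies))  # 1900
```

Measuring retrieval and generation as one span matters: a fast model behind a slow index still misses the SLA.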

Metric 5 — human escalation cost.

What share of threads ends with a human? What is the average number of minutes a human spends per escalation? A rising line usually signals policy and chunking issues, not a lack of model horsepower [5].
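Both numbers fall out of the same thread log. This sketch assumes each thread record carries an escalation flag and the human minutes spent; the shape is hypothetical:

```python
def escalation_stats(threads):
    """Compute (escalation share, average human minutes per escalated thread).

    `threads` is a list of (escalated: bool, human_minutes: float) tuples.
    """
    escalated_minutes = [minutes for escalated, minutes in threads if escalated]
    share = len(escalated_minutes) / len(threads) if threads else 0.0
    avg_minutes = (
        sum(escalated_minutes) / len(escalated_minutes)
        if escalated_minutes else 0.0
    )
    return share, avg_minutes

threads = [(True, 12.0), (False, 0.0), (True, 8.0), (False, 0.0)]
print(escalation_stats(threads))  # (0.5, 10.0)
```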

Quick decision table.

FIG. 1 — RAG DEBUG: WHICH LAYER TO FIX FIRST

Closing.

In recent months at Nuqta we ran these metrics across more than twelve models and vendors. The lesson repeated itself: upgrading retrieval builds trust faster than swapping the model card alone.

Spend one week measuring the five metrics before buying hardware or upgrading subscriptions. If the numbers do not move, you are renaming the problem, not solving it.

Frequently asked questions.

  • What is the minimum eval set? Start with fifty questions that mirror real channels; smaller sets lie [1].
  • Do I need expensive tooling? No — spreadsheets suffice at first if labeling is disciplined.
  • When do I change the model? When the five metrics are stable and errors are linguistic or formatting — see fine-tuning vs prompting.
  • How does measurement tie to compliance? Log retrieval over sensitive docs; align with PDPL.
  • If time is tight, start where? Hit rate at k — it explains half the story in a day.

Sources.

[1] Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — NeurIPS 2020 / arXiv.

[2] Es et al. — RAGAS: Automated Evaluation of Retrieval Augmented Generation — arXiv, 2023.

[3] Gao et al. — Retrieval-Augmented Generation: A Survey — arXiv, 2023.

[4] Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) — SOSP 2023 / arXiv.

[5] Nuqta — internal RAG evaluation dashboards for Gulf deployments, April 2026.
