Five RAG metrics to check before you blame the LLM.
Faisal Al-Anqoodi · Founder & CEO
Before you raise model spend or switch vendors, measure retrieval, chunks, and escalation. Most production hallucination starts in documents and indexes — not parameter count.
An engineer in Muscat opened a ticket: "the model lies about leave policy." Two hours later, the culprit was found: the retrieved chunk came from an old PDF; the new handbook had never been indexed.
This article is not a defense of any single model. It is a checklist to run before condemning the LLM: five metrics that tie answer quality to the full RAG path [1][2]. For the architecture, see "What is RAG"; for retrieval fusion, see "Hybrid search".
Metric 1 — retrieval hit rate.
On a fixed question set, how often does the gold passage appear in top-k? Without this number, model tuning is roulette [1].
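As a minimal sketch, hit rate at k can be computed from an eval set where each question records its gold passage id and the ids the retriever returned (the field names and sample ids here are illustrative assumptions, not a fixed schema):

```python
# Minimal hit@k sketch. The record fields (gold_id, retrieved_ids)
# are illustrative; adapt them to however your eval set is stored.
def hit_rate_at_k(eval_items, k):
    """Share of questions whose gold passage appears in the top-k retrieved ids."""
    hits = sum(1 for item in eval_items
               if item["gold_id"] in item["retrieved_ids"][:k])
    return hits / len(eval_items)

eval_items = [
    {"gold_id": "handbook-2025-p4",
     "retrieved_ids": ["handbook-2025-p4", "handbook-2019-p4"]},
    {"gold_id": "handbook-2025-p9",
     "retrieved_ids": ["handbook-2019-p9", "faq-12"]},
]
print(hit_rate_at_k(eval_items, k=2))  # 0.5
```

Run it at the same k your production retriever uses, or the number tells you nothing about production.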
Metric 2 — document coverage.
What share of questions is actually answerable from the approved corpus? Low coverage means you ask the model to compensate for missing knowledge — that is ungrounded output, not random hallucination [2].
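Coverage is the cheapest metric of the five: label each eval question as answerable or not from the approved corpus (a manual review pass is enough to start) and take the share. A sketch with made-up labels:

```python
# Coverage sketch: each eval question is manually labeled as answerable
# (True/False) from the approved corpus. The labels below are examples.
def coverage(answerable_labels):
    """Share of eval questions the approved corpus can actually answer."""
    return sum(answerable_labels) / len(answerable_labels)

labels = [True, True, False, True, False]  # from a manual doc review
print(coverage(labels))  # 0.6
```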
Metric 3 — chunk conflict rate.
When two retrieved passages contradict, models tend toward a wrong middle. Count conflicts per hundred queries; fix with document governance before touching weights [3].
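Counting conflicts only needs a per-query flag: did any two retrieved passages contradict each other? Whether the flag comes from a reviewer or an NLI model is up to you; the arithmetic is the same (flags below are illustrative):

```python
# Conflict-rate sketch: one boolean flag per query marking whether any
# two retrieved passages contradicted each other. Flags are synthetic.
def conflicts_per_hundred(conflict_flags):
    """Contradicting retrievals per 100 queries."""
    return 100 * sum(conflict_flags) / len(conflict_flags)

flags = [False] * 47 + [True] * 3  # 3 conflicting retrievals in 50 queries
print(conflicts_per_hundred(flags))  # 6.0
```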
If the retrieved chunk is wrong or stale, the smartest model in the world will lie — confidently.
Metric 4 — tail latency p95.
Quality is worthless if latency blows SLA. Watch p95 for retrieval + generation together; variance usually comes from the index or batching on the server [4].
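A nearest-rank p95 over end-to-end samples (retrieval plus generation, one number per request) needs no tooling beyond a sorted list; the sample latencies below are synthetic:

```python
# p95 sketch over end-to-end latencies (retrieval + generation, in ms),
# using the nearest-rank method. Sample data is synthetic.
import math

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a latency sample."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]

latencies = list(range(100, 2100, 20))  # 100 samples: 100 ms .. 2080 ms
print(p95(latencies))  # 1980
```

Measure the two stages separately as well; a healthy generation p95 hiding behind a sick retrieval p95 points the fix at the index, not the model.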
Metric 5 — human escalation cost.
What share of threads ends with a human? What are average escalation minutes? A rising line usually signals policy and chunking issues, not model horsepower [5].
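Both numbers fall out of a per-thread log with an escalation flag and the minutes a human spent (the field names here are assumptions; the thread data is synthetic):

```python
# Escalation sketch: each thread records whether a human took over and,
# if so, how many minutes they spent. Field names are assumptions.
def escalation_stats(threads):
    """(share of threads ending with a human, average escalation minutes)."""
    escalated = [t for t in threads if t["escalated"]]
    share = len(escalated) / len(threads)
    avg_minutes = (sum(t["minutes"] for t in escalated) / len(escalated)
                   if escalated else 0.0)
    return share, avg_minutes

threads = [
    {"escalated": True,  "minutes": 12},
    {"escalated": False, "minutes": 0},
    {"escalated": True,  "minutes": 8},
    {"escalated": False, "minutes": 0},
]
share, minutes = escalation_stats(threads)
print(share, minutes)  # 0.5 10.0
```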
Quick decision table.

| Metric off | Likely cause | First move |
| --- | --- | --- |
| Low hit rate at k | Stale or missing index | Reindex and tune retrieval before the model |
| Low document coverage | Gaps in the approved corpus | Add or update documents |
| High chunk conflict rate | Contradicting document versions | Document governance and deduplication |
| High p95 latency | Index or server-side batching | Profile retrieval and batching first |
| Rising escalation cost | Policy and chunking issues | Revisit chunking and escalation policy |
Closing.
At Nuqta we have run these metrics across more than twelve models and vendors in recent months; the repeated lesson is that retrieval upgrades trust faster than swapping the model card alone.
Spend one week measuring the five before buying hardware or upgrading subscriptions. If numbers do not move, you are renaming the problem, not solving it.
Frequently asked questions.
- What is the minimum eval set? Start with fifty questions that mirror real channels; smaller sets lie [1].
- Do I need expensive tooling? No — spreadsheets suffice at first if labeling is disciplined.
- When do I change the model? When the five metrics are stable and errors are linguistic or formatting — see fine-tuning vs prompting.
- How does measurement tie to compliance? Log retrieval over sensitive docs; align with PDPL.
- If time is tight, start where? Hit rate at k — it explains half the story in a day.
Sources.
[2] Es et al. — RAGAS: Automated Evaluation of Retrieval Augmented Generation — arXiv, 2023.
[3] Gao et al. — Retrieval-Augmented Generation: A Survey — arXiv, 2023.
[5] Nuqta — internal RAG evaluation dashboards for Gulf deployments, April 2026.
Related posts
- Hybrid search — combining lexical and vector retrieval.
This is not a vendor badge. It is an architecture decision: when token overlap saves you, when embedding similarity saves you, and how to fuse both without doubling cost or losing the ability to measure either side.
- What is RAG — and why your company bot answers like a stranger.
A practical guide to Retrieval-Augmented Generation: how your bot reads documents before answering, and why it costs 10× less than fine-tuning.
- What is fine-tuning — and how it differs from prompting.
Half the meetings say "we will tune the model" while they mean "we will rewrite the prompt." The two complement each other — but one changes the text going in, and the other can change the model's weights. That distinction clarifies the decision and saves you from training costs you did not need.
- Hallucinated citations — auditing RAG source links before you trust the UI.
The UI shows a "source" while the paragraph is missing, truncated, or from the wrong page. This article gives a practical audit path before you ship the assistant to staff or customers.
- Inference vs training for LLMs — who pays for what.
Training might run once (or for many hours) and you pay a cluster bill. Inference runs forever and turns a model into a per-token Opex line. This article separates the two checkbooks so pilot budgets are not mixed with product bills [1].