How we measure chatbot quality.
Faisal Al-Anqoodi · Founder · CEO
A bot that answers a thousand messages is not necessarily a good bot. These are the four numbers we track at Nuqta — and why most people track the wrong ones.
The most common mistake we see in companies that tried a chatbot before us: they measured success by the number of messages answered. The number looks beautiful in the monthly report. "The bot answered 2,400 messages." The board applauds, the manager posts a screenshot on LinkedIn, and the real customer — off-stage — silently unsubscribes.
Quality in bots is not measured by reply count. It is measured by conversation outcomes. This piece explains how we do that at Nuqta, with a method that is repeatable in any company.
The vanity metric: "messages answered."
In measurement literature, this class of number is called a vanity metric — a figure that rises automatically with usage and tells you nothing about value. "Messages answered" goes up whether the bot is giving correct answers or generic replies unrelated to the question. Either way, it "replied."
The figure below shows the problem. The same 1,000 messages the manager sees as one successful block decompose, on analysis, into three categories — two of which are silent failures:
The four numbers we track.
After twenty-two measurement reviews with clients across sectors, we settled on four indicators. Not every project tracks all of them, but we treat any bot that tracks fewer than three of them as a pilot, not a production system:
- Resolution Rate: share of conversations that ended with a solution, without handoff to a human, and without the same question being re-asked within 24 hours.
- Handoff Rate: share of conversations that were escalated to a human. Not negative by itself, but it reveals the bot's limits.
- Silent Abandonment Rate: share of conversations in which the customer sent a first message, received at least one reply, and then never wrote again. The single most important metric, and the most often ignored.
- Time-to-Resolution: the average time from the first message to the last message in a conversation that was actually resolved.
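Put together, the four numbers reduce to a small scorecard function. A minimal sketch in Python, assuming each conversation has already been labelled; the `Conversation` fields here are illustrative, not a production schema:

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    resolved: bool           # met the strict resolution criteria
    handed_off: bool         # escalated to a human
    abandoned: bool          # customer went silent after a bot reply
    duration_minutes: float  # first message to last message

def scorecard(conversations):
    """Compute the four indicators over a batch of labelled conversations."""
    n = len(conversations)
    resolved = [c for c in conversations if c.resolved]
    return {
        "resolution_rate": len(resolved) / n,
        "handoff_rate": sum(c.handed_off for c in conversations) / n,
        "silent_abandonment_rate": sum(c.abandoned for c in conversations) / n,
        # averaged only over conversations that were actually resolved
        "time_to_resolution_min": (
            sum(c.duration_minutes for c in resolved) / len(resolved)
            if resolved else None
        ),
    }
```

Note that time-to-resolution is averaged only over resolved conversations; averaging over all of them would reward bots that fail fast.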
The chart below shows a real scorecard from a retail bot after its second month. The dashed blue line on each bar is the target we agreed with the client before launch:
What is not measured does not improve. And what is measured with the wrong number worsens in silence.
How we define a "resolved conversation."
This simple-sounding question is the main point of contention between us and some clients in the first weeks. Our definition is strict, and intentionally so:
- The conversation was not handed off to a human.
- The customer did not re-ask the same question (or its semantic sibling) within 24 hours.
- The last messages do not contain implicit failure signals: "I did not understand," "What?" "A human, please," "I do not follow."
- If a rating was requested, it was 4 out of 5 or higher — a neutral 3 does not count.
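The strict definition above translates almost line-for-line into code. A sketch, assuming the re-ask check and the rating are precomputed fields; in production the "semantic sibling" test needs an embedding comparison, so here it is reduced to a flag:

```python
# Implicit failure signals in the closing messages
FAILURE_PHRASES = {"i did not understand", "what?", "a human, please", "i do not follow"}

def is_resolved(conv):
    """Apply the strict definition: all four criteria must hold."""
    if conv["handed_off"]:
        return False
    if conv["reasked_within_24h"]:        # same question or semantic sibling
        return False
    closing = " ".join(conv["last_messages"]).lower()
    if any(p in closing for p in FAILURE_PHRASES):
        return False
    rating = conv.get("rating")           # None if no rating was requested
    if rating is not None and rating < 4: # a neutral 3 does not count
        return False
    return True
```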
A finer point: a customer who says "thank you" and leaves is not automatically a resolved conversation. "Thank you" in the Gulf is social protocol, not necessarily satisfaction. We look for deeper signals: did they complete the intended action (tracked the order, booked, paid)? Did they return within a week with a related question? Did their channel usage rate go up?
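Those deeper signals can be folded into a simple behavioural score. A sketch; the field names are hypothetical and would come from event logs, not from the conversation text:

```python
def behavioural_satisfaction(conv):
    """Score the behavioural evidence a 'thank you' alone does not prove.
    Returns a value in 0.0 .. 1.0; field names are illustrative."""
    signals = [
        conv.get("completed_intended_action", False),    # tracked order, booked, paid
        conv.get("returned_with_related_question", False),
        conv.get("channel_usage_increased", False),
    ]
    return sum(signals) / len(signals)
```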
What we do every week.
Measurement is not fully automated. At Nuqta, every week we run what we call "the 100 review." We pick a random sample of 100 conversations, classify them by hand, and compare our labels to the bot's automatic labels. The gap between the two is the real compass. If the bot labels 80% of conversations "resolved" while we manually label them 60%, the scorecard is lying by 20 points.
This manual review is expensive. One to three hours per bot per week. But it is the difference between a product that improves and a product that fails in silence.
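At its core, "the 100 review" is a gap computation between two labelling functions. A sketch, where `auto_label` stands in for the bot's own label and `manual_label` for the reviewer's verdict:

```python
import random

def review_gap(conversations, auto_label, manual_label, sample_size=100, seed=7):
    """Compare automatic labels against manual labels on a random sample.
    Returns the gap in percentage points between the two resolution rates."""
    rng = random.Random(seed)  # fixed seed so the weekly sample is reproducible
    sample = rng.sample(conversations, min(sample_size, len(conversations)))
    auto_rate = sum(auto_label(c) for c in sample) / len(sample)
    manual_rate = sum(manual_label(c) for c in sample) / len(sample)
    return round((auto_rate - manual_rate) * 100, 1)
```

With 80 automatic "resolved" labels against 60 manual ones, the function returns a gap of 20 points: a scorecard lying by exactly that much.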
Three common mistakes in measurement.
- Measuring the mean without the distribution: a bot with an average rating of 4 may still hide 30% of customers giving 1s and 2s. The mean tells you the middle, not the tails, and it is the tails that churn.
- Measuring only after the conversation ends: most customers never reply to a closing survey. Measure inside the conversation too: pauses, question repetition, linguistic signs of frustration.
- Measuring a small sample of "good" conversations: do not pick conversations the bot completed. Pick a random sample that includes the abandoned ones. Failure teaches more than success.
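The first mistake is easy to demonstrate in a few lines. A sketch that reports the mean alongside the tail share, on an illustrative distribution where a respectable average hides 30% of customers at the bottom of the scale:

```python
def rating_report(ratings):
    """Return (mean, tail_share): the tail is the fraction of 1s and 2s
    that a healthy-looking mean can conceal."""
    mean = sum(ratings) / len(ratings)
    tail = sum(1 for r in ratings if r <= 2) / len(ratings)
    return round(mean, 2), round(tail, 2)

# Illustrative: 70 customers rate 5, 20 rate 2, 10 rate 1
mean, tail = rating_report([5] * 70 + [2] * 20 + [1] * 10)
# mean is 4.0, yet 30% of customers sit in the churn-risk tail
```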
A real (simplified) example.
In the first month of a banking bot we launched, the automated report said: "2,400 messages, average rating 4.2." Everything looked good. After a manual review of a 100-conversation sample, we discovered:
- Actual resolution rate: 38%, not 80% as the bot had labeled.
- Silent abandonment rate: 50% — half the customers received a first reply, then disappeared.
- Root cause: the bot replied with "Please wait, checking" and then never came back. The customer thinks something is being verified, and gives up.
The fix was not a bigger model, nor new training data. It was a simple programmatic rule: any reply containing "checking" must be followed automatically by a reply within 30 seconds, or else escalate to a human. The next month, resolution rose to 64%, and abandonment dropped to 19%.
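The rule itself is a few lines of guard logic around the reply pipeline. A sketch, where `wait_for_answer` and `escalate` are hypothetical hooks into the bot runtime; the 30-second deadline is the one from the fix above:

```python
import asyncio

CHECK_PHRASE = "checking"

async def guard_checking_reply(reply_text, wait_for_answer, escalate, deadline=30.0):
    """If a reply promises a check, demand a real answer within `deadline`
    seconds; otherwise hand the conversation to a human."""
    if CHECK_PHRASE not in reply_text.lower():
        return reply_text  # ordinary reply, no guard needed
    try:
        return await asyncio.wait_for(wait_for_answer(), timeout=deadline)
    except asyncio.TimeoutError:
        return await escalate()
```

The guard wraps every outgoing reply, so no "checking" message can be the last thing the customer ever hears.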
The improvement did not come from AI. It came from measurement.
Closing.
If a bot provider tells you "the bot answered X messages this month," that is not a report. It is an advertisement. Ask them, in writing, for: resolution rate, handoff rate, silent abandonment rate, and time-to-resolution — with each defined. If they cannot deliver, they are measuring a number that tells them nothing.
At Nuqta, we do not launch a bot before we build its scorecard. Not because we love numbers, but because we have seen what happens when they are missing: the bot keeps running for six months, produces beautiful reports, and loses the customer slowly. Measurement is not an add-on. It is the second half of the product.
Related posts
- Why most Arabic AI bots fail.
It is not the model. It is that we train it on Arabic no one actually speaks, then act surprised when no one understands it back.
- Running a language model inside Oman.
The vision, the engineering, the open-source models we would deploy, and the real cost — for a full year. This is not a sales deck. It is the calculation we put on the table before any client conversation that starts with: why build instead of rent?