How do you know if your AI customer service agents are actually any good?

An AI agent at an insurer takes a call about a missed premium. The customer asks whether a payment holiday will affect their cover, and the agent says no, in fluent and reassuring language. It is wrong: a payment holiday on that product suspends a benefit the customer relies on. Nothing about the conversation looks wrong, though. It was handled in ninety seconds, the customer rang off without complaint, and the contact closed as resolved. The first anyone hears of it is six weeks later, when the customer tries to use the suspended benefit and can’t.

That call sits in someone’s AI performance dashboard this quarter, filed under success. Handle time, excellent. Containment, a win, because it never reached a human. Satisfaction, fine, because the customer didn’t yet know they’d been told something false. Every number the operation has says the agent did well. The agent gave a customer wrong information about their insurance. Both of those are true at the same time, and the gap between them is what this post is about.

You can’t tell whether your AI agents are good from the metrics you already have. Average handle time, satisfaction scores and containment rates are trailing signals, and they routinely record a failed conversation as a success. The only reliable answer comes from scoring every one of the agent’s conversations against your own quality rubric, finding the failure patterns, fixing them, and re-measuring to prove the fix worked. It’s the same loop you’d run on a person.

The metrics you’d reach for can’t see the failure

Ask most operations how their AI agents are doing and you’ll get the same numbers they use for everything else: handle time, containment or deflection rate, a customer satisfaction score, maybe first-contact resolution. These are the trailing signals the Microsoft Dynamics team flagged in February 2026, when it wrote that despite heavy investment, most organisations still lack a coherent way to measure whether their AI agents are good — and that the usual metrics don’t tell you whether an agent is competent, reliable, or improving.

Worse, these metrics actively mislabel quality. When an AI agent can’t resolve a multi-step request but ends the conversation anyway, the deflection rate records it as handled. The customer experienced a dead end; the dashboard logged a save. And most of those customers won’t correct the record for you. Roughly half of unhappy customers never complain — they just leave, which means your satisfaction scores and complaint counts are quietly undercounting the failures you most need to see.

The scale of the mismatch is now measurable. Qualtrics surveyed more than 20,000 consumers across 14 countries in late 2025 and found that almost one in five people who used AI for customer service got no benefit from it at all — a failure rate close to four times higher than for AI used elsewhere. So the tool you’ve deployed to handle volume is failing at several times the rate you’d tolerate from any other system, and the metrics on your wall are reporting it as healthy. A clean dashboard, in other words, is not evidence that your agents are good. Often it’s evidence that you can’t see what they’re doing.

AI agents fail in a way built to hide from you

A struggling human agent leaves traces. Their handle time creeps up, they escalate more, a team leader overhears a call going sideways, or they simply ask for help. You find out. An AI agent that has learned something wrong gives no such tells. It fails in calm, confident sentences, and it fails the same way every time. One incorrect belief about a product becomes the identical wrong answer across ten thousand conversations before a single person notices.

This is what makes the headline cases instructive, even when they’re not about financial services. Cursor’s support agent invented a policy that didn’t exist and the false answer spread through its user base, triggering cancellations, before anyone on the team realised. Air Canada’s chatbot told a grieving customer about a bereavement refund that wasn’t real, and a tribunal held the airline responsible for what its agent had said. A bank or an insurer should take a narrower lesson than “bots are risky.” The output is yours the moment the agent speaks, and the failure is built to look exactly like a success until a customer forces it into view.

Here is the finding that should reframe how you read your own reporting. When the communications platform Sinch surveyed over 2,500 enterprise decision-makers in 2026, 74% had been forced to roll back or shut down a live AI agent. Among the programmes with the most mature monitoring, the rollback rate was higher still, at 81%. Their agents weren’t worse. They could simply see the failures the less-instrumented operations were sailing past. The organisations reporting no problems were often the ones with the least visibility into what their agents were actually saying.

Read that twice, because it inverts the usual instinct. A spotless AI dashboard is more likely to mean you’re blind than that you’re fine. We’ve argued before that scoring a 1–5% sample of conversations is quality guessing rather than quality assurance, and that scoring 100% is the only honest baseline. That case is stronger for AI than it ever was for people, precisely because the AI failure is invisible to everything else you measure.

“Any good” is really three questions

“Is it any good” bundles three questions, and most operations only ever answer the first.

Is it competent? Does the agent get the outcome right, judged against your standard rather than its own fluency? The correct policy explanation, the right escalation, the moment it should have noticed a vulnerable customer. Fluent and correct are not the same thing, and only the second one keeps a customer whole.

Is it improving? Or did you pass it at launch and move on? An agent that was acceptable on day one and has been static ever since is not a fixed asset. It’s an unmonitored one.

Is it degrading? This is the question almost nobody asks. IDC noted in mid-2026 that without active tuning, agent performance drifts downward as the world shifts and edge cases accumulate. The model is frozen; your business isn’t. You launch a new product, reprice a policy, change a hardship process, and a fraud pattern you’ve never seen turns up on Tuesday. An agent that passed every test in the spring can be confidently wrong by the autumn, on exactly the conversations that matter most. Which is why “are my agents good” is a question you keep answering, not a box you tick at go-live.

The answer is the loop you’d run on a person

You already own the method. You find out whether a human agent is any good by scoring the work, coaching the gap, and checking it changed on the next call. You find out about an AI agent the same way. Three steps, applied to software.

Score every conversation on your own rubric. Not the vendor’s benchmark, and not a generic model of a good call. The same observable-behaviour criteria you’d hold a person to, written so a machine can apply them consistently, run across 100% of the agent’s conversations. Observable behaviours are observable whoever performed them, so the rubric that scores your people scores the agent without translation.
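To make that concrete, here is a minimal sketch of what scoring every conversation against a rubric of observable behaviours could look like. Everything in it is an assumption for illustration: the `Turn`, `Criterion` and `Score` types, the two criterion names, and especially the `agent_says` phrase match, which is a toy stand-in for whatever judgement a real scoring system applies. The point it shows is the shape of the output: one pass or fail per criterion per conversation, each carrying a pointer to the turn that earned it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Turn:
    speaker: str  # "agent" or "customer"
    text: str


# A check returns the index of the turn that earned the pass, or None.
Check = Callable[[List[Turn]], Optional[int]]


@dataclass(frozen=True)
class Criterion:
    name: str
    find_evidence: Check


@dataclass(frozen=True)
class Score:
    criterion: str
    passed: bool
    evidence_turn: Optional[int]  # points back into the transcript


def agent_says(phrase: str) -> Check:
    """Toy stand-in for a real check: first agent turn containing the phrase."""
    def check(turns: List[Turn]) -> Optional[int]:
        for index, turn in enumerate(turns):
            if turn.speaker == "agent" and phrase.lower() in turn.text.lower():
                return index
        return None
    return check


# Hypothetical rubric; a real one would carry your own criteria.
RUBRIC = [
    Criterion("explains_benefit_suspension", agent_says("suspend")),
    Criterion("offers_human_escalation", agent_says("speak to a colleague")),
]


def score_conversation(turns: List[Turn], rubric: List[Criterion]) -> List[Score]:
    scores = []
    for criterion in rubric:
        index = criterion.find_evidence(turns)
        scores.append(Score(criterion.name, index is not None, index))
    return scores


def score_all(conversations: Dict[str, List[Turn]],
              rubric: List[Criterion]) -> Dict[str, List[Score]]:
    """Score every conversation, not a sample."""
    return {cid: score_conversation(turns, rubric)
            for cid, turns in conversations.items()}


# Two invented transcripts of the payment-holiday question.
conversations = {
    "c1": [Turn("customer", "Will a payment holiday affect my cover?"),
           Turn("agent", "No, your cover is unaffected.")],
    "c2": [Turn("customer", "Will a payment holiday affect my cover?"),
           Turn("agent", "Yes, it will suspend your income benefit.")],
}
results = score_all(conversations, RUBRIC)
```

In this toy data, conversation `c1` fails the suspension criterion with no evidence turn, while `c2` passes it and points at the agent's reply. The same rubric object would score a human's transcript without any change, which is the property the paragraph above relies on.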

Fix the exact gap. With a person, this is coaching, usually voice role-play against the kind of call they fumbled. With an AI agent the step looks different but the job is identical. The scores cluster the failures for you: the agent misstates this one product, or mishandles this one vulnerability signal, in a recognisable pattern. You fix the cause rather than the symptom — the prompt, the knowledge it retrieves from, a guardrail, an escalation rule that should have fired. Either way you’re turning a found failure into a specific change.
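As a rough illustration of how scores can cluster failures, the sketch below tallies failed criteria by topic and ranks the combinations that recur. The `ScoredItem` record, the `topic` label and the `min_count` threshold are all hypothetical; a real system would likely group on richer signals than a single topic tag.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class ScoredItem(NamedTuple):
    conversation_id: str
    topic: str       # hypothetical label, e.g. a product or intent
    criterion: str
    passed: bool


def failure_patterns(items: List[ScoredItem],
                     min_count: int = 2) -> List[Tuple[Tuple[str, str], int]]:
    """Rank (topic, criterion) pairs by how many scored items failed them."""
    counts = Counter((item.topic, item.criterion)
                     for item in items if not item.passed)
    return [(pattern, n) for pattern, n in counts.most_common()
            if n >= min_count]


# Invented scores: one recurring failure and one isolated miss.
items = [
    ScoredItem("c1", "payment_holiday", "explains_benefit_suspension", False),
    ScoredItem("c2", "payment_holiday", "explains_benefit_suspension", False),
    ScoredItem("c3", "payment_holiday", "explains_benefit_suspension", True),
    ScoredItem("c4", "address_change", "confirms_identity", False),
    ScoredItem("c5", "payment_holiday", "explains_benefit_suspension", False),
]
patterns = failure_patterns(items)
```

Here the only pattern that clears the threshold is the payment-holiday explanation, failed three times. The one-off identity miss is filtered out, so the list points at the cause worth fixing first rather than at every individual error.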

Re-measure, causally. Then prove the fix worked. Did the corrected behaviour actually hold on subsequent real conversations, or did the average merely drift for reasons that have nothing to do with your change? The only honest answer compares the same behaviour after the fix against a baseline, so you can say the change caused the gain rather than coincided with it. That distinction between a causal result and a vanity number is the whole difference between knowing your agent improved and hoping it did.
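One common way to separate a caused gain from a coincidental one is a difference-in-differences comparison. That is a technique chosen here for illustration, not one the loop prescribes. The sketch assumes you can split scored conversations into those the fix targeted and a comparison group it did not touch, and that both groups would otherwise have moved together. The function names and the numbers are invented.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Share of conversations that passed the criterion."""
    if not outcomes:
        raise ValueError("need at least one scored conversation")
    return sum(outcomes) / len(outcomes)


def diff_in_diff(treated_before: Sequence[bool],
                 treated_after: Sequence[bool],
                 control_before: Sequence[bool],
                 control_after: Sequence[bool]) -> float:
    """Change in the fixed behaviour minus the change in the untouched one."""
    treated_change = pass_rate(treated_after) - pass_rate(treated_before)
    control_change = pass_rate(control_after) - pass_rate(control_before)
    return treated_change - control_change


# Invented pass/fail outcomes, 100 conversations per group and period.
treated_before = [True] * 60 + [False] * 40   # targeted behaviour, pre-fix
treated_after = [True] * 90 + [False] * 10    # targeted behaviour, post-fix
control_before = [True] * 80 + [False] * 20   # untouched behaviour, pre-fix
control_after = [True] * 85 + [False] * 15    # untouched behaviour, post-fix

effect = diff_in_diff(treated_before, treated_after,
                      control_before, control_after)
```

In this example the targeted behaviour rose by 30 points, but the untouched one rose by 5 over the same period, so only about 25 points can reasonably be credited to the fix. Reporting the raw 30 would be the vanity number; the adjusted figure is the one that survives scrutiny, provided the comparison group really is comparable.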

Run that loop and a useful pattern surfaces almost immediately: AI agents and humans tend to fail in opposite directions. The agent nails every process step and then slips on the outcome, confidently giving a wrong answer or missing the hesitation that should have flagged a vulnerable customer. People do the reverse, reaching the right outcome while skipping a step under pressure. Holding both to one quality bar doesn’t flatten your workforce into a single number. It shows you precisely where each half of it needs attention.

“But you’re marking your own homework”

There’s an objection that any vendor who both builds AI agents and grades them has to answer, and Future Ready is one of those vendors, so it lands on us as squarely as on anyone. If the company selling you the agent is also the company telling you it scores well, the grade is worth roughly nothing on its own. It isn’t a problem unique to us, either. Microsoft scores its Dynamics agents against its own Multimodal Agent Score; Intercom measures Fin on Intercom’s metrics. Useful internally, but that’s the vendor’s standard applied to the vendor’s agent, which is the definition of marking your own homework.

The grade only means something if the standard is yours. So here are four questions to put to anyone who tells you their AI agent is good — us included, and harder on us than on most:

Is it scored on my rubric, or on yours?
Does every score point to the exact moment in the transcript that earned it, so I can check it myself?
Can I dispute a score, and will it change if I’m right?
Is any improvement measured causally against a baseline, or is it just an average that happened to go up?

If the answer to any of those is no, what you’re being shown is marketing, not measurement. If the answer to all four is yes, then you own the standard, the evidence sits under every score, and the vendor’s incentive to flatter itself has nowhere to hide. That’s the bar an agent should clear before it goes in front of your customers, and it’s the substance behind the idea that the best agents arrive already carrying their own proof rather than asking you to take them on faith.

Why this is worth getting right now

The volume going to AI agents is only heading one way. Gartner expects them to autonomously resolve around 80% of common service issues by 2029, which leaves your people the hard, emotional fifth the agents can’t handle. A botched AI resolution doesn’t vanish; it lands on one of those humans as a harder call later. So getting AI quality right and human quality right stop being two projects. They’re the same one, measured the same way.

You can’t scale what you can’t measure, and in a regulated operation you can’t defend it either, to a board or a regulator or a customer who was told something untrue about their money. The operations pulling ahead with AI agents can see their agents fail, and they can prove they fixed it. A tidy dashboard never showed anyone that. This is the loop Future Ready was built around, applied to AI agents rather than people, and it’s why our customers see an average 3.7x return on it: the work that catches the failure is the same work that improves the agent.

A clean dashboard is a claim. The loop is the evidence. Only one of them survives the call where the agent got it wrong.

Want to know whether your AI agents are actually any good? We’ll score a week of their real conversations on your own rubric and show you the failure patterns hiding behind the dashboard — and what fixing them would be worth.