How do you build a QA scorecard your AI can score consistently — for your people and your AI agents?

Most quality teams in financial services hold a calibration session every week. Three or four reviewers sit down, score the same handful of calls independently, and then spend an hour arguing about why one of them gave a 4 on “demonstrated empathy” where another gave a 2. They reach a rough truce, write a note in the scoring guide, and do it all again next week. The industry calls this best practice. It’s worth asking what it actually is.

It’s a tax. You are paying QA leads to meet, indefinitely, to paper over criteria that don’t say clearly enough what they mean. The standard advice is to keep calibrating until your evaluators agree more than 85% of the time. The better guides admit the quiet part: if you’re stuck below that line, the fix is sharper criteria, not more meetings.
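If you want to know where you stand against that 85% line, the arithmetic is simple. Here is a minimal sketch, assuming each reviewer has scored the same calls in the same order. The function name, the reviewer names and the scores are all invented for illustration, and this is raw agreement only; chance-corrected measures such as Cohen’s kappa are stricter.

```python
from itertools import combinations

def percent_agreement(scores_by_reviewer):
    """Share of (call, reviewer-pair) comparisons where both reviewers gave the same score.

    scores_by_reviewer maps reviewer name -> list of scores, one per call,
    with every list in the same call order.
    """
    reviewers = list(scores_by_reviewer)
    n_calls = len(scores_by_reviewer[reviewers[0]])
    matches = 0
    comparisons = 0
    for a, b in combinations(reviewers, 2):
        for call in range(n_calls):
            comparisons += 1
            if scores_by_reviewer[a][call] == scores_by_reviewer[b][call]:
                matches += 1
    return matches / comparisons

# Made-up scores for one criterion across five calls (illustrative only).
empathy_scores = {
    "reviewer_1": [4, 3, 2, 4, 5],
    "reviewer_2": [2, 3, 2, 3, 5],
    "reviewer_3": [4, 3, 1, 4, 5],
}
rate = percent_agreement(empathy_scores)
print(f"Agreement on this criterion: {rate:.0%}")
```

Run it per criterion rather than across the whole card. A single low number usually points at one line that needs rewriting.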

So let’s talk about how you write criteria sharp enough that two reviewers stop disagreeing — and, increasingly, sharp enough that an AI can apply them the same way every time, to your people and to the AI agents now handling your calls.

The test every criterion has to pass

Here’s the question to hold over every line on your scorecard. If I marked an agent down on this, could I point to the exact moment in the transcript that justifies it, and would a second person, looking at the same moment, agree?

If yes, the criterion is doing its job. If you find yourself reaching for “well, it’s more of a general feel across the call,” the criterion is the problem. It isn’t describing a behaviour. It’s describing an impression, and impressions are where reviewers diverge, where agents lose trust, and where a score falls apart the moment someone disputes it.

This is the whole game. A scorecard that works is one built entirely of observable behaviours — things a person could find and play back, things a machine could find and quote. Everything that follows is in service of that.

A worked example, from “showed empathy” to something you can point to

Take a criterion almost every scorecard has in some form: “Agent showed appropriate empathy.”

A human reviewer roughly knows what that means, which is exactly the trouble. “Roughly” is where the 2-versus-4 argument comes from. A new starter doesn’t know what it means at all. An AI scorer certainly doesn’t. So you get inconsistency from your people and noise from your automation, and you can’t tell which calls genuinely went wrong.

Now rewrite it as a behaviour.

Acknowledged the customer’s stated problem in their own words before moving to a solution.

That’s a thing you can find in a transcript. The customer says they’re worried about an unexpected charge; the agent either reflects that worry back before launching into process, or they don’t. Two reviewers will agree because there’s nothing to interpret. A machine will score it consistently for the same reason. And when an agent challenges the mark, you have a specific line to point at, not a feeling to defend.

The rewrite does three jobs at once, and it’s worth naming them because they’re the jobs every criterion should do. It makes the score consistent (people and machines land in the same place), evidence-linked (it ties to a moment, not a vibe), and disputable (the agent can see exactly what was marked and argue if they think the call shows otherwise). Hold onto that third one. It does more work than it looks like.

Two kinds of criteria, and why you need both

Once a criterion is traceable, there’s a second decision that shapes the whole scorecard. Is it scoring what the agent did, or what the conversation achieved? Both belong on the card, and they do different jobs.

Process criteria score what happened in the conversation — the steps taken and the behaviours shown. Did the agent verify identity, give the required disclosure, acknowledge the customer’s concern before moving on? Many of these are clear-cut. The step happened or it didn’t, so score them as a clean pass or fail and don’t invent a spectrum where there isn’t one. Process is the part you coach, because it’s specific enough to practise.

Outcome criteria score what the conversation achieved for the customer. Did they understand the fee they’d just agreed to? Was the issue actually resolved, or just ended? Did a customer showing signs of vulnerability get routed to the right support? Outcome is the part a process-only scorecard quietly misses. An agent can tick every step and still leave a confused customer, and a card that only checks the steps will mark that call a success.

You want both, because neither stands on its own. Outcome alone tells you the call went wrong but not where, so you can’t coach it. Process alone tells you every box was ticked but not whether it worked. Score both against the transcript and you can see two things at once: that the outcome missed, and the exact step where it slipped.

The three fixes most scorecards need

When we look at a financial services scorecard for the first time, the same problems show up almost every time. Fixing them is most of the work.

Split the criteria that bundle two things. A line like “asked relevant questions and reassured the customer” is two behaviours wearing one score. An agent can do the first brilliantly and skip the second, and now your reviewer has to average two unrelated things into a single number nobody can interpret. One criterion, one behaviour. Split it.

Merge the criteria that secretly measure the same thing. Most scorecards carry a cluster like “listened actively,” “understood the need,” “asked good questions,” where an agent who scores well on one scores well on all of them. That cluster is the engine of your calibration debates, because reviewers can’t tell which bucket a given moment belongs in. Collapse it into one well-scoped criterion and the arguments stop.

Pull out the things that aren’t behaviours at all. Handle time, hold time, dead air — these end up on quality scorecards constantly, and they don’t belong there. They’re operational metrics, worth tracking, but they’re not things an agent did well or badly in the way a disclosure or a resolution is. Mixing them in punishes an agent for a long call that was long because they did the right thing. Track them separately.
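For the second fix, you don’t have to guess which criteria are secretly measuring the same thing. Past scores will usually tell you. The sketch below assumes pass/fail results per criterion across the same set of calls; the criterion names, the results and the 0.9 threshold are all placeholders we made up for illustration. A flagged pair is a prompt to look, not proof that two lines are duplicates.

```python
from itertools import combinations

def overlapping_criteria(results, threshold=0.9):
    """Return criterion pairs whose pass/fail results match on at least `threshold` of calls.

    results maps criterion name -> list of booleans, one per call, same call order.
    """
    flagged = []
    for a, b in combinations(results, 2):
        same = sum(x == y for x, y in zip(results[a], results[b]))
        share = same / len(results[a])
        if share >= threshold:
            flagged.append((a, b, share))
    return flagged

# Made-up results across ten calls (illustrative only).
results = {
    "listened_actively":        [True, True, False, True, False, True, True, False, True, True],
    "understood_the_need":      [True, True, False, True, False, True, True, False, True, True],
    "asked_good_questions":     [True, True, False, True, False, True, False, False, True, True],
    "gave_required_disclosure": [True, False, True, True, True, False, True, True, False, True],
}
flagged = overlapping_criteria(results)
for a, b, share in flagged:
    print(f"{a} / {b}: same result on {share:.0%} of calls")
```

In this invented data the three “listening” criteria move together and the disclosure criterion doesn’t, which is the shape of cluster worth collapsing into one line.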

The score has to survive a challenge

Here’s the part that matters most for a regulated operation, and the part most scorecard guides skip.

A quality score isn’t just a measurement. It’s a claim about how someone did their job, and sooner or later someone will contest it. An agent will say “that’s not what the call showed.” A team leader will ask why one of their people was marked down. And if AI is doing the scoring, you have a new failure mode: the machine can be confidently wrong, flagging a disclosure as missed when the agent actually gave it thirty seconds before the point the model checked.

The defence against all of this is the same, and it’s the most important design choice on the scorecard. Every score traces back to the moment that earned it. Open the score, you see the line in the transcript. Nothing about the mark is hidden — the agent, the team leader, and the regulator are all looking at the same evidence you scored from. That openness is what makes a score defensible rather than merely asserted, and it’s what lets a wrong score, human or machine, get caught and corrected instead of quietly standing. A score you can’t trace to a moment is a score you can’t defend and can’t dispute, which means it’s a score nobody should trust, least of all you.
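In data terms, “every score traces back to the moment that earned it” means a score is never stored without its evidence. Here is a minimal sketch of that idea, assuming the transcript is a list of lines. The Score fields, the evidence_holds helper and the sample transcript are hypothetical, not a description of any particular product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Score:
    criterion: str
    passed: bool
    line_index: int  # position of the supporting line in the transcript
    quote: str       # the words the scorer relied on

def evidence_holds(score, transcript):
    """True if the quoted evidence really appears at the cited transcript line."""
    if not 0 <= score.line_index < len(transcript):
        return False
    return score.quote.lower() in transcript[score.line_index].lower()

# Invented transcript (illustrative only).
transcript = [
    "Customer: I've been charged a fee I wasn't expecting and I'm worried about it.",
    "Agent: I can see why an unexpected charge would be worrying. Let me look into it.",
    "Agent: Before we go further, I need to confirm a couple of security details.",
]

traceable = Score("acknowledged_stated_problem", True, 1, "unexpected charge would be worrying")
untraceable = Score("acknowledged_stated_problem", True, 2, "I completely understand your concern")

print(evidence_holds(traceable, transcript))    # quote is on the cited line
print(evidence_holds(untraceable, transcript))  # quote is not in the transcript
```

A check like this only confirms that the evidence exists where the scorer said it does. Whether the line actually justifies the mark is still a judgement, which is why the score stays open to challenge.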

Build the scorecard so every criterion ties to something findable, and you get this for free. Build it out of impressions, and no amount of evidence-linking can save it, because there’s no specific moment to link to.

The same rubric now grades your AI agents — and tells you different things about each

This is the shift worth getting ahead of. The AI agents now handling your simpler contacts are part of your workforce, and the question you ask about them is the one you already ask about people. Are they any good, and how do you know?

The answer is that a scorecard built the way we’ve described works on an AI agent without modification. Observable behaviours are observable whoever performed them. If “acknowledged the customer’s stated problem before moving to a solution” is the standard for your people, it’s the standard for the bot too. One rubric, one quality bar across the whole workforce, which is far saner than running a separate quality system for the software and hoping the two standards line up. (We’ve written separately on holding human and AI agents to one quality bar.)

But the same rubric tells you different things about each. Run it across both and a pattern emerges that’s genuinely useful. AI agents tend to ace the process criteria. They never forget the disclosure, never skip verification. Where they slip is on outcomes, confidently giving a wrong answer or missing the hesitation that should have flagged a vulnerable customer. Humans tend to fail the other way, getting the outcome right while skipping a step under pressure. The single rubric doesn’t flatten your workforce into one number. It shows you precisely where each half of it needs attention.
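Seeing that pattern in your own operation takes one extra label on each criterion (process or outcome) and one on each conversation (human or AI). A rough sketch follows; the tuple layout and the pass_rates name are our assumptions, and the sample rows are invented to mirror the pattern described above, not measured results.

```python
from collections import defaultdict

def pass_rates(scored):
    """Pass rate per (agent_type, criterion_kind) from (agent_type, kind, passed) tuples."""
    totals = defaultdict(lambda: [0, 0])  # key -> [passes, total]
    for agent_type, kind, passed in scored:
        totals[(agent_type, kind)][0] += int(passed)
        totals[(agent_type, kind)][1] += 1
    return {key: passes / total for key, (passes, total) in totals.items()}

# Invented rows (illustrative only): one tuple per scored criterion.
scored = [
    ("ai", "process", True), ("ai", "process", True),
    ("ai", "outcome", True), ("ai", "outcome", False),
    ("human", "process", True), ("human", "process", False),
    ("human", "outcome", True), ("human", "outcome", True),
]
rates = pass_rates(scored)
for (agent_type, kind), rate in sorted(rates.items()):
    print(f"{agent_type:<6} {kind:<8} {rate:.0%}")
```

Four numbers instead of one is the point. A single blended score would hide exactly the difference you want to coach and tune against.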

One caution, because it’s the obvious objection. If a vendor both builds AI agents and scores them, as Future Ready does, then “our agents score well” is the vendor marking its own homework, and you should treat it that way. The grades only mean something if the standard is yours: scored on your rubric, every score tied to the transcript moment, every score open to challenge, and any improvement measured causally against an untrained-peer baseline rather than asserted. The customer owns the standard, or the score is just marketing. That’s the test to put to us, and to anyone else selling you agents.

What you’re actually building

It’s tempting to treat the scorecard as a form — a thing you fill in after the call. It isn’t. It’s the written standard your entire workforce is held to, the contract every score refers back to, and the thing you’ll point at when a regulator, a team leader, or an agent asks “says who?”

So build it to be pointed at. Every criterion traceable to a moment, process and outcome both on the card, every score tied to the line that earned it. Get that right and the weekly calibration argument fades, your people trust their scores, your AI agents get held to the same bar as your best humans, and the whole thing stands up when someone leans on it. (For where this scorecard sits in the wider system, see what closed-loop quality assurance actually means.)

Want a second pair of eyes on your current scorecard? We’ll take your existing criteria and show you exactly where they’d trip up a machine — and where they’d trip up a second human, which is usually the same place.