How do you evidence good outcomes on every call, not just the ones you sampled?

Somewhere in a board paper this quarter, your firm will state that it delivers good outcomes for its customers — that people understand what they’re buying and get the support they need. Underneath that statement, in most operations, sits the 1–5% of calls a quality team can review by hand. The statement is about every customer you have. The evidence is about a few of them.

That gap is what this post is about, and at heart it isn’t a regulatory one. The distance between what a firm asserts about its customers and what it can actually show is there with or without a regulator looking. What regulation does is make someone check. The UK’s Consumer Duty is the most explicit published version of a direction every major conduct regulator is moving in: it asks a firm’s board to report each year on the outcomes customers actually received, not the activity behind them. It’s worth reading closely even outside the UK, as the example rather than the subject. The job it describes, proving good outcomes to the board that signs them off, is the same whether you answer to the FCA, a Nordic supervisor, or your own risk committee.

Evidencing a good customer outcome means showing, with proof you can trace, that a customer understood what they agreed to and got the support they needed. Activity metrics (calls handled, average handle time, survey scores) describe effort, not outcomes. Evidence of the outcome itself lives in what was said on the call, which means you can only produce it for the calls you actually examined. Examine a sample, and you can evidence a sample.

Activity is not evidence

Walk into most quality reviews and you’ll find no shortage of data. Handle times, contact volumes, first-contact resolution, post-call survey scores, complaint counts. This is the management information (MI) every contact centre already produces, and a board report can be thick with it.

None of it evidences an outcome. A short handle time is consistent with a customer who understood everything and a customer who gave up asking. A high survey score tells you how the customers who answered the survey felt, which is not the same as how the silent majority fared. Complaint volumes count the people who complained, not the people who were quietly left worse off and never said so. These are proxies, and the FCA said as much in its April 2026 reflection on second-year board reports: some firms presented extensive data without explaining how it actually demonstrated good or poor outcomes, and boards should push for analysis that goes beyond the dashboard.

That phrase, “beyond the dashboard,” is the whole job. An outcome is something that happened, or didn’t, to a specific customer on a specific call. Did they understand the fee they’d just agreed to? Was the person showing signs of distress actually helped, or just processed? You evidence that by looking at the conversation. There is no proxy that stands in for it.

Two of the four outcomes live inside the conversation

It’s worth being precise about where this bites, because not everything a board signs off sits in the same place. The UK’s Consumer Duty happens to name the split cleanly, in four outcomes, and the distinction travels well beyond it. Products and price are largely designed and governed upstream, before a customer ever calls. You can evidence them from product reviews and fair-value assessments without listening to anyone.

Consumer understanding and consumer support are different. They are delivered in the interaction itself, by an agent or increasingly by an AI agent, one conversation at a time. Whether a customer genuinely understood is something that either shows up in what they said back, or it doesn’t. Whether they were supported is visible in how the call was handled when it got difficult. These two outcomes are produced live, on the call, and that is exactly where they have to be evidenced.

It’s also exactly where the FCA says firms are thinnest. Its April 2026 note observed that some reports still leaned more on products, services and value than on understanding and support, and asked firms to be able to evidence how they test communications, assess whether customers actually comprehended them, and respond when a customer’s behaviour shows they didn’t. Every one of those is a thing you find by examining the conversation. Which brings us back to the sample.

Why a sample can’t carry the claim

We’ve made the full case elsewhere for why sampling 1–5% of calls is quality guessing rather than quality assurance, but the point lands differently when the job is evidence rather than coaching.

A reviewer pulling five calls a month for an agent tends to catch that agent on five ordinary days. The calls where understanding broke down, where a vulnerable customer was missed, where support quietly failed: these are rare, so they are statistically unlikely to land in a small random pull. The sample ends up dominated by the unremarkable, which is a manageable problem when you’re spotting coaching gaps and a serious one when you’re standing behind a statement to your board.

Picture a retail bank’s support line. A customer in financial difficulty asks about a payment deferral, doesn’t follow the explanation of how it will mark their credit file, agrees anyway, and rings back distressed two months later when a loan application is declined. The first call is the one that needed to be evidenced: understanding failed, and a vulnerable customer wasn’t supported through it. In a 3% random sample, the chance of that call being reviewed is roughly three in a hundred. The board report, written from the sample, will say the firm supports customers in difficulty well. It will be a true description of the calls you happened to look at and a false one about the call that mattered.

You cannot evidence an outcome you never observed. Sampling leaves that hole open by design, and no amount of careful report-writing closes it.
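
The three-in-a-hundred figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each call is picked for review independently with probability equal to the sample rate; the function names and the figure of 20 failing calls are illustrative, not taken from any real operation.

```python
def chance_any_reviewed(sample_rate: float, failing_calls: int = 1) -> float:
    """Probability that at least one of `failing_calls` lands in the sample.

    Assumption: every call is sampled independently at `sample_rate`.
    """
    return 1 - (1 - sample_rate) ** failing_calls


def expected_unreviewed(sample_rate: float, failing_calls: int) -> float:
    """Expected number of failing calls that nobody ever reviews."""
    return failing_calls * (1 - sample_rate)


# One specific call in a 3% sample: about a 3% chance of being reviewed.
p_single = chance_any_reviewed(0.03)

# Hypothetical month with 20 calls where understanding failed.
p_any_of_20 = chance_any_reviewed(0.03, failing_calls=20)  # about 0.46
missed_of_20 = expected_unreviewed(0.03, failing_calls=20)  # about 19.4

print(f"single call reviewed: {p_single:.2%}")
print(f"at least one of 20 reviewed: {p_any_of_20:.2%}")
print(f"expected failing calls never reviewed: {missed_of_20:.1f}")
```

Under those assumptions, even with 20 failing calls in the month there is a better-than-even chance that the sample contains none of them, and about 19 of the 20 go unreviewed either way.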

What board-ready evidence actually looks like

Scoring 100% of conversations is the part that’s no longer hard: automated scoring the moment a call ends is something tools in this category can now do. But coverage on its own just produces a bigger dashboard, and the FCA has already said the dashboard isn’t the point. The thing that turns coverage into evidence is what each score is tied to.

A defensible outcome judgement points to the moment that earned it. The score that says a customer didn’t understand the fee opens onto the exact exchange where they asked twice and the agent moved on. The score that says a vulnerable customer was supported well opens onto the moment the agent caught the signal and changed tack. Open the score, see the evidence. That traceability is what lets a board, an internal auditor, or the FCA look at the same thing you looked at, rather than taking your summary on trust. (It’s the same property that lets an individual score survive a challenge from the agent it was applied to, scaled up to a claim a board has to sign.)
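
One way to picture “open the score, see the evidence” is as a record that cannot be separated from the moment behind it. The sketch below is illustrative only: the class names, fields, and example call are hypothetical and do not describe any particular product’s schema.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceMoment:
    """A span of one call that a reviewer can open and read or replay."""
    call_id: str
    start_seconds: float
    end_seconds: float
    transcript_excerpt: str


@dataclass(frozen=True)
class OutcomeJudgement:
    """A score for one outcome on one call, tied to the moments that earned it."""
    outcome: str      # e.g. "consumer_understanding" or "consumer_support"
    passed: bool
    rationale: str
    evidence: tuple[EvidenceMoment, ...] = ()

    def is_traceable(self) -> bool:
        # Traceable means at least one moment, each with a non-empty excerpt.
        return bool(self.evidence) and all(
            m.transcript_excerpt.strip() for m in self.evidence
        )


def inspectable_failures(judgements: list[OutcomeJudgement]) -> list[OutcomeJudgement]:
    """Failed judgements a board or auditor could actually open and test."""
    return [j for j in judgements if not j.passed and j.is_traceable()]


fee_confusion = OutcomeJudgement(
    outcome="consumer_understanding",
    passed=False,
    rationale="Customer asked about the fee twice; agent moved on.",
    evidence=(
        EvidenceMoment("call-0001", 312.0, 341.5, "Sorry, what is that fee for again?"),
    ),
)
untraceable = OutcomeJudgement("consumer_support", False, "Score only, no moment attached.")

print([j.outcome for j in inspectable_failures([fee_confusion, untraceable])])
```

The design choice worth noting is that a score with no attached moment is filtered out rather than reported: it may well be right, but nobody downstream can check it.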

This matters more now that the FCA has asked boards to evidence the challenge they brought to the report — to show, in the minutes, what they tested and questioned. A board can only test evidence it can actually inspect. “Trust the 2% we sampled” is not testable. “Here is every call where understanding broke down, with the moment it happened attached” is.

The action you took last year — did it actually work?

There’s a second evidence gap the board report exposes, and it’s the one the FCA was sharpest about in its first-year review. Firms described actions they had taken in response to a problem, but couldn’t show that the action had actually fixed it. The remediation was asserted, not evidenced.

This is the part of the loop most quality programmes skip. You find a problem (say, agents skipping the affordability explanation on a particular product), you coach it, and you write in next year’s report that you addressed it. But did the behaviour change on subsequent real calls? A rising average doesn’t prove it; averages drift for all sorts of reasons that have nothing to do with your intervention. The only honest answer comes from measuring the same behaviour again afterwards, on real calls, against a baseline of comparable agents who weren’t coached on it. If the coached group improved and the others didn’t, you have strong evidence that your action, rather than general drift, drove the change. That’s the difference between a causal result and a vanity number, and it’s what turns “we took action” into “we took action and here’s the proof it worked.”
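
The before-and-after comparison against an uncoached baseline is, in statistical terms, a difference-in-differences estimate. Here is a minimal sketch of that arithmetic, assuming you already have, per agent, the share of calls on which the affordability explanation was given. The numbers are invented for illustration, and the estimate is only as good as the assumption that both groups would have drifted alike without coaching; it is not a significance test.

```python
from statistics import mean


def difference_in_differences(
    coached_before: list[float],
    coached_after: list[float],
    comparison_before: list[float],
    comparison_after: list[float],
) -> float:
    """Change in the coached group minus change in the comparison group."""
    coached_change = mean(coached_after) - mean(coached_before)
    comparison_change = mean(comparison_after) - mean(comparison_before)
    return coached_change - comparison_change


# Hypothetical per-agent rates: share of calls with the explanation given.
coached_before = [0.60, 0.55, 0.65]
coached_after = [0.85, 0.80, 0.90]
comparison_before = [0.62, 0.58, 0.60]
comparison_after = [0.66, 0.64, 0.65]

effect = difference_in_differences(
    coached_before, coached_after, comparison_before, comparison_after
)
print(f"estimated effect of coaching: {effect:+.2f}")  # about +0.20
```

In this made-up example the coached agents rose by 25 points, but 5 of those points also showed up among agents who were never coached, so only about 20 points can reasonably be credited to the intervention.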

A board report built this way reads differently. It doesn’t just list problems found and actions planned. It closes last year’s actions with evidence of their effect, which is the standard the FCA has been moving towards report by report.

When someone else is on the call

One more gap, and it’s the one most firms have thought about least. The FCA’s April 2026 note singled out the monitoring of outcomes in distribution chains and outsourcing arrangements as often weak, and was blunt about why it doesn’t let you off the hook: you are responsible for the outcomes your products deliver regardless of who interacts with the customer.

If part of your customer contact runs through an outsourced partner, your evidence of those outcomes is usually thinner than your evidence of your own — a sample of someone else’s calls, summarised in someone else’s MI. Full coverage of the outsourced conversations, scored on your rubric rather than theirs, is how you actually discharge that oversight instead of assuming it. The same logic reads the other way round for a business process outsourcer: being able to hand your financial services client board-ready evidence across every call you handle for them, on their standard, is fast becoming the difference between a partner they trust and one they audit nervously.

The report is a deadline. The evidence is a system.

The annual board report has a date on it. In the UK it’s 31 July this year, for the third cycle of Consumer Duty reports; wherever you sit, your own reporting calendar sets one. The monitoring underneath the report shouldn’t have a date at all. Most firms experience report season as a scramble, assembling MI and chasing examples in the weeks before the deadline to support a conclusion they’ve half-formed already. That’s a symptom of evidence being generated to order rather than continuously.

Score every conversation as it happens, with each outcome judgement tied to its moment, and report season changes character. The evidence is already there, accumulating call by call across the whole year. Writing the report becomes a matter of reading what the system already holds, not manufacturing a case against a deadline. And the same body of evidence answers the regulator’s harder follow-up questions — the ones that come after the report, when someone asks to see the calls behind a particular claim.

This is the loop Future Ready was built around, applied to the specific job of evidencing outcomes rather than coaching agents, and it’s why our customers see an average 3.7x return on it: the work that produces the proof is the same work that improves the calls. You don’t run a quality programme and an evidence-gathering exercise as two separate things. They’re the same thing, looked at from two ends.

A programme that samples can tell your board the firm probably delivers good outcomes. A programme that scores every call can show them, with the calls attached. Only one of those survives the next question.

Before report season, see what your own evidence base actually looks like. We’ll run the loop on a week of your real conversations and show you which outcomes you can already evidence on every call — and which ones your current sample is quietly guessing at.