Your average quality score is up nine points since you rolled out the new coaching programme. The slide looks like proof. Leadership reads it as proof. It isn’t proof of anything, and the distance between what that number shows and what it’s being taken to mean is where most coaching budgets quietly lose their credibility.
Sooner or later someone whose job is to be unconvinced, usually the CFO, asks the one question that decides whether the spend survives the next budget round: how do you know the coaching did that? If the honest answer is “the number went up after we started,” you don’t have a result. You have a coincidence you’re hoping is a result.
This post is about the difference between those two things, and how you build the kind of number that survives the question.
A vanity average is a number that went up. It tells you something changed; it can’t tell you your coaching changed it. A causal measurement compares the change in your coached agents against an untrained-peer baseline: comparable agents who weren’t coached on that gap, measured on the same behaviour over the same period. If the coached group improved more than the peers did, the extra is the gain you can attribute to coaching. That difference, not the headline average, is the result worth reporting.
Why a number that went up proves nothing
An average is the sum of everything that happened, and most of what happened had nothing to do with your coaching.
Your call mix got easier when the renewals spike passed. Two recent hires joined your team and started scoring well, while your three weakest agents left and were replaced by quicker studies. A noisy product launch settled down, and the angry calls stopped. Every one of those moves the average up, and not one of them is your coaching programme. Run the numbers in a quieter quarter and the line goes up on its own.
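To make the turnover effect concrete, here is a minimal sketch with made-up scores. The agent labels and numbers are illustrative assumptions, not real data. No individual agent’s score changes between the two quarters, yet the team average still climbs because three low scorers were replaced.

```python
from statistics import mean

# Hypothetical quality scores per agent. Agents f, g and h leave after Q1
# and are replaced by i, j and k. Nobody who stays changes at all.
q1 = {"a": 78, "b": 81, "c": 74, "d": 85, "e": 79, "f": 52, "g": 55, "h": 58}
q2 = {"a": 78, "b": 81, "c": 74, "d": 85, "e": 79, "i": 76, "j": 80, "k": 73}

# The headline number: team average in Q2 minus team average in Q1.
headline_change = mean(list(q2.values())) - mean(list(q1.values()))

# The same comparison restricted to agents present in both quarters.
present_in_both = sorted(q1.keys() & q2.keys())
same_agent_change = mean([q2[a] - q1[a] for a in present_in_both])

print(f"Headline average change: {headline_change:+.2f}")
print(f"Change among agents present in both quarters: {same_agent_change:+.2f}")
```

With these numbers the headline average rises by eight points while the agents who were there all along did not move.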
So a rising average isn’t lying to you. It’s answering a different question from the one you asked. You asked “did my coaching work?” The average answered “did things get better in aggregate?” Those come apart all the time, and the gap between them is invisible on the slide. The number looks the same whether you earned it or inherited it.
Training evaluation has had a name for this missing step for over sixty years. Donald Kirkpatrick’s model, the standard framework since 1959, puts business results at the top as Level 4 — and Level 4, the part that asks whether the training caused the result, is the level organisations routinely skip. Not because they’re lazy, but because isolating the training’s contribution from everything else moving at the same time is genuinely hard. Most programmes stop at “people attended and liked it,” because that part is easy to count. The part that justifies the budget is the part nobody measures.
The worse problem: your before-and-after is inflated
There’s a second issue, and it’s sharper than drift, because it pushes the number in a predictable direction: up.
You don’t coach everyone. You coach the agents who scored worst on the behaviour you care about. That’s the right thing to do operationally, and it’s also the exact setup for a statistical trap.
A low score is part skill and part luck. The skill is real and it persists. The luck — which calls happened to land in the sample that month, a bad afternoon, a run of genuinely difficult customers — does not. Measure those same agents again next month and the unlucky scores drift back up towards each agent’s true level on their own, with no coaching at all. You selected the group precisely because their scores were low, and low scores contain more bad luck than high ones, so the group as a whole was always going to look better on the next measurement.
This means that when you measure your coached agents against their own past, part of the improvement you see is just the bad luck washing out. You will record a gain whether the coaching worked, did nothing, or made things slightly worse. The method manufactures a result.
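A small simulation shows the effect. This is a sketch under stated assumptions: each agent has a fixed true skill, each monthly score is that skill plus independent luck, and the spreads (8 points each) are invented for illustration. Nobody is coached, but the lowest scorers from month one still look better in month two.

```python
import random
from statistics import mean


def simulate_uncoached_rebound(n_agents=200, n_selected=40, seed=0):
    """Pick the lowest month-1 scorers and re-measure them with no coaching.

    All parameters are illustrative assumptions, not calibrated to real data.
    """
    rng = random.Random(seed)
    # True skill persists from month to month.
    skill = [rng.gauss(70, 8) for _ in range(n_agents)]
    # Observed score = skill + luck; the luck is redrawn each month.
    month1 = [s + rng.gauss(0, 8) for s in skill]
    month2 = [s + rng.gauss(0, 8) for s in skill]  # skill unchanged: no coaching
    # Select the worst scorers in month 1, as a coaching programme would.
    lowest = sorted(range(n_agents), key=lambda i: month1[i])[:n_selected]
    before = mean([month1[i] for i in lowest])
    after = mean([month2[i] for i in lowest])
    return before, after


before, after = simulate_uncoached_rebound()
print(f"Selected group, month 1: {before:.1f}")
print(f"Same group, month 2 (no coaching): {after:.1f}")
print(f"Apparent gain from doing nothing: {after - before:+.1f}")
```

The size of the apparent gain depends on how much of a score is luck rather than skill, which is why the simulation’s spreads are flagged as assumptions.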
This isn’t an obscure effect or a new one. It’s regression to the mean, and Campbell and Stanley catalogued the uncontrolled before-and-after as a standard threat to the validity of any result back in 1963. The fix they prescribed then is the same one that works now. And it has an uncomfortable implication worth saying plainly: every vendor before-and-after that coaches the low performers and then compares them to their own starting point is reporting a gain inflated in a known direction. If Future Ready measured that way, ours would be too. So we don’t.
What a causal measurement actually is
The fix is old and boring, which is usually a sign it works. You need a comparison group.
Build an untrained-peer baseline: a set of comparable agents, matched on the things that matter (similar tenure, similar call types, similar starting scores), who were not coached on that specific gap. Measure the same behaviour in both groups, before and after. The coached group changes. The peer group changes too, carried by all the same drift and the same regression to the mean. The causal gain is what’s left when you subtract the peer group’s change from the coached group’s change.
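One way the matching step could look, as a rough sketch rather than a description of any particular tool: pair each coached agent with the uncoached agent on the same call type whose starting score is closest, within tolerances. The `Agent` fields, the `match_peers` helper and the tolerance values are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    tenure_months: int
    call_type: str
    start_score: float


def match_peers(coached, candidates, max_score_gap=5.0, max_tenure_gap=6):
    """Greedy one-to-one matching on call type, tenure and starting score.

    The tolerances are assumptions; a real programme would set its own.
    Returns (coached_agent, peer_or_None) pairs.
    """
    available = list(candidates)
    pairs = []
    for agent in coached:
        eligible = [
            c for c in available
            if c.call_type == agent.call_type
            and abs(c.tenure_months - agent.tenure_months) <= max_tenure_gap
            and abs(c.start_score - agent.start_score) <= max_score_gap
        ]
        if not eligible:
            # Report unmatched agents instead of forcing a poor match.
            pairs.append((agent, None))
            continue
        best = min(eligible, key=lambda c: abs(c.start_score - agent.start_score))
        available.remove(best)  # each peer is used at most once
        pairs.append((agent, best))
    return pairs


coached = [Agent("A", 12, "billing", 60.0), Agent("B", 3, "renewals", 55.0)]
uncoached = [
    Agent("P1", 10, "billing", 62.0),
    Agent("P2", 11, "renewals", 60.0),
    Agent("P3", 30, "billing", 60.0),
]
for agent, peer in match_peers(coached, uncoached):
    print(agent.name, "->", peer.name if peer else "no comparable peer")
```

Returning `None` for an agent with no comparable peer keeps the gaps in the match visible, so they can be reported alongside the result.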
If both groups rose by the same amount, your coaching added nothing on top of what would have happened anyway, and you’ve found that out cheaply instead of paying for it for another year. If the coached group rose more than the peers, that extra is real. It’s the part you caused, and it’s the part you can defend. Researchers call this comparison a difference-in-differences; you don’t need the term, you need the move. The change in the people you coached, minus the change in the people you didn’t.
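The arithmetic is short enough to write out. Below is a minimal sketch with invented scores; the function name and the numbers are illustrative assumptions. A real analysis would also want some measure of uncertainty around the result, which this leaves out.

```python
from statistics import mean


def difference_in_differences(coached_before, coached_after,
                              peers_before, peers_after):
    """Return (coached change, peer change, coached change minus peer change)."""
    coached_change = mean(coached_after) - mean(coached_before)
    peer_change = mean(peers_after) - mean(peers_before)
    return coached_change, peer_change, coached_change - peer_change


# Hypothetical scores on one behaviour, measured over the same two periods.
coached_change, peer_change, causal_gain = difference_in_differences(
    coached_before=[58, 61, 55, 62],
    coached_after=[70, 72, 66, 72],
    peers_before=[59, 60, 56, 61],
    peers_after=[64, 66, 61, 65],
)

print(f"Coached group change: {coached_change:+.1f}")
print(f"Matched peer change:  {peer_change:+.1f}")
print(f"Gain attributable to coaching: {causal_gain:+.1f}")
```

In this made-up case the coached group rises 11 points and the peers rise 5, so a before-and-after would report 11 while the defensible figure is 6.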
A word on honesty, because this is the post the rest of our numbers lean on. This is not a randomised clinical trial. You’re working with a live operation, not a laboratory, and where you can’t randomise who gets coached you match peers as closely as the data allows and you say so. A method you’re candid about the edges of is more trustworthy than one sold as airtight. The point isn’t perfection. It’s that a matched-peer comparison removes the two biggest reasons a before-and-after lies to you, and a raw average removes neither.
Why this matters more once AI agents are in the mix
The engineers shipping your AI agents already work this way.
When a disciplined team changes an AI agent’s prompt, they don’t read the next day’s numbers and declare victory. They run the new version against the old one on comparable traffic, change one variable at a time, and attribute the lift to the change or to nothing. Alter the prompt and the tools together and you’ve proven nothing about either, so they don’t. The whole practice is built on the assumption that an uncontrolled before-and-after is worthless, which is exactly the assumption contact-centre training has never made.
For decades, coaching has been evaluated with the same uncontrolled before-and-after those engineers would reject for a one-line prompt edit. The bar that AI engineering takes for granted is the bar human coaching never had. Now that you’re running people and AI agents side by side, the sensible thing is one causal standard across both: the same quality bar, measured the same way, whoever or whatever handled the call.
It’s also the reason that, when we wrote about how to tell whether your AI agents are actually any good, the fourth question to put to any vendor was whether improvement is measured causally against a baseline, or is just an average that happened to drift up. This post is the long answer to that question. The same trap catches a claim about an AI agent as easily as a claim about a person: target the weak spot, re-measure against itself, and the number flatters you whether you fixed anything or not.
The number that survives the CFO
A causal gain does three things a vanity average can’t, and each of them is worth money.
It survives challenge. When the CFO asks how you know, you have an answer that doesn’t depend on trust: here is the group we coached, here are the matched peers we didn’t, here is the difference between them. That’s a number a finance team can interrogate and still believe, which is the only kind worth putting in front of a board. It’s the same evidence standard that lets you show a board the outcomes actually improved rather than asserting it.
It tells you what to stop doing. A vanity average flatters every programme equally, so you keep funding coaching that does nothing, because the line went up after all of it. A causal method shows you the coaching that moved your coached agents no more than the untrained peers, and lets you redirect that spend somewhere it works. Knowing which half of your coaching is wasted is worth as much as knowing which half pays.
And it keeps your headline number honest. When Future Ready quotes an average 3.7x return, that figure is a causal gain measured against a baseline, not a before-and-after we let drift in our favour. The discipline is the reason the number isn’t bigger. It’s also the reason you can take it to your board without it falling apart under the first hard question. A larger figure built on an uncontrolled average would look better on a slide and mean less in a budget meeting.
This is the third step of the closed loop, the one most tools quietly skip. Score and coach are visible and easy to sell. Re-measure causally is the unglamorous step that turns the other two from activity into proof. Leave it out and you have a quality programme that feels productive and can’t show its work. Put it in and every other number you report inherits its credibility.
A rising average tells you the weather changed. A causal gain tells you that you changed it. Your CFO already knows the difference. The only open question is whether your quality programme can show it.
Want to know what your coaching is actually worth? Tell us the behaviour you’re trying to move, and we’ll show you how we’d baseline it — coached agents against matched peers — so the number you take upstairs is a result, not a coincidence.