The single biggest transformation we’re seeing in Customer Service is the transition from Support Leader to AI Support Leader. There’s a lot going on in this. It’s not as easy as updating your LinkedIn title and asking for a pay raise (though be my guest): there are new roles in your org (e.g. who designs AI workflows, who builds your Fin Tasks, etc.), and new priorities (your help docs are suddenly mission-critical infrastructure).
As AI Happens™️ to every industry, we see the same story. It starts with augmenting your team with Copilots while everyone tells themselves nothing will really change (aka the Cope-pilot stage).
It quickly upgrades to Agents that actually do the work, and once that starts people have lots of questions, questions like…
- How do I measure if the work is getting done? (that’s what this post is about)
- How quickly will this happen? (this post on timelines)
- Shouldn’t we build our own Agent? (sure thing, you’ll be back though)
- If >50% of the work is now Agents, what are my new metrics? (this post on metrics)
- What’s my org chart look like in a world where 1 Agent is doing half the work? (post to follow)
- What are the new roles & responsibilities here? (to follow)
- and a whole heap of other questions yet to emerge (let us know what you want answered)
We have so much to say about this transformation, the new org charts, the new roles, and we’ll get to it all.
For now though there are two things everyone needs to understand, specifically about evaluating different agents based on AI Resolutions, so let’s start there:
Note: Resolution Rate means "Of the questions the AI Agent is involved in, what percentage does it resolve". The points I'm making in this post are about comparing agents against each other given the same setup.
1 – Every percentage in resolution really matters
If one agent costs 99 cents a resolution and claims to do 72% of your volume (might sound familiar) and another can do 35% but is absolutely ✨free✨, you might auto-conclude “we’ll go with the free one“. And who’d blame you? No need to talk to your CFO, no need to load up your procurement tool, just pull the trigger, right?
The maths you’re intuitively doing is something like this:
My Conversation Volume × 72% × $0.99 = $$$
versus
My Conversation Volume × 35% × $0.00 = $0
It’s easy to pick the winner, but you’re playing the wrong game.
The right game is to talk about Total Cost Of Running Support. You need to ask yourself “What’s happening to the queries that the AI doesn’t handle?“
The answer is humans, humans are happening to them. This means you’re asking your support team to shoulder the burden of these queries when in reality there’s loads more valuable work for them to do, and you’re still growing your team’s size as your business (and thus support queries) scales.
The fully loaded cost of a human handling a support ticket is roughly their fully loaded salary, plus all associated costs, divided by the number of tickets they handle. So for easy maths let’s say your support person costs $4k a month and handles 25 tickets a day; with 20 working days in a month that’s $8 per ticket. Obviously this ignores a load more complexity (have you heard of payroll taxes, health insurance, benefits, equity, laptops, per-seat software licences for Slack, Zoom, Zendesk, G-Suite, management overhead?), and that’s before we get into office space etc. Let’s be conservative and say the $8/ticket turns out to be more like $10/ticket.
Okay, so at $10 per human support resolution, and a choice between 99c and 0c for AI resolutions, how does the maths play out if you have, say, 100,000 conversations per year? The answer is like this…
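Here’s a back-of-the-envelope version of that maths in code, using the numbers above (a sketch to show the shape of the comparison, nothing more):

```python
conversations = 100_000          # support conversations per year
human_cost_per_ticket = 10.00    # fully loaded, from the maths above

# Option A: paid agent at $0.99/resolution, resolving 72% of volume
paid_ai = conversations * 0.72 * 0.99                              # $71,280
paid_human = conversations * (1 - 0.72) * human_cost_per_ticket    # $280,000
print(f"Paid agent total: ${paid_ai + paid_human:,.0f}")           # $351,280

# Option B: "free" agent resolving 35% of volume
free_human = conversations * (1 - 0.35) * human_cost_per_ticket    # $650,000
print(f"Free agent total: ${free_human:,.0f}")                     # $650,000
```

The “free“ agent costs you roughly $300k a year more, because every query it can’t resolve lands on a $10 human ticket.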

The job of an AI Support Leader is to minimise human handover, not to minimise dollars spent on AI resolutions.
2 – The hardest percentages matter the most
Most queries to your team are not the basic “how do I reset my password” ones. They’re messy ones that require information from multiple sources to answer. As we discussed in Good Bot / Bad Bot, it’s important your agent can do the hard stuff.
But your AI Agent also needs to move past informational queries (e.g. ones answered through text alone) into personalized queries (unique to the user) and into action-based queries (e.g. perform an action in another system).

These are often a smaller percentage of the total volume, but they’re a larger percentage of the total time spent by your team. This is why we built Fin Tasks: we know that to really deliver on the promise of AI Support, we have to complete the messy, harder queries end to end, to leave no crumbs, as the kids say. (I know, I can’t believe I wrote that either)
The right way to think about these less frequent but more painful queries is frequency × handling time. Which looks more like this…
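As a sketch, with illustrative numbers of my own (not real data), the maths looks like this:

```python
# (query type, monthly frequency, avg handling time in minutes), all illustrative
queries = [
    ("password reset",           1_000,  2),   # very frequent, very quick
    ("update billing details",     150, 12),   # less frequent, fiddly
    ("change name on account",      80, 25),   # rare, slow, messy
]

for name, freq, minutes in queries:
    print(f"{name:>24}: {freq * minutes / 60:5.1f} hours/month")

# password reset:           33.3 hours/month
# update billing details:   30.0 hours/month
# change name on account:   33.3 hours/month
```

By raw frequency the messy queries barely register, but by frequency × handling time they each eat as much of your team’s month as the “easy“ high-volume one.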

The job of an AI Support Leader is to minimise the time wasted on repetitive actions, not just automate extremely frequent questions.
Ultimately the first step when moving to AI CS is picking the best agent for your business and putting it live. The second step is ensuring that you’re minimising handovers and minimising repetitive schlep work. If you get those two done, you’re ahead of the majority of your competitors, but you’re not done, not by a long shot… we have so much more to show you. Stay tuned.
- When you’re optimising your agent, once you’ve picked one, your actual goal is just “total resolutions” (in the same way that when you’re evaluating a website design you care about conversion rate, but once it’s live, you care more about total conversions)

Reality: The vast majority of the chatbots on the web are total dogshit. That’s probably an insult to dogs, and maybe even to their shits, tbh.
Brands somehow mindlessly throw a half-trained ugly monstrosity on their site, dust their hands off, and claim to have Joined The AI Revolution™️.
What’s going on? What is causing so many otherwise great companies to totally drop their guard and ship stuff that’s just awful for their customers? Especially now, given that AI Agents can be actually good. We see Claude and its rivals handle messy complex queries all the time, so why are these CS Agents still so derpy?
The answer is simple but sad.
Companies often just don’t know the difference. They don’t know a good bot from a bad bot. When they’re buying one, they rely on extremely simple evaluations: “I asked it how to reset my password, and it got it mostly right…”. That’s true, but it’s not a sufficiently hard test. If Einstein and I both sit my daughter’s first-grade arithmetic exam, you will be shocked at how close I am to Albert Einstein. Shocked!
This is one of our failings in the customer service industry: we need to help people evaluate an agent so they can really see the difference. At Intercom, once a customer deploys Fin versus anything else, they get it pretty quickly (even when we let them down), but it’s hard to see these differences a priori.
So here, my dear reader (and also my LTV:CAC-positive prospects nurtured by Marketo, you are dear to me too), let me start the ball rolling by explaining a few of the differences…
A good bot easily answers hard questions, a bad bot barfs on them
To the Einstein point above, you won’t know a good one from a bad one until you ask it a hard question. But “what is a hard question” is, actually, a hard question itself (how meta). Here’s how I see it…
A simple question usually has a 1:1 mapping with some help sentence easily found in an article, and it’s usually common, e.g. asking a project management app “how do I start a project”.
A “hard question” needs lots of information from lots of sources: the user’s current state, the screen they’re on, information from multiple internal and external docs, the previous answers given by support reps, and more.
e.g. “Why can’t I see the SSO button on the mobile app?” could be because SSO isn’t enabled, or because the user isn’t on the enterprise plan, or because you can’t access SSO while on trial, or maybe it’s a recent bug, or maybe the user is wrong and they just need to scroll-the-fuck-down and the support team are sick of explaining that thankyouverymuch, and it can even be two or more of these things together. Reality is messy. Simple bots can’t handle this.
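To make that concrete, here’s a minimal sketch of the SSO example in code. Everything here (the field names, the causes, the checks) is my own illustration of the shape of the problem, not any real product’s logic:

```python
from dataclasses import dataclass

@dataclass
class Context:
    """The many sources a 'hard' question touches: account, plan, platform, bugs."""
    sso_enabled: bool
    plan: str
    on_trial: bool
    known_buggy_build: bool

def diagnose_missing_sso_button(ctx: Context) -> list[str]:
    """Return every plausible cause; two or more can apply at once."""
    causes = []
    if not ctx.sso_enabled:
        causes.append("SSO isn't enabled for this workspace")
    if ctx.plan != "enterprise":
        causes.append("SSO requires the enterprise plan")
    if ctx.on_trial:
        causes.append("SSO isn't available while on trial")
    if ctx.known_buggy_build:
        causes.append("known bug in this mobile build")
    return causes or ["the button is there; try scrolling down"]

print(diagnose_missing_sso_button(Context(False, "starter", True, False)))
# ["SSO isn't enabled for this workspace", "SSO requires the enterprise plan",
#  "SSO isn't available while on trial"]
```

A bot that can only do 1:1 article lookup has no way to even ask for this context, let alone reason over it.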
A good bot lets your users express themselves quickly, a bad bot is just buttons all the way down…

In tech we’ve gone from the command line (efficient & arcane), to GUI (easy but slow), to Superhuman-style Command-K (efficient & more discoverable), and now with AI Agents we’re on the brink of actual “text UI” (aka AI UI). Just type (or say) the thing you want! Each of these has its merits, yet somehow bad bots still default to the worst of all cases. Good bots capture all the context and have no amnesia.
A good bot ‘does the thing’, a bad bot ‘tells you where to go to find out how to do the thing’

This is self-explanatory, but if you’re looking to delight your customers then you do the thing they want. Good bots take actions and follow processes; bad bots hand all that complexity back to the user. When a bot actually solves your problems, you’re way more inclined to use it again & again.
A lot of our competitors like to make wild proclamations, e.g. “we do 97% of your volume”, “we do 99%”, “yeah, well we do 107%, we actually ask your customers questions” (that sounds like a silly idea; it’s not, more on that another day…). Anyone who has worked in support knows these numbers are often bullshit. In the previous era of bad bots, what it meant was “yes, we can tell your users how to reset their passwords, but everyone else we just kinda frustrate & deflect“.
A good bot knows how and when to escalate to humans, a bad bot is ‘always or never’, or worse, a “bot jail”
As the now-popular blog post says, reality has a surprising amount of detail. This is what makes 100% extremely unlikely. Here’s one little example: last week I jumped on a call with a Fin customer looking to use our actions feature to handle all their “change name on utility bill” type queries. My confidence lasted all of 3 minutes, when I listened to phone call #2, which began with… “Hey, so, I’ve recently gone through a divorce, the bill is in both our names, but it’s his credit and he moved out so I now need to separate the bill, and change the credit card” (details changed, but you get the idea). While Fin can definitely handle a simple or messy name change and even a multi-party request, Fin isn’t touching that one. Nor should it. Fin hands that one over. There are times when humans need humanity. The support rep did an amazing, empathetic job of it, credit to them.

Humanity aside, it’s also the case that sometimes, no matter how clever the agent, some human approval step is always needed, which is why you’re best off designing for that scenario too.

To paraphrase Captain Barbossa (yes, an obscure one, I know): “Ya best start believing in human handover, because you’re living in it”
A good bot follows your unique policies and guidance, a bad bot thinks you’re identical to every other company

How, when, and where you hand over; how you speak about customers (are they guests, patrons, passengers, investors?); what words you should never say, or always say; etc. are all unique to your company, your brand, your business. You need the ability to control it all. A bad bot is designed for some extremely abstract “business<->customer” relationship and/or is editable only by forward deployed engineers, meaning every time you update a policy or product name you’re waiting for someone else to tinker with your black-box bot product.
Remember your bot is supposed to work for you, not the other way around. If you can’t control it, it’s not your product.
A good bot speaks in your tone of voice, a bad bot is always ‘California chirpy!’
At Intercom we have banks (the old school kind), fin tech companies, surf shops, weed shops, law firms, security companies, and even a funeral arrangement service. As you might guess, they speak to their customers in… different tones.
Even the little phatic responses are important here. E.g. it’s very tempting in LLM Latency Land to hack in immediate replies like “Sounds good”, or “Awesome, I’ll start looking into that, hope your day is going well”, but you’ll seriously upset someone reporting bad news, or you’ll ruin the chill vibes of a weed shop. Reality is messy. You want control here.
A good bot is multi-modal, a bad bot can’t see your customer’s screenshots or photos

Sometimes a picture will literally paint a thousand words, especially in technical situations where one screenshot explains everything. Most bots can’t handle these scenarios and just file a ticket somewhere on the back end; good bots reply with a description of the problem and the steps to resolve it. This might seem like a nice-to-have, but we’ve seen our multi-modality used to do things like appraise a damaged delivery, walk through how to reset a router, and debug an error message in the product. Each case is a delightful experience; customers often don’t believe it’ll work. It works.
A good bot workflow is easy to build, a bad one is an IKEA manual directed by Christopher Nolan

Every bot needs workflows to handle specific scenarios, but managing them can get unwieldy. Generative AI simplifies all the if-this-then-that logic and lets you program your agent using English. E.g. you can give Fin a task saying “we need to gather the user’s order_id, email, and phone number” and Fin will do all the smart things (e.g. if the user has already offered them it won’t re-ask, it’ll only ping for what it needs, and it’ll progress as soon as it has it). You can see an example here in the tasks section, and a little sketch of the idea below.
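Here’s that sketch: the core of the “only ask for what you still need” behaviour in a dozen lines (my own illustration, not Fin’s internals):

```python
REQUIRED = ["order_id", "email", "phone_number"]

def fields_still_needed(collected: dict) -> list[str]:
    """Return only the fields the user hasn't already provided."""
    return [field for field in REQUIRED if not collected.get(field)]

# The user opened with "Hi, my order is #4821", so order_id is already known
# and the agent should only ask for the other two.
collected = {"order_id": "#4821"}
print(fields_still_needed(collected))   # ['email', 'phone_number']
```

The English instruction stays one sentence; the branching (already provided? partially provided? provided mid-conversation?) becomes the agent’s problem, not yours.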
In old-school bot land that’s good ol’ boxes & arrows territory, just for one single step; add enough of them and the Inception music starts playing… BWARRRRMMMMM
A good bot sees opportunities for upsell or other ways to be helpful, a bad bot craps out without any follow-up
The traditional bot experience ends abruptly, using internal lingo like “I am marking this conversation as closed”, and often just disappears or greys out all its buttons. A good bot will still be available and will identify opportunities for the business, for example: “if the customer is happy, and is still on a free plan, see if they’d be interested in trialling a paid plan”. This is perfectly logical business behaviour, but not possible unless you’re using good software.
A good bot is something you can update as your business scales and matures, a bad bot has you going back to the vendor for every single tweak
Sadly most AI CS products are a black box of “magic” that you can’t control, interrogate, look into, or learn about. The answer to so many questions will be “talk to your forward deployed engineer”, and that gets frustrating. When you’re trying to improve your support, shipping is your heartbeat; if every tweak is a roundtrip through an engineer, you’re just gonna stop iterating and start accepting weakness. Soon enough your bot is out of date and pissing people off.
In summary
If you’re hiring an AI Agent to join your support team, think about it like this:
A good bot should shoulder 50+% of the work, should do it at roughly equal CSAT to your team, and should help your customers, which will help your agents, and in turn help your business. In short, it should do a lot of work to a high standard in a way that delights your customers. This means it…
- Answers hard questions, and actually completes tasks for you end to end without your intervention
- Takes care to understand your customers, their context, their messages, their images, and doesn’t make them repeat themselves for no reason
- Is easy for you to build, control, guide, and maintain without ever talking to the developers; it’s your agent, not on loan to you.
- Ultimately makes your support better, and faster for you and your customers.
You’ve probably heard a lot of the threadBois talk about the word “agentic”; ultimately this is a lot of what they’re talking about: can it do the job reliably without babysitting?
A bad bot is nearly the total opposite: it might do a lot of work, potentially more than you’d want it to, annoying a lot of customers in a way that’s a massive cost to your CS team and leaves a lot of cleanup for your team and your brand. Caveat emptor.
I hope I’ve helped you spot some of the main differences. It’s a wild place out there, lots of bad bots and a few good ones. Take care!
“It’s only prompt-engineering if it comes from the o1 region of Cerebral Valley, everything else is just sparkling specificity”
For as long as we’ve had prompts, we’ve been told that prompt engineering will go away in the future. It’s one of those statements that I believe to be true, but also one that leads everyone to lots of false predictions. “Soon we’ll all be able to make apps with just a few sentences,” we’re told. I mean, sure. Back when I was a consultant I used to remind clients: “Listen, if we remove quality as a prerequisite you’ll be shocked at what we can do quickly and cheaply.“
The thing is that reality has a surprising amount of detail. Pick any app category and you’ll find there’s a shocking number of slightly different apps. Take something as banal as fasting, i.e. the act of “not eating”: there are hundreds of “fasting apps”, and these are literally just apps that you run while you’re not eating. (Granted, you screenshot them for the ’gram, but you get the point.)

Similarly, and more to the point, there are hundreds of to-do apps in the App Store today.

To-do is perhaps one of the domains that’s both easiest to describe and most over-fished, but still every month another entrepreneur throws their hat in the ring. (I’m often told that productivity apps are where good founders go to fail the second time around.)
Every one of these apps has an angle, the founder believes their workflow or habit should be encoded, specifically, as it’s what makes them distinctly productive. Some apps have due dates, some don’t. Some have binary states, some custom. Some are visual lists, some are Trello boards. Some nag you every day, some never bother you. Some have daily limits, and categories and tags, and some scream “No! simplicity is what’s most important”. There is a lot of depth to a surprisingly simple thing. This is all not to mention more aesthetic/superficial things like brand and UI, which can’t be discounted (we could do with some original app icons though).
So why am I saying all this? Well, it comes down to this…

A lot of people confuse what they’ve heard of as “prompt engineering” with actually making important decisions about what you’re trying to do. Information Theory 101 says that if you want specifics in the outputs, you need them in the inputs.
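If you want the formal version of that intuition, it’s the data processing inequality: when your output Z is computed only from your prompt Y (a Markov chain X → Y → Z, where X is your actual intent), then

$$I(X; Z) \le I(X; Y)$$

i.e. no amount of clever downstream processing can add information about your intent that the prompt didn’t carry in the first place.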
So while prompting tricks (e.g. “Reflect on the 5–7 things you got wrong and think about how you’d make it right”, which, let’s be honest, sounds exactly like my Saturday morning after a big night out; maybe we really are all token-completers) will no doubt fade away, we will still need to know how to speak clearly and detail all our opinions and tastes. And not to be all Rick Rubin, but the premium on taste will definitely go up. In a world where everyone can get the first 70% of their app built in 20 seconds, the last 30% really, really matters, so we’ll need to get really good at detailing the specifics.
“Why can’t I just compile my pseudo code?”
Sidenote: This whole debacle has been an interesting flashback to my life as a university lecturer, where (aside from recursion + pointers) the most common question you’d get from students was “Why can’t I just compile my pseudo code? Why do I need all this public static void main String args nonsense?”
Messy language choices aside, the answer, at an abstract level, is exactly the same. The amount of information extracted from a system is limited by the amount of information fed into it, and abstracting away choices just limits the output range, which ultimately limits your ability to program. The syntax forces the specificity: Print? Print where? Ah, to System.out. Did you want this on a line by itself? Okay, so it’s System.out.println. Well, why didn’t you say? End of sidenote, and thanks for staying with me 🙂
Motivating Specificity with UI
When we were building Fin’s Guidance feature, one thing we realised was that we needed to (ahem) “guide” our users into saying useful things. Here’s the interface we ended up building. You’ll be shocked to hear that a lot of what we’re doing here is just forcing our users to be clear about how they want their bot to behave, by asking questions.
I suspect we’ll see that a lot in the near term. It’s not prompt-engineering, it’s about getting specificity from users who don’t read Hacker News and didn’t find the “walk me through it step by step” paper all that exciting.

We had Guidance in beta for a hot second when I read o1 Skill Issue, a great article overall (on a similar theme), and found this visual

I suspect it won’t be long until the “text-to-app” products realise, as we did, that the best way to get specificity is to actually ask for it. What will that look like? Will it be a new type of PRD? An output from a next-gen product like ChatPRD?

It’s hard to say what this new type of product specification UI will look like.
What I can say for sure is the following: adding friction to these products (in the form of asking for more than one sentence) will definitely reduce conversions, and ultimately mean fewer apps get created, but it will result in better apps, far more likely to actually be unique, distinct, and what the user actually wanted to create. Which matters more, right? RIGHT? We’ll find out soon enough.