“Slow is smooth, smooth is fast.” -Navy SEALs saying
Nobody likes slow software. Despite Moore’s law and improving hardware, software doesn’t seem to be getting faster over time. Sometimes it feels like things are even slowing down due to increasing bloat. Software latency isn’t just a user annoyance; there’s also documented evidence of real business cost:
- Google found that a 0.5-second delay caused a 20% decrease in repeat traffic, a drop that persisted after the delay was removed1
- Amazon found that every additional 100ms of page load time cost 1% in revenue2
- Walmart.com found a very large change in conversion rate depending on page load times3
Does the same apply to AI agents for customer support? Is latency just annoying, or also harmful?
As LLMs inherently do so much computation – which takes time – and as there are so many different LLMs available, with different latency and quality tradeoffs, we needed to know.
The only way to answer that important question is through an AB test, as the natural latency variation is confounded by many factors (e.g. peak time has higher latency than early morning, but the user types and queries in those periods are different too).
At Intercom’s AI group, we run hundreds of AB tests per year. Almost all of them are to evaluate improvements to our AI Agent Fin, but we made an exception here: to understand the impact of latency on Fin, we ran an experiment where we increased latency in a small fraction of conversations.
We faced two difficulties:
- We cannot artificially decrease latency, only increase it: if there were any latency-reduction low-hanging fruit, it would have been picked already!
- We cannot harm the end user or customer experience: the experiment must be imperceptible to them
We managed to implement an experiment design where the changes were indeed imperceptible. The experiment also uncovered a non-obvious truth:
- As we expected, latency increases deflections
- Unexpectedly, latency also increases positive feedback, without increasing negative feedback or hurting CSAT
Those results were so surprising that we had to run confirmatory tests. Here is a summary of the two most relevant experiments:

Despite those results, Fin has since become faster, not slower: even though latency doesn’t seem to negatively affect business outcomes, we still want the experience to be fast. Nonetheless, these results are valuable, not least because they have changed the way we run AB tests at Intercom, to account for latency as an important confounder.
Before exploring the results and takeaways, let’s dive into the methodology first.
Experiments and methodology
Due to the nature of the experiment, and to make it blind, we had to design it so that it wouldn’t be noticeable to either users or customers:
We first measured the current latency percentiles and made increases that would be consistent with the already-existing latency. For example, at the time of the experiment, the P99 latency used to be 20 seconds over the median, so the variant that increased latency by 20 seconds was just 1% of the conversational volume.
In other words, as a user, you wouldn’t notice an increase in latency unless you were unlucky enough to be in the natural 99th percentile and also fall into the 20-second delay arm, a tiny 1-in-10,000 chance!
During the time of the experiment, we didn’t receive any latency complaints that could be attributed to the experiment, indicating that the methodology was a success in terms of hiding the experiment effects from users and customers.
We randomized the experiment at the conversation level4, with 1 control and 4 treatment arms roughly mimicking the natural latency variation (a sketch of the assignment logic follows the list):
- Control / no latency increase: 90% of conversations
- 5s delay: 4% of conversations
- 10s delay: 3% of conversations
- 15s delay: 2% of conversations
- 20s delay: 1% of conversations
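To make the design concrete, here is a minimal sketch of how conversation-level assignment with weighted arms and artificial delay injection could work. This is illustrative only, not our production code: the arm names, the hashing scheme, and the `generate_answer` callback are all assumptions.

```python
import hashlib
import time

# Hypothetical arm definitions mirroring the split described above:
# (name, share of conversations, added delay in seconds).
ARMS = [
    ("control", 0.90, 0),
    ("delay_5s", 0.04, 5),
    ("delay_10s", 0.03, 10),
    ("delay_15s", 0.02, 15),
    ("delay_20s", 0.01, 20),
]

def assign_arm(conversation_id: str) -> tuple[str, int]:
    """Deterministically map a conversation to an arm via a hash bucket."""
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 10_000
    threshold = 0
    for name, share, delay_s in ARMS:
        threshold += int(share * 10_000)
        if bucket < threshold:
            return name, delay_s
    return ARMS[0][0], ARMS[0][2]  # unreachable fallback to control

def answer_with_experiment(conversation_id: str, generate_answer) -> str:
    """Inject the arm's artificial delay before delivering the AI answer."""
    arm, delay_s = assign_arm(conversation_id)
    if delay_s:
        time.sleep(delay_s)  # artificial latency, rare enough to be imperceptible overall
    return generate_answer()
```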
As the results were so surprising, we ran two more confirmatory experiments:
- We targeted our previous-generation Resolution Bot, which is much faster than Fin as it doesn’t use AI/LLMs, thus addressing the limitation that we could not decrease latency
- One unrelated experiment was naturally decreasing latency by 2 seconds, so we added one more arm there to compensate for that latency decrease
All 3 experiments found similar results.
Metrics
When running AB tests, we monitor them through a dashboard that covers dozens of metrics. Here are the most important ones:
- Resolution rate: our key business metric (we charge customers by resolutions)
- Volume: to ensure the experiment has enough statistical power
- Sample ratio mismatch: to ensure assignment and randomization are working as expected (a sketch of this check appears below)
- Confirmed resolution rate: resolutions with positive feedback
- Positive / negative feedback rates: users reacting to “did that help?” with “yes” or “no” respectively
- CSAT score: customer satisfaction survey sent after the conversation is closed
- Latency percentiles: both time to first token and total duration
- Cost per conversation in $
- Error rate
- Interaction analysis between other concurrent experiments
Those metrics help us understand the impact of each experiment across many different dimensions. They allow us to judge whether the experiment was well designed and implemented, whether it has enough sample size, and what its impact is on business, user, and engineering metrics.
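As an example of how one of these health checks can be automated, here is a minimal sketch of a sample ratio mismatch test: a chi-squared test of the observed arm counts against the configured weights. The counts and the 0.001 threshold are made up for illustration.

```python
from scipy.stats import chisquare

# Hypothetical observed conversation counts per arm (control, 5s, 10s, 15s, 20s)
# and the configured assignment weights from the experiment design.
observed = [90_412, 3_955, 3_021, 1_988, 1_013]
weights = [0.90, 0.04, 0.03, 0.02, 0.01]
expected = [w * sum(observed) for w in weights]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible sample ratio mismatch (chi2={stat:.1f}, p={p_value:.2g})")
else:
    print("Arm split is consistent with the configured weights")
```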
We will expand on our experimental methodology in future posts.
Results
“Any figure that looks interesting or different is usually wrong” -Twyman’s law
Before we get to the actual results, here is what we expected to happen from first-principles:
- Deflections and assumed resolutions will go up due to more users bouncing off
- Confirmed resolutions, positive feedback and CSAT score will go down, due to more users getting annoyed and fewer getting to the feedback stage
Here is what we actually saw:
Overall resolution rate
The overall resolution rate increases by up to 2 percentage points (pp):

To understand what is going on, we need to break resolution down into its two components:
- Assumed resolutions: When users engage with the AI agent, receive an AI answer with sources, don’t leave positive or negative feedback, and don’t escalate to a human support agent
- Confirmed resolutions: Same as above, plus the user needs to leave positive feedback on the last AI answer (e.g. saying “that helped 👍”)
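For clarity, here is how the two components add up into the overall resolution rate, using made-up numbers; the exact denominator we use isn’t spelled out in this post, so treat the `conversations` figure purely as an assumption.

```python
# Hypothetical counts, for illustration only.
conversations = 10_000           # assumed denominator for the rates below
assumed_resolutions = 4_200      # engaged, got an AI answer, no feedback, no escalation
confirmed_resolutions = 1_800    # same, plus an explicit "that helped 👍"

resolution_rate = (assumed_resolutions + confirmed_resolutions) / conversations
confirmed_resolution_rate = confirmed_resolutions / conversations
print(f"overall: {resolution_rate:.1%}, confirmed: {confirmed_resolution_rate:.1%}")
```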
Assumed resolutions
As expected, assumed resolutions are the main driver of the increase: users drop off more while waiting, leading to more such resolutions when AI answers are delivered5.

Confirmed resolutions / positive feedback
Surprisingly, confirmed resolutions, which are resolutions with positive feedback, seem to actually increase by 0.4pp for higher latency increases. This was the most puzzling result – a Confirmed Resolution is the result of an end user explicitly clicking a button saying “that helped 👍”, and we use this as our north star. As such, this should only go up if the user was in fact more satisfied with the AI answers. It was this puzzling result that led to further analysis and experiments.

Negative feedback
Also surprisingly, negative feedback does not increase, and even seems to decrease at higher latencies. Note that negative feedback does not mean anger or dissatisfaction, only replying “No” to “did that help?”.

CSAT
Finally, we don’t see a clear impact on CSAT ratings, but we do see an increase in response rate at higher latencies, another puzzling result:

Takeaways
We saw expected results for deflections and assumed resolutions, but not for the feedback-associated metrics. Are users actually happier with increased latency?
We brainstormed potential explanations:
- The metrics are wrong, i.e. the results are just due to some data artefact, in the spirit of Twyman’s law
- There is a poorly understood psychological effect at play, not well documented by the literature
We ruled out the first hypothesis by double-checking the data, running confirmatory experiments, and conferring with our excellent AI infrastructure engineering team, owners of the AB test assignment logic.
Psychological explanation
“When you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth”. -Sherlock Holmes
Our best hypothesis is a psychological explanation: The longer users wait, the more human or effortful the reply becomes in their minds, leading to improved answer perception. We have some corroborating evidence for that hypothesis:
- An experiment which provided detailed status on Fin “thinking” slightly decreased hard resolutions, showing that user perception of how Fin works impacts their feedback. In this case, by making Fin more transparent, we slightly hurt the positive feedback rate.
- We have received complaints that Fin over email is “too fast”, as it seems suspicious to get an almost instant response for email queries.
- An independent AI engineer formed similar conclusions here: Improved chatbot customer experience: sleep() is all you need.
- An independent designer also shares similar conclusions: Why Users Love ‘Thinking’ Chatbots: The Use of Delays in Conversational AI:
The Labor Illusion, as discussed in a previous post, is a term coined by Harvard researchers. This behavioural economic concept posits that users tend to appreciate a service more when they perceive that effort is being expended on their behalf. In the context of chatbots, delays in response times simulate the effort of a human agent ‘thinking’ through the customer’s query, creating an illusion of diligence and attentiveness.
If true, a question we might naturally ask is: how else can we make an AI agent more humanlike whilst remaining within legal6 and moral limits? That has led to other interesting experiment ideas (a sketch of the last one follows the list):
- Changing “thinking…” into a typing indicator
- Removing unnecessary references to “AI agent” in the UI
- Adding an acknowledgement message (“let me look into Fin pricing for you 🔎”) while Fin searches for an answer
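As an illustration of that last idea, here is a minimal sketch of sending an acknowledgement message before the answer is generated. All of the helpers (`send_message`, `search_knowledge_base`, `stream_answer`) are hypothetical stand-ins, not Fin’s actual internals.

```python
import asyncio

# Hypothetical stand-ins for the real messaging and retrieval layers.
async def send_message(conversation_id: str, text: str) -> None:
    print(f"[{conversation_id}] {text}")

async def search_knowledge_base(query: str) -> list[str]:
    await asyncio.sleep(2)  # simulated retrieval latency
    return ["pricing-doc"]

async def stream_answer(conversation_id: str, query: str, sources: list[str]) -> None:
    print(f"[{conversation_id}] Answer to {query!r} based on {sources}")

async def answer_query(conversation_id: str, query: str, topic: str) -> None:
    # Acknowledge immediately so the wait reads as visible effort, then answer.
    await send_message(conversation_id, f"Let me look into {topic} for you 🔎")
    sources = await search_knowledge_base(query)
    await stream_answer(conversation_id, query, sources)

asyncio.run(answer_query("conv-123", "How much does Fin cost?", "Fin pricing"))
```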
While some of those experiments are still ongoing (more on them in future posts), all have promising results, so we are even more confident that the psychological hypothesis is the correct one.
Latency as a confounder
Another takeaway is that latency is a confounder for any other experiment that accidentally increases or decreases latency.
In other words, if an experiment testing something else happens to increase latency by 10 seconds, it might show positive benefits solely due to latency. Conversely, an experiment that reduces latency might decrease resolution rate solely due to latency reduction.
That means latency needs to be corrected for. For experiments with increased latency, we should remove the resolution rate gains shown above. For experiments with decreased latency, we need to add another variant with latency corrections, resulting in an ABC test (sketched after the list):
- A: control arm
- B: treatment arm
- C: treatment arm with same latency as control arm
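A minimal sketch of what the latency-corrected arm C might look like: the treatment’s median latency saving is measured separately and added back as an artificial delay, so that A vs C isolates the treatment’s non-latency effect. The 2-second figure and the `generate_answer` callback are assumptions for illustration.

```python
import time

MEASURED_SPEEDUP_S = 2.0  # hypothetical median latency saved by the treatment

def run_variant(arm: str, generate_answer):
    """ABC design: C gets the treatment plus a delay to match control latency."""
    if arm == "A":                        # control
        return generate_answer(treatment=False)
    if arm == "B":                        # treatment, naturally faster
        return generate_answer(treatment=True)
    if arm == "C":                        # treatment with latency matched to control
        time.sleep(MEASURED_SPEEDUP_S)
        return generate_answer(treatment=True)
    raise ValueError(f"unknown arm: {arm}")
```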
This effect also explains some historical results. For example, our biggest ever Fin latency reduction required multiple iterations to avoid harming Fin metrics. At first, we thought that was due to the quality of the original prompt, but now we realise some of it was just down to latency. In other words, we underestimated how good the change was: the only reason the resolution rate didn’t go up was the latency reduction itself.
Fin latency today
Despite those results, since we ran that experiment in late 2024, we have made good progress in reducing Fin’s latency, from changing the way we serve Claude Sonnet models to a new, more streamlined architecture where we rely on eager execution to save precious seconds.
As of March 14, 2025, Fin is at its fastest ever: The median time to first token is now 7 seconds, a 60% reduction from its peak.
Why are we making Fin faster, given the experiment findings? While we’re confident in the findings, we know latency effects go beyond just end-user psychology. We must also serve our customers, who are the ones actually paying for Fin’s resolutions. While they might be glad of a marginal resolution rate increase, they’d much prefer their users to have the best possible experience.
Fin is now the first-line of support for many of our customers, which range from Anthropic to Clay to Culture Amp, and is responsible for over half of all their resolutions on average, meaning Fin is an integral part of their customer support operations.
Through that lens, it’s unacceptable to have a subpar experience even if it doesn’t harm end-user metrics. We go back to the original opener of this post: nobody likes slow software. We want Fin to be fast and to be perceived as such, as we want Fin to be the best AI agent in all dimensions.
You should expect even more latency improvements to Fin in 2025.
- Marissa Mayer at Web 2.0 ↩︎
- Amazon study: Every 100ms in Added Page Load Time Cost 1% in Revenue ↩︎
- Walmart slides ↩︎
- We typically randomize AB tests at the conversation level instead of the more typical user level. There are pros and cons to each alternative, but here the decision was clear: we didn’t want to harm the end user experience, and randomizing by conversation gives the end user the opportunity to get support at natural latencies most of the time. ↩︎
- AI answer delivered does not mean it was read or seen by the end user! ↩︎
- California’s SB-1001, the Bolstering Online Transparency Act, prohibits bots from communicating online to mislead about their artificial identity for commercial transactions or influencing votes, requiring clear disclosure to California consumers, with enforcement by the state attorney general and fines up to $2,500 per violation. ↩︎