Building reliable large language model (LLM) inference is still an emerging discipline. Although the field has matured considerably in recent years, we are far from the level of dependability seen in industry-standard services such as Amazon S3. For now, anyone aiming to use the best model available must remain vendor-agnostic and recognise that reliability varies among providers.
Intercom’s customers depend on us for continuous availability. With our AI Agent, Fin, resolving up to 70% of our customers’ conversations, an outage can overwhelm their support teams. To emphasise how important Fin’s uptime is to us, we offer a Service Level Agreement (SLA) with a 99.8% monthly uptime target for Fin.
Achieving this requires robust systems and practices designed to deliver reliability atop imperfect foundations.
Sophisticated Routing Layer
At the heart of this reliability is a sophisticated LLM Routing layer that decides how an LLM request is handled. Each Fin request uses multiple models. We define all the “routes” for each model in a format that looks something like:
{
  CLAUDE_SONNET_4: {
    us_prod: {
      conversation_serving: LatencyBased(
        GoogleProvider.us_east5(),
        AnthropicProvider(),
        BedrockProvider.us_east_1(),
      ),
      otherwise: LoadBalanced(
        AnthropicProvider(), BedrockProvider.us_west_2()
      ),
    },
    test: Sequence(AnthropicProvider(), BedrockProvider.us_west_2()),
  }
}
These routes let the system know which vendors and regions are available for each model. This setup gives us flexibility in routing logic and failover strategies.
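As a rough illustration of how a definition like this might be consumed, here is a minimal Python sketch; the names (Provider, resolve_route, the strategy classes) are illustrative stand-ins rather than our actual framework:

from dataclasses import dataclass

# Illustrative stand-in: a vendor/region pair a request can be sent to.
@dataclass(frozen=True)
class Provider:
    vendor: str
    region: str

# Simplified routing strategies mirroring the configuration above.
class Sequence:
    def __init__(self, *providers):
        self.providers = providers

class LoadBalanced(Sequence):
    pass

class LatencyBased(Sequence):
    pass

# A trimmed-down version of the route table shown above.
ROUTES = {
    "CLAUDE_SONNET_4": {
        "us_prod": {
            "conversation_serving": LatencyBased(
                Provider("vertex", "us-east5"),
                Provider("anthropic", "default"),
                Provider("bedrock", "us-east-1"),
            ),
            "otherwise": LoadBalanced(
                Provider("anthropic", "default"),
                Provider("bedrock", "us-west-2"),
            ),
        },
        "test": Sequence(
            Provider("anthropic", "default"),
            Provider("bedrock", "us-west-2"),
        ),
    }
}

def resolve_route(model: str, environment: str, workload: str):
    """Pick the routing strategy for a request, falling back to 'otherwise'."""
    env_routes = ROUTES[model][environment]
    if isinstance(env_routes, Sequence):
        return env_routes  # environments like 'test' map straight to a strategy
    return env_routes.get(workload, env_routes["otherwise"])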
Key features of the routing system:
Cross-vendor failover
We maintain vendor redundancy for all models. For example, Anthropic models can be served by AWS Bedrock, GCP Vertex, or Anthropic’s own infrastructure; OpenAI models are served by either Azure or OpenAI itself. If a vendor experiences an outage or degraded performance, our system automatically shifts requests to another, maintaining both streaming and non-streaming operations.
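In spirit, cross-vendor failover looks something like the sketch below. It is a simplified illustration: the ProviderError type, the complete() call, and the timeout are assumptions, not our production interface.

class ProviderError(Exception):
    """Raised when a vendor returns an error or times out."""

def call_with_failover(providers, request, per_provider_timeout=10.0):
    """Try each vendor in order, moving on when one is down or degraded."""
    last_error = None
    for provider in providers:
        try:
            return provider.complete(request, timeout=per_provider_timeout)
        except ProviderError as error:
            last_error = error  # this vendor failed; the next one gets the request
    if last_error is None:
        raise ProviderError("no providers configured for this route")
    raise last_error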
Cross-model failover
Sometimes a new model we start using for Fin is only available from a single vendor. Or we might serve a model through two vendors, neither of which alone has enough capacity to handle all of Fin’s load; in that scenario, if one vendor had an outage, the other would not be able to serve all the requests successfully.
And, in the rarest of rare cases*, all the vendors that serve a particular model might have an outage at the same time.
For all these cases, we can also fail over across models. So if Sonnet 4 is unreachable on every vendor for any reason, we can send the request to a similarly capable GPT model. Likewise, we can partially fail over some of the requests to a different model if the available vendors don’t have enough capacity for it.
* Rare, but not impossible. There is often a degree of co-dependency between vendors, and an issue at the wrong layer can end up impacting services on multiple vendors at once.
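Cross-model fallback is then just one more layer around the vendor-level failover, roughly along these lines (the model identifiers and the call_with_failover helper from the earlier sketch are illustrative):

# Ordered preference: exhaust every vendor for the primary model first,
# then fall back to a similarly capable model from another family.
MODEL_FALLBACKS = {
    "CLAUDE_SONNET_4": ["GPT_EQUIVALENT"],  # hypothetical identifiers
}

def call_with_model_fallback(model, request, routes):
    """Try the requested model, then its fallbacks, across all their vendors."""
    last_error = None
    for candidate in [model] + MODEL_FALLBACKS.get(model, []):
        try:
            return call_with_failover(routes[candidate], request)
        except Exception as error:
            last_error = error  # every vendor for this model failed; try the next model
    raise last_error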
Latency Based Routing
We support different modes of routing. Sometimes, because of capacity or performance constraints, we might send only a particular percentage of requests to a given vendor.
The most interesting mode here is “Latency Based Routing”. Performance can fluctuate between vendors throughout the day. We monitor real-time response times and route more traffic to the fastest available vendor, taking into account each vendor’s capacity. In practice, choosing the fastest route can mean a difference of several seconds per request, which is critical for end-user experience.
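One common way to implement this is to keep a rolling latency estimate per vendor and bias traffic towards the fastest one while respecting each vendor’s capacity share. The sketch below uses an exponentially weighted moving average purely as an illustration; the real signal and weighting are more involved.

import random

class LatencyTracker:
    """Exponentially weighted moving average of response times per provider."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.ewma = {}  # provider -> smoothed latency in seconds

    def record(self, provider, latency_s):
        previous = self.ewma.get(provider, latency_s)
        self.ewma[provider] = (1 - self.alpha) * previous + self.alpha * latency_s

    def pick(self, providers, capacity_share):
        """Prefer the fastest provider, but respect each provider's capacity share."""
        ranked = sorted(providers, key=lambda p: self.ewma.get(p, float("inf")))
        for provider in ranked:
            # capacity_share[p] is the fraction of traffic p can absorb (assumed input)
            if random.random() < capacity_share.get(provider, 1.0):
                return provider
        return ranked[0]  # fall back to the fastest if every capacity check failed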
Capacity Isolation
A classic failure mode for systems that share the same underlying resources is that a less important use case (like an asynchronous task for translating conversations) ends up exhausting the shared resource and impacting more important ones. In our case, Fin is the most important use case we want to keep serving.
To achieve that, our routing framework lets us define separate pools of capacity for Fin versus everything else. Each pool has access only to certain vendors, or to specific regions within a vendor. This isolation prevents Fin from ever being impacted by a less important use case.
If Fin’s assigned pool is exhausted, Fin can draw from other capacity, but lower-priority uses are never allowed to encroach on Fin’s pool.
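Conceptually, the pools behave like the sketch below: each pool owns a set of routes, and only Fin may spill over into the other pool. The pool names, route strings, and pool_exhausted signal are all illustrative assumptions.

CAPACITY_POOLS = {
    # Each pool maps to the vendors/regions it is allowed to use (illustrative).
    "fin": ["anthropic:default", "bedrock:us-east-1", "vertex:us-east5"],
    "background_tasks": ["bedrock:us-west-2"],
}

def providers_for(use_case, pool_exhausted):
    """Return the routes a use case may try; only Fin may borrow extra capacity."""
    own_pool = CAPACITY_POOLS["fin" if use_case == "fin" else "background_tasks"]
    if use_case == "fin" and pool_exhausted("fin"):
        # Fin can draw from other capacity when its own pool is exhausted...
        return own_pool + CAPACITY_POOLS["background_tasks"]
    # ...but lower-priority use cases never encroach on Fin's pool.
    return own_pool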
Operational Safeguards and Monitoring
We have protections and processes in place to ensure the system does not diverge from its current reliable state and can withstand Fin’s exponential growth.
Single Point of Failure reporting
We actively track each model and its setup for redundancy and capacity isolation. If any model falls short, we generate a high-priority alert and resolve the issue immediately.
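A simplified version of such a check might look like this; the shape of the route table and the alert callback are assumptions for the sake of the example:

def audit_single_points_of_failure(routes, alert):
    """Flag any model that is one vendor outage away from being unservable."""
    for model, providers in routes.items():
        vendors = {provider.vendor for provider in providers}
        if len(vendors) < 2:
            # A single vendor outage would take this model down entirely.
            alert(priority="high",
                  message=f"{model} is served by a single vendor: {sorted(vendors)}")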
Noisy Neighbor Protection
Noisy Neighbor is a well-understood problem in multi-tenant systems. Our protections ensure a single customer or process cannot monopolise resources to the detriment of others.
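There are many ways to enforce this. One textbook mechanism is a per-tenant token bucket, sketched below purely to illustrate the idea rather than our actual implementation:

import time

class TokenBucket:
    """Simple per-tenant rate limiter: refill at a steady rate, spend per request."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # this tenant has used its share; shed or queue the request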
Load Testing
We regularly perform load testing to proactively ensure our systems can support growth in Fin’s usage and maintain performance under heavy demand.
Maintaining Buffer LLM Capacity through strong relationships
Reliable operation requires buffer capacity. Through strong relationships with major vendors like OpenAI, Anthropic, AWS, Google, and Azure, we maintain ample headroom, with the ability to handle two to three times Fin’s normal traffic at any point.
This buffer capacity is not easy to come by, and we have our account managers to thank for championing Intercom whenever we need extra capacity to run Fin reliably.
Observability
Intercom has a strong observability culture. We have written before about improving our observability posture and how we like to focus on the customer outcome. Instrumenting every LLM call, we collect data on token usage, response times, and system load across vendors. These insights drive capacity planning, vendor selection, and rapid troubleshooting.
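In practice, this usually amounts to wrapping every provider call in a thin instrumentation layer. The sketch below emits the kind of per-call metrics described above; the metric names and the statsd-style metrics client are placeholders, not our real pipeline:

import time
from contextlib import contextmanager

@contextmanager
def instrumented_llm_call(metrics, vendor, model):
    """Record latency and token usage around a single LLM request."""
    start = time.monotonic()
    usage = {"input_tokens": 0, "output_tokens": 0}
    try:
        yield usage  # the caller fills in token counts from the provider response
    finally:
        elapsed = time.monotonic() - start
        tags = {"vendor": vendor, "model": model}
        metrics.timing("llm.response_time", elapsed, tags=tags)
        metrics.count("llm.input_tokens", usage["input_tokens"], tags=tags)
        metrics.count("llm.output_tokens", usage["output_tokens"], tags=tags)

The caller would wrap each provider call in this context manager and copy the token counts from the provider’s response into usage before the block exits.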
All of these sophisticated systems and processes ensure that Fin runs reliably, and the whole is greater than the sum of its parts.
Future
We will continue investing to make sure we can run Fin reliably as demand for it grows. Here are some of the things we plan to take on next.
Request Prioritization
The current architecture protects Fin from other use cases exhausting the shared LLM resources, but much of the time Fin doesn’t need all the capacity we have provisioned for it. In those cases, the spare capacity could be shared with other use cases that otherwise run constrained. We can achieve this by assigning a priority to each LLM request and dropping lower-priority requests when LLM capacity is constrained. Such a solution would let Fin use all the capacity it needs without constraining the other use cases when there is no need to.
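One way to realise this is a priority-aware admission check in front of the shared capacity, roughly like the sketch below; the priority levels and shedding thresholds are illustrative assumptions:

from enum import IntEnum

class Priority(IntEnum):
    FIN = 0            # highest priority: customer-facing Fin requests
    INTERACTIVE = 1    # other user-facing features
    BACKGROUND = 2     # asynchronous tasks such as translation jobs

def admit(request_priority, current_utilisation):
    """Shed lower-priority requests first as shared LLM capacity fills up."""
    # Thresholds are illustrative: background work is shed first, Fin last.
    shed_thresholds = {
        Priority.FIN: 1.00,
        Priority.INTERACTIVE: 0.90,
        Priority.BACKGROUND: 0.75,
    }
    return current_utilisation < shed_thresholds[request_priority]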
Chaos Monkey
So far we haven’t needed this, as regular vendor outages keep our systems well tested. But as vendors grow more reliable and we move towards more automated failovers, short-lived outages can go undetected. That can give us a false sense of security, which makes it important to regularly test our systems by causing controlled outages and ensuring Fin won’t be impacted by a particular vendor or model going down.
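When we do take this on, the core mechanism is simple: deliberately fail a controlled slice of traffic to a chosen vendor and verify that failover absorbs it. A minimal sketch, with entirely hypothetical names and settings, and a built-in ConnectionError standing in for whatever error type the failover path handles:

import random

class ChaosInjector:
    """Fail a configurable fraction of calls to a target vendor, on purpose."""

    def __init__(self, target_vendor, failure_rate=0.05, enabled=False):
        self.target_vendor = target_vendor
        self.failure_rate = failure_rate
        self.enabled = enabled

    def maybe_fail(self, vendor):
        """Call this just before a real provider call to simulate an outage."""
        if (self.enabled and vendor == self.target_vendor
                and random.random() < self.failure_rate):
            raise ConnectionError(f"chaos: simulated outage for {vendor}")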
Exploring using an off-the-shelf proxy for routing
Fin’s reliability and elasticity are a competitive advantage for Intercom, so we are happy to lean towards “build” in the build vs buy decision. This was more straightforward when we started building our routing framework, as there weren’t many options available that provided the flexibility we needed.
But tools like LiteLLM and Bifrost have covered a lot of ground since then. While we are proud of what we have built, we don’t want to maintain a system if we don’t need to. We will still need to make sure we don’t introduce a new single point of failure by using these tools, but they look promising and are worth exploring.
Ultimately, the real measure of our work is that Fin stays available to solve problems for the people who depend on it. The work is ongoing, and reliability is never finished, but with these systems in place, we ensure Fin remains a reliable part of our customers’ workflows.