{"id":416,"date":"2025-08-08T16:51:00","date_gmt":"2025-08-08T16:51:00","guid":{"rendered":"https:\/\/fin.ai\/research\/?p=416"},"modified":"2025-08-08T16:51:00","modified_gmt":"2025-08-08T16:51:00","slug":"fin-running-a-reliable-service-over-unreliable-parts","status":"publish","type":"post","link":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/","title":{"rendered":"Fin: Running a Reliable Service over Unreliable Parts"},"content":{"rendered":"\n<p>Building reliable large language model (LLM) inference is still an emerging discipline. Although the field has matured considerably in recent years, we are far from the level of dependability seen in industry-standard services such as Amazon S3. For now, anyone aiming to use the best model available must remain vendor-agnostic and recognise that reliability varies among providers.<\/p>\n\n\n\n<p>Intercom\u2019s customers depend on us for continuous availability. With our AI Agent, Fin, resolving up to 70% of our customers\u2019 conversations, an outage can overwhelm their support teams. To emphasise how important Fin\u2019s uptime is to us, <a href=\"https:\/\/www.intercom.com\/legal\/service-level-agreement\">we promise a Service Level Agreement (SLA)<\/a> with a monthly uptime target of 99.8% for Fin.<\/p>\n\n\n\n<p>Achieving this requires robust systems and practices designed to deliver reliability atop imperfect foundations.<\/p>\n\n\n\n<h2 id=\"sophisticated-routing-layer\" class=\"wp-block-heading\">Sophisticated Routing Layer<\/h2>\n\n\n\n<p>At the heart of this reliability is a sophisticated LLM Routing layer that decides how an LLM request is handled. Each Fin request uses multiple models. 
We define all the \u201croutes\u201d for each model in a format that looks something like:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">{\n    CLAUDE_SONNET_4: {\n        us_prod: {\n            conversation_serving: LatencyBased(\n                GoogleProvider.us_east5(),\n                AnthropicProvider(),\n                BedrockProvider.us_east_1(),\n            ),\n            otherwise: LoadBalanced(\n                AnthropicProvider(), BedrockProvider.us_west_2()\n            ),\n        },\n        test: Sequence(AnthropicProvider(), BedrockProvider.us_west_2()),\n    }\n}<\/code><\/pre>\n\n\n\n<p>These routes tell the system which vendors and regions are available for each model. This setup enables flexibility in routing logic and failover strategies.<\/p>\n\n\n\n<p>Key features of the routing system:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Cross-vendor failover<\/h3>\n\n\n\n<p>We maintain vendor redundancy for all models. For example, Anthropic models can be served by AWS Bedrock, GCP Vertex, or Anthropic\u2019s own infrastructure; OpenAI models are served by either Azure or OpenAI itself. If a vendor experiences an outage or degraded performance, our system automatically shifts requests to another, maintaining both streaming and non-streaming operations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Cross-model failover<\/h3>\n\n\n\n<p>Sometimes a new model we start using for Fin is only available from one vendor. Or we might serve a model through two vendors, neither of which alone has enough capacity to handle all of Fin\u2019s load. In this scenario, if one vendor has an outage, the other cannot serve all the requests successfully.<\/p>\n\n\n\n<p>And, in the rarest of rare cases*, all the vendors that serve a particular model might have an outage at the same time.<\/p>\n\n\n\n<p>For all these cases, we can also fail over across models. 
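<\/p>\n\n\n\n<p>In the same route format, a cross-model fallback could be sketched like this (the Fallback and ModelFallback constructs and the GPT_MODEL name are illustrative, not our exact configuration):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">{\n    CLAUDE_SONNET_4: {\n        us_prod: {\n            conversation_serving: Fallback(\n                LatencyBased(\n                    GoogleProvider.us_east5(),\n                    AnthropicProvider(),\n                ),\n                # hypothetical: try a similarly capable model\n                # when every Sonnet vendor is down\n                ModelFallback(GPT_MODEL),\n            ),\n        },\n    }\n}<\/code><\/pre>\n\n\n\n<p>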
So if Sonnet 4 is unreachable on all vendors for any reason, we can send the request to a similarly capable GPT model. We can also partially fail over some requests to a different model when the available vendors don\u2019t have enough capacity.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdNgnzB1UBgypyB0lJP_QJs72ixrkTuHmY_OzKmFnjlVxoxEFaJL-nQGBWI26MRSXLbdd1EYW27RbiHgWFTB24CoLCdWXiIgzqxcp39V4kjmOOE2TN6rcXsl1UkMedzJRrPbdf_?key=hnhyEA4oWmPZs8PbYgF6Uw\" alt=\"\"\/><figcaption class=\"wp-element-caption\"><em>We moved traffic from OpenAI to Anthropic models during an <\/em><a href=\"https:\/\/status.openai.com\/incidents\/01JXCAW3K3JAE0EP56AEZ7CBG3\/write-up\"><em>OpenAI Outage on 10th June 2025<\/em><\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"asterisk\">* Rare, but not impossible. Vendors often share underlying dependencies, and an issue at the wrong layer can end up impacting services on multiple vendors at once.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Latency Based Routing<\/h3>\n\n\n\n<p>We support different routing modes. Sometimes, because of capacity or performance constraints, we send only a certain percentage of requests to a vendor.<\/p>\n\n\n\n<p>The most interesting mode is \u201cLatency Based Routing\u201d. Performance can fluctuate between vendors throughout the day. We monitor real-time response times and route more traffic to the fastest available vendor, taking into account each vendor\u2019s capacity. 
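<\/p>\n\n\n\n<p>As a simplified sketch (not our production implementation), the routing might track an exponentially weighted moving average of each vendor\u2019s latency and send proportionally more traffic to faster vendors, capped by their capacity; latency_ewma, capacity_share, and weighted_random_choice are illustrative names:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">def pick_provider(providers):\n    # favour providers with lower recent latency (EWMA),\n    # but never exceed a provider's configured capacity share\n    weights = {\n        p: min(1.0 \/ p.latency_ewma, p.capacity_share)\n        for p in providers\n    }\n    return weighted_random_choice(weights)<\/code><\/pre>\n\n\n\n<p>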
In practice, choosing the fastest route can mean a difference of several seconds per request, which is critical for end-user experience.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXe_15-7rrE3Bv21Jva6vnfnRXOxWK70ocgGX2EQQPngK1EfoG_KzwMIosEuVZSDRZGRtq6c0IeC9SmSsT7w9YY7hXv1d9BFbaJdBjvF5E7CyzSaKkuKa0B3Q4bq29ErBTCgnQmviQ?key=hnhyEA4oWmPZs8PbYgF6Uw\" alt=\"\"\/><figcaption class=\"wp-element-caption\"><em>Latency based routing responding to a performance degradation by moving requests to a better performing vendor<\/em><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Capacity Isolation<\/h3>\n\n\n\n<p>A classic failure mode for systems that share underlying resources is that a less important use case (like an asynchronous task for translating conversations) ends up exhausting the shared resource, impacting more important use cases. In our case, Fin is the most important use case we want to keep serving.<\/p>\n\n\n\n<p>To achieve that, our routing framework lets us define separate pools of capacity for Fin versus everything else. Each pool only has access to certain vendors or vendor regions. This isolation prevents Fin from ever being impacted by a less important use case.<\/p>\n\n\n\n<p>If Fin\u2019s assigned pool is exhausted, Fin can draw from other capacity, but lower-priority uses are never allowed to encroach on Fin\u2019s pool.<\/p>\n\n\n\n<h2 id=\"operational-safeguards-and-monitoring\" class=\"wp-block-heading\">Operational Safeguards and Monitoring<\/h2>\n\n\n\n<p>We have protections and processes in place to ensure the system does not diverge from its current reliable state and can withstand Fin\u2019s exponential growth.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Single Point of Failure reporting<\/h3>\n\n\n\n<p>We actively track each model and its setup for redundancy and capacity isolation. 
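<\/p>\n\n\n\n<p>Conceptually, the check walks every model\u2019s routes and flags any model that resolves to a single vendor (a simplified sketch, not our production code):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">def single_points_of_failure(routes):\n    for model, envs in routes.items():\n        vendors = {\n            provider.vendor\n            for route in envs.values()\n            for provider in route.providers()\n        }\n        if len(vendors) &lt; 2:\n            yield model  # only one vendor can serve this model<\/code><\/pre>\n\n\n\n<p>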
If any model falls short, we generate a high-priority alert and resolve the issue immediately.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdFe-n5XP93wUU4Itl5qXFvSxLaQC8fepuBfmVhYO_Sw6vzimIpJnGptqT3f-lGUOA3SNZ4aKGEGvgEVsVsV2h2MU72k_Q7g63ujenS6gjAkP2NgBmAZUw33s0ZRFdfhUmUYxt9?key=hnhyEA4oWmPZs8PbYgF6Uw\" alt=\"\"\/><figcaption class=\"wp-element-caption\"><em>Example of an alert generated when the system detects a potential single point of failure<\/em><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Noisy Neighbor Protection<\/h3>\n\n\n\n<p><a href=\"https:\/\/docs.aws.amazon.com\/wellarchitected\/latest\/saas-lens\/noisy-neighbor.html\">Noisy Neighbor<\/a> is a well-understood problem in multi-tenant systems. Our protections ensure a single customer or process cannot monopolise resources to the detriment of others.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Load Testing<\/h3>\n\n\n\n<p>We regularly perform load testing to proactively ensure our systems can support growth in Fin&#8217;s usage and maintain performance under heavy demand.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Maintaining Buffer LLM Capacity through strong relationships<\/h3>\n\n\n\n<p>Reliable operation requires buffer capacity. Through strong relationships with major vendors like OpenAI, Anthropic, AWS, Google, and Azure, we maintain ample headroom, with the ability to handle two to three times Fin\u2019s normal traffic at any point.<\/p>\n\n\n\n<p>This buffer capacity is not easy to come by, and we have our account managers to thank for championing Intercom whenever we need extra capacity to run Fin reliably.<\/p>\n\n\n\n<h2 id=\"observability\" class=\"wp-block-heading\">Observability<\/h2>\n\n\n\n<p>Intercom has a strong observability culture. 
We have written before about <a href=\"https:\/\/www.intercom.com\/blog\/engineering-observability\/\">improving our observability posture<\/a> and how we like to <a href=\"https:\/\/www.intercom.com\/blog\/stop-monitoring-systems-start-monitoring-outcomes\/\">focus on the customer outcome<\/a>. Instrumenting every LLM call, we collect data on token usage, response times, and system load across vendors. These insights drive capacity planning, vendor selection, and rapid troubleshooting.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXfeJ7PdpNRavQW0eIbvA9TVTb3KPTyiZoTVIB_pKaMqpWQ63QaWHlBtwS_YlR_Zcys4YbT4Vfo9P5ROoC_o8b4BmLYifuKWcitqflA_98aOTcDdrw85-GZfQcSaRdMz9Qliax4w8A?key=hnhyEA4oWmPZs8PbYgF6Uw\" alt=\"\"\/><figcaption class=\"wp-element-caption\"><em>Datadog chart showing total tokens used for a sample usecase<\/em><\/figcaption><\/figure>\n\n\n\n<p>All of these sophisticated systems and processes ensure that Fin runs reliably, and the whole is greater than the sum of its parts.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdNLNX_M6YdQYHC7yYhPcKGEXBR52-VHmj29bGaCZxa9OvUHu4OvMbcBa1ugMZsvDcDe-3ixFM4oq13fEI9aBjxar1CaD_Ubu3kxYj_hk4SUT8Y3oCqsQrUWiJK0j-dvyIY0OmqpQ?key=hnhyEA4oWmPZs8PbYgF6Uw\" alt=\"\" style=\"width:350px;height:auto\"\/><figcaption class=\"wp-element-caption\"><em>Fin Uptime for the month of July<\/em><\/figcaption><\/figure>\n\n\n\n<h2 id=\"future\" class=\"wp-block-heading\">Future<\/h2>\n\n\n\n<p>We will continue investing and making sure we can run Fin reliably as the demand for it grows. 
Here are some things we plan on taking up in the future.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Request Prioritization<\/h3>\n\n\n\n<p>The current architecture protects Fin from other use cases exhausting all the LLM resources, but much of the time Fin doesn\u2019t need all the capacity we have provisioned for it. In these cases, the extra capacity could be shared with other use cases that otherwise run constrained. This can be achieved by assigning a priority to each LLM request and dropping lower-priority requests when LLM capacity is constrained. Such a solution would let Fin use all the capacity it needs while leaving other use cases unconstrained the rest of the time.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Chaos Monkey<\/h3>\n\n\n\n<p>So far we haven\u2019t needed this, as regular vendor outages keep our systems well tested. But as vendors grow more reliable and as we move towards more automated failovers, short-lived outages can go undetected. This can give us a false sense of security, which makes it important to regularly test our systems by causing controlled outages and verifying that Fin won\u2019t be impacted by a particular vendor or model going down.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Exploring using an off-the-shelf proxy for routing<\/h3>\n\n\n\n<p>Fin\u2019s reliability and elasticity are a competitive advantage for Intercom, so we are happy to lean towards \u201cbuild\u201d in the build vs buy decision. This was more straightforward when we started building our routing framework, as there weren\u2019t many options available that provided the flexibility we needed.<\/p>\n\n\n\n<p>But tools like LiteLLM and Bifrost have covered a lot of ground since then. While we are proud of what we have built, we don\u2019t want to maintain a system if we don\u2019t need to. 
We will still need to make sure we don\u2019t introduce a new single point of failure by using these tools, but they look promising and are worth exploring.<\/p>\n\n\n\n<p>Ultimately, the real measure of our work is that Fin stays available to solve problems for the people who depend on it. The work is ongoing, and reliability is never finished, but with these systems in place, we ensure Fin remains a reliable part of our customers\u2019 workflows.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Building reliable large language model (LLM) inference is still an emerging discipline. Although the field has matured considerably in recent years, we are far from the level of dependability seen in industry-standard services such as Amazon&hellip;<\/p>\n","protected":false},"author":21,"featured_media":158,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"coauthors":[16],"class_list":["post-416","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v24.6 (Yoast SEO v24.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Fin: Running a Reliable Service over Unreliable Parts - \/research<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Fin: Running a Reliable Service over Unreliable Parts\" \/>\n<meta property=\"og:description\" content=\"Building reliable large language model (LLM) inference is still an emerging discipline. 
Although the field has matured considerably in recent years, we are far from the level of dependability seen in industry-standard services such as Amazon&hellip;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/\" \/>\n<meta property=\"og:site_name\" content=\"\/research\" \/>\n<meta property=\"article:published_time\" content=\"2025-08-08T16:51:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1344\" \/>\n\t<meta property=\"og:image:height\" content=\"896\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Ketan Bhatt\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@intercom\" \/>\n<meta name=\"twitter:site\" content=\"@intercom\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Ketan Bhatt\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/\"},\"author\":{\"name\":\"Ketan Bhatt\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/081a05bc45e69c9357769378d6120ed2\"},\"headline\":\"Fin: Running a Reliable Service over Unreliable Parts\",\"datePublished\":\"2025-08-08T16:51:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/\"},\"wordCount\":1350,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/\",\"url\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/\",\"name\":\"Fin: Running a Reliable Service over Unreliable Parts - 
\/research\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png\",\"datePublished\":\"2025-08-08T16:51:00+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#primaryimage\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png\",\"width\":1344,\"height\":896},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/fin.ai\/research\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Fin: Running a Reliable Service over Unreliable Parts\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/fin.ai\/research\/#website\",\"url\":\"https:\/\/fin.ai\/research\/\",\"name\":\"Intercom.ai\",\"description\":\"Insights and blogs from the AI Group building Fin at 
Intercom\",\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/fin.ai\/research\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/fin.ai\/research\/#organization\",\"name\":\"Intercom.ai\",\"url\":\"https:\/\/fin.ai\/research\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"width\":1024,\"height\":1024,\"caption\":\"Intercom.ai\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/intercom\",\"https:\/\/www.linkedin.com\/company\/intercom\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/081a05bc45e69c9357769378d6120ed2\",\"name\":\"Ketan Bhatt\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/image\/75befbfa6fe8aec903fac70c46847a86\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/b852fe1c1b88c4f44667b8463c035d66?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/b852fe1c1b88c4f44667b8463c035d66?s=96&d=mm&r=g\",\"caption\":\"Ketan Bhatt\"},\"url\":\"https:\/\/fin.ai\/research\/author\/ketan-bhatt\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Fin: Running a Reliable Service over Unreliable Parts - \/research","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/","og_locale":"en_US","og_type":"article","og_title":"Fin: Running a Reliable Service over Unreliable Parts","og_description":"Building reliable large language model (LLM) inference is still an emerging discipline. Although the field has matured considerably in recent years, we are far from the level of dependability seen in industry-standard services such as Amazon&hellip;","og_url":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/","og_site_name":"\/research","article_published_time":"2025-08-08T16:51:00+00:00","og_image":[{"width":1344,"height":896,"url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png","type":"image\/png"}],"author":"Ketan Bhatt","twitter_card":"summary_large_image","twitter_creator":"@intercom","twitter_site":"@intercom","twitter_misc":{"Written by":"Ketan Bhatt","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#article","isPartOf":{"@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/"},"author":{"name":"Ketan Bhatt","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/081a05bc45e69c9357769378d6120ed2"},"headline":"Fin: Running a Reliable Service over Unreliable Parts","datePublished":"2025-08-08T16:51:00+00:00","mainEntityOfPage":{"@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/"},"wordCount":1350,"commentCount":0,"publisher":{"@id":"https:\/\/fin.ai\/research\/#organization"},"image":{"@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#primaryimage"},"thumbnailUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png","inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/","url":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/","name":"Fin: Running a Reliable Service over Unreliable Parts - 
\/research","isPartOf":{"@id":"https:\/\/fin.ai\/research\/#website"},"primaryImageOfPage":{"@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#primaryimage"},"image":{"@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#primaryimage"},"thumbnailUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png","datePublished":"2025-08-08T16:51:00+00:00","breadcrumb":{"@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#primaryimage","url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png","contentUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-14.png","width":1344,"height":896},{"@type":"BreadcrumbList","@id":"https:\/\/fin.ai\/research\/fin-running-a-reliable-service-over-unreliable-parts\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/fin.ai\/research\/"},{"@type":"ListItem","position":2,"name":"Fin: Running a Reliable Service over Unreliable Parts"}]},{"@type":"WebSite","@id":"https:\/\/fin.ai\/research\/#website","url":"https:\/\/fin.ai\/research\/","name":"Intercom.ai","description":"Insights and blogs from the AI Group building Fin at 
Intercom","publisher":{"@id":"https:\/\/fin.ai\/research\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/fin.ai\/research\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/fin.ai\/research\/#organization","name":"Intercom.ai","url":"https:\/\/fin.ai\/research\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/","url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png","contentUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png","width":1024,"height":1024,"caption":"Intercom.ai"},"image":{"@id":"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/intercom","https:\/\/www.linkedin.com\/company\/intercom"]},{"@type":"Person","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/081a05bc45e69c9357769378d6120ed2","name":"Ketan Bhatt","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/image\/75befbfa6fe8aec903fac70c46847a86","url":"https:\/\/secure.gravatar.com\/avatar\/b852fe1c1b88c4f44667b8463c035d66?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/b852fe1c1b88c4f44667b8463c035d66?s=96&d=mm&r=g","caption":"Ketan 
Bhatt"},"url":"https:\/\/fin.ai\/research\/author\/ketan-bhatt\/"}]}},"_links":{"self":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/416","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/comments?post=416"}],"version-history":[{"count":0,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/416\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media\/158"}],"wp:attachment":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media?parent=416"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/categories?post=416"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/tags?post=416"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/coauthors?post=416"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}