{"id":399,"date":"2025-09-10T15:56:18","date_gmt":"2025-09-10T15:56:18","guid":{"rendered":"https:\/\/fin.ai\/research\/?p=399"},"modified":"2025-09-12T09:39:24","modified_gmt":"2025-09-12T09:39:24","slug":"cost-of-serving-llms","status":"publish","type":"post","link":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/","title":{"rendered":"Cost of Serving LLMs"},"content":{"rendered":"\n<p>TLDR: we explored AWS hardware options (and serving engines), and it turned out that a self-serving LLM can be significantly more cost-effective than commercial APIs.<\/p>\n\n\n\n<h2 id=\"introduction\" class=\"wp-block-heading\"><strong>Introduction<\/strong><\/h2>\n\n\n\n<p>Fin is an advanced customer support AI agent powered by a combination of large language models (LLMs) and retrieval-augmented generation (RAG) techniques. This hybrid engine enables Fin to deliver accurate, context-aware responses. However, Fin often relies on generalist LLMs provided by vendors such as OpenAI and Anthropic\u2014solutions that can be both expensive and slow due to the size and complexity of the models.<\/p>\n\n\n\n<p>We wanted to explore whether certain tasks currently handled by these large external models could instead be managed by smaller, fine-tuned LLMs that we host ourselves. Our goal was to evaluate the feasibility and cost-effectiveness of serving these models on our own AWS infrastructure.<\/p>\n\n\n\n<p>This analysis involved:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Exploring AWS hardware options and serving engines<\/li>\n\n\n\n<li>Benchmarking various open-weight models<\/li>\n\n\n\n<li>Estimating cost differences between self-hosting and API usage<\/li>\n<\/ul>\n\n\n\n<h2 id=\"cost-comparison-hosted-vs-provider\" class=\"wp-block-heading\"><strong>Cost Comparison: Hosted vs. Provider<\/strong><\/h2>\n\n\n\n<p>Comparing the cost of self-hosting LLMs versus using provider APIs is not straightforward. 
The pricing models differ fundamentally:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Self-hosting:<\/strong> You pay for infrastructure (e.g., GPU instances) over time.<\/li>\n\n\n\n<li><strong>Provider APIs:<\/strong> You pay per token processed, with separate pricing for prompt, completion, and cached tokens.<\/li>\n<\/ul>\n\n\n\n<p>In AWS, costs vary significantly depending on whether instances are reserved in advance or acquired on demand\u2014reservation can be 2\u20133\u00d7 cheaper. To fairly compare costs, we estimated the number of reserved instances required to serve our existing traffic reliably.<\/p>\n\n\n\n<p>Instead of using &#8220;requests per second&#8221; (common for traditional web services), we measured <strong>tokens per second (TPS)<\/strong>, as token usage can vary greatly per request in LLM services.<\/p>\n\n\n\n<p>We used two key data points:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Maximum sustainable token throughput per instance per model<\/strong><\/li>\n\n\n\n<li><strong>Peak token usage per minute from historical Fin traffic<\/strong><\/li>\n<\/ul>\n\n\n\n<p>This allowed us to estimate the number of nodes or GPUs needed per feature and model, and ultimately the cost to serve each model.<\/p>\n\n\n\n<h2 id=\"benchmarking-setup\" class=\"wp-block-heading\"><strong>Benchmarking Setup<\/strong><\/h2>\n\n\n\n<p>We benchmarked a variety of open-weight LLMs across multiple AWS hardware configurations. 
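<\/p>\n\n\n\n<p>The node-count estimate described in the previous section boils down to a few lines of arithmetic. The sketch below is illustrative only: the traffic, throughput, and price figures are placeholders rather than our production numbers, and a real estimate should add headroom for traffic spikes and redundancy.<\/p>\n\n\n\n

```python
import math

# Placeholder inputs -- substitute real measurements for your own service.
peak_tokens_per_minute = 8_400_000  # peak token usage from historical traffic
max_tps_per_instance = 70_000       # max sustainable tokens/sec per instance
reserved_price_per_hour = 1.91      # reserved-instance price in $/hr

# Convert peak traffic to tokens/sec, then size the fleet with 20% headroom.
peak_tps = peak_tokens_per_minute / 60
instances = math.ceil(peak_tps * 1.2 / max_tps_per_instance)

# Approximate monthly cost of the reserved fleet (~730 hours per month).
monthly_cost = instances * reserved_price_per_hour * 730
```

\n\n\n\n<p>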
The models included:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Qwen 3<\/strong>: 14B, 30B, 235B<\/li>\n\n\n\n<li><strong>Gemma 3<\/strong>: 4B, 27B<\/li>\n\n\n\n<li><strong>DeepSeek Llama<\/strong>: 8B, 70B (distilled)<\/li>\n<\/ul>\n\n\n\n<p>We tested these models using real Fin prompts, organized into three datasets categorized by prompt complexity and model size.<\/p>\n\n\n\n<p><strong>Hardware tested:<\/strong><\/p>\n\n\n\n<p>For simplicity and flexibility, we used EC2 instances for the benchmarks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>g6e (L40s GPUs)<\/li>\n\n\n\n<li>p4d\/p4de (A100)<\/li>\n\n\n\n<li>p5\/p5e (H100\/H200)<\/li>\n<\/ul>\n\n\n\n<p><strong>Inference engines tested:<\/strong><\/p>\n\n\n\n<p>Many inference engines are available today; we focused on the most popular ones: vLLM, SGLang and TensorRT-LLM. However, because TensorRT-LLM did not support the Qwen 3 and Gemma 3 models at the time, we proceeded with vLLM and SGLang only. With default settings, both engines showed comparable performance across configurations.<\/p>\n\n\n\n<p><strong>Latency constraints:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Time to First Token (TTFT): \u2264 250ms<\/li>\n\n\n\n<li>Inter-Token Latency (ITL): \u2264 30ms<\/li>\n<\/ul>\n\n\n\n<p>These targets were based on Fin\u2019s production latency requirements. Our goal was to <strong>maximise TPS<\/strong> without exceeding these latency thresholds.<\/p>\n\n\n\n<h2 id=\"results\" class=\"wp-block-heading\"><strong>Results<\/strong><\/h2>\n\n\n\n<p>We compared our internal serving cost estimates against popular provider models:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>GPT-4.1 mini<\/strong><\/li>\n\n\n\n<li><strong>GPT-4.1<\/strong><\/li>\n\n\n\n<li><strong>Anthropic Sonnet 3.7<\/strong><\/li>\n<\/ul>\n\n\n\n<p>We calculated monthly token usage per dataset (prompt, output, cached) to estimate provider costs. 
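<\/p>\n\n\n\n<p>That provider-side estimate is just a weighted sum of token volumes. A minimal sketch, assuming illustrative per-million-token prices (real prices change, so always check the providers\u2019 current price lists):<\/p>\n\n\n\n

```python
# Illustrative per-million-token prices in USD -- placeholders, not quotes.
PRICES = {
    "gpt-4.1":      {"prompt": 2.00, "cached": 0.50, "completion": 8.00},
    "gpt-4.1-mini": {"prompt": 0.40, "cached": 0.10, "completion": 1.60},
}

def monthly_api_cost(model, prompt_m, cached_m, completion_m):
    """Estimate monthly API spend from token volumes given in millions of tokens."""
    p = PRICES[model]
    return (prompt_m * p["prompt"]
            + cached_m * p["cached"]
            + completion_m * p["completion"])

# Example month: 900M prompt, 300M cached, 120M completion tokens.
cost = monthly_api_cost("gpt-4.1", 900, 300, 120)
```

\n\n\n\n<p>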
We then benchmarked each model across all hardware configurations using both vLLM and SGLang.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Cost Ratio Comparison<\/strong><\/h3>\n\n\n\n<p>Each value below is the ratio of our estimated self-hosting cost to the corresponding provider API cost; values under 1.0 mean self-hosting is cheaper.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th>Model<\/th><th class=\"has-text-align-center\" data-align=\"center\">Cost vs GPT-4.1<\/th><th class=\"has-text-align-center\" data-align=\"center\">Cost vs GPT-4.1 mini<\/th><th class=\"has-text-align-center\" data-align=\"center\">Cost vs Sonnet 3.7<\/th><\/tr><\/thead><tbody><tr><td>Gemma 3 4B<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.04<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.20<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.01<\/td><\/tr><tr><td>DeepSeek Llama 8B<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.05<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.27<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.01<\/td><\/tr><tr><td>Qwen 3 14B<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.05<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.27<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.01<\/td><\/tr><tr><td>Gemma 3 27B<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.34<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.71<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.08<\/td><\/tr><tr><td>Qwen 3 30B<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.42<\/td><td class=\"has-text-align-center\" data-align=\"center\">2.12<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.10<\/td><\/tr><tr><td>DeepSeek Llama 70B<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.70<\/td><td class=\"has-text-align-center\" data-align=\"center\">8.49<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.10<\/td><\/tr><tr><td>Qwen 
3 235B<\/td><td class=\"has-text-align-center\" data-align=\"center\">2.17<\/td><td class=\"has-text-align-center\" data-align=\"center\">10.83<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.40<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Key insight<\/strong>: Smaller models (14B and below) were significantly cheaper to serve than GPT-4.1 mini, while mid-sized models like Gemma 3 27B and Qwen 3 30B were still cheaper than GPT-4.1, though no longer cheaper than GPT-4.1 mini.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Hardware Cost Efficiency (Qwen 3 14B)<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-center\" data-align=\"center\"><strong>Hardware<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Cost\/hr<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Max TPS<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>TPS per $\/hr<\/strong><\/th><\/tr><\/thead><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\">1\u00d7L40s<\/td><td class=\"has-text-align-center\" data-align=\"center\">$2.37<\/td><td class=\"has-text-align-center\" data-align=\"center\">34,000<\/td><td class=\"has-text-align-center\" data-align=\"center\">14,345.99<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">4\u00d7L40s<\/td><td class=\"has-text-align-center\" data-align=\"center\">$10.48<\/td><td class=\"has-text-align-center\" data-align=\"center\">36,000<\/td><td class=\"has-text-align-center\" data-align=\"center\">3,435.11<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">1\u00d7A100<\/td><td class=\"has-text-align-center\" data-align=\"center\">$1.91<\/td><td class=\"has-text-align-center\" data-align=\"center\">70,000<\/td><td class=\"has-text-align-center\" data-align=\"center\">36,649.21<\/td><\/tr><tr><td class=\"has-text-align-center\" 
data-align=\"center\">1\u00d7H100<\/td><td class=\"has-text-align-center\" data-align=\"center\">$3.90<\/td><td class=\"has-text-align-center\" data-align=\"center\">135,000<\/td><td class=\"has-text-align-center\" data-align=\"center\">34,615.38<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Observations:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>g6e (L40s)<\/strong>: Multi-GPU setups lacked NVLink and relied on PCIe, making scaling inefficient.<\/li>\n\n\n\n<li><strong>p4\/p5 (A100\/H100)<\/strong>: NVLink-enabled bandwidth allowed efficient tensor parallelism, boosting performance and cost-efficiency.<\/li>\n\n\n\n<li><strong>A100 80GB<\/strong>: Outperformed L40s in both performance and cost.<\/li>\n\n\n\n<li><strong>H100\/H200<\/strong>: Though expensive, delivered high throughput and favourable cost\/TPS ratios.<\/li>\n<\/ul>\n\n\n\n<h2 id=\"conclusion\" class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>Our benchmarks showed that <strong>self-hosting smaller and medium-sized LLMs can be significantly more cost-effective<\/strong> than relying on commercial APIs\u2014especially with models under 30B parameters. To serve larger models efficiently, further work is needed to optimise engine performance. While EC2 provided flexibility for benchmarking, it\u2019s not ideal for production. We plan to transition to <strong>EKS<\/strong> (Elastic Kubernetes Service) and potentially <strong>Sagemaker<\/strong> <strong>Hyperpod<\/strong> for live model serving.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>TLDR: we explored AWS hardware options (and serving engines), and it turned out that a self-serving LLM can be significantly more cost-effective than commercial APIs. 
Introduction Fin is an advanced customer support AI agent powered by&hellip;<\/p>\n","protected":false},"author":41,"featured_media":156,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"coauthors":[29],"class_list":["post-399","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v24.6 (Yoast SEO v24.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Cost of Serving LLMs - \/research<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Cost of Serving LLMs\" \/>\n<meta property=\"og:description\" content=\"TLDR: we explored AWS hardware options (and serving engines), and it turned out that a self-serving LLM can be significantly more cost-effective than commercial APIs. 
Introduction Fin is an advanced customer support AI agent powered by&hellip;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/\" \/>\n<meta property=\"og:site_name\" content=\"\/research\" \/>\n<meta property=\"article:published_time\" content=\"2025-09-10T15:56:18+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-09-12T09:39:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1344\" \/>\n\t<meta property=\"og:image:height\" content=\"896\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Stefan Ivanovici\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@intercom\" \/>\n<meta name=\"twitter:site\" content=\"@intercom\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Stefan Ivanovici\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/\"},\"author\":{\"name\":\"Stefan Ivanovici\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/94b52998ba014b7284ff40fc63997f69\"},\"headline\":\"Cost of Serving LLMs\",\"datePublished\":\"2025-09-10T15:56:18+00:00\",\"dateModified\":\"2025-09-12T09:39:24+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/\"},\"wordCount\":739,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/\",\"url\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/\",\"name\":\"Cost of Serving LLMs - 
\/research\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png\",\"datePublished\":\"2025-09-10T15:56:18+00:00\",\"dateModified\":\"2025-09-12T09:39:24+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#primaryimage\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png\",\"width\":1344,\"height\":896},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/fin.ai\/research\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Cost of Serving LLMs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/fin.ai\/research\/#website\",\"url\":\"https:\/\/fin.ai\/research\/\",\"name\":\"Intercom.ai\",\"description\":\"Insights and blogs from the AI Group building Fin at 
Intercom\",\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/fin.ai\/research\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/fin.ai\/research\/#organization\",\"name\":\"Intercom.ai\",\"url\":\"https:\/\/fin.ai\/research\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"width\":1024,\"height\":1024,\"caption\":\"Intercom.ai\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/intercom\",\"https:\/\/www.linkedin.com\/company\/intercom\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/94b52998ba014b7284ff40fc63997f69\",\"name\":\"Stefan Ivanovici\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/image\/ecffa390eac24a9cee1c6fcbf52920a6\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/c6ad8d42e297a94fae3c4363f97e0038?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/c6ad8d42e297a94fae3c4363f97e0038?s=96&d=mm&r=g\",\"caption\":\"Stefan Ivanovici\"},\"description\":\"is a Senior Product Engineer in Intercom's AI Team. Most of his current work is related to fine-tuning LLMs used in Fin and other parts of Intercom product.\",\"url\":\"https:\/\/fin.ai\/research\/author\/stefan-ivanovici\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Cost of Serving LLMs - \/research","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/","og_locale":"en_US","og_type":"article","og_title":"Cost of Serving LLMs","og_description":"TLDR: we explored AWS hardware options (and serving engines), and it turned out that a self-serving LLM can be significantly more cost-effective than commercial APIs. Introduction Fin is an advanced customer support AI agent powered by&hellip;","og_url":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/","og_site_name":"\/research","article_published_time":"2025-09-10T15:56:18+00:00","article_modified_time":"2025-09-12T09:39:24+00:00","og_image":[{"width":1344,"height":896,"url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png","type":"image\/png"}],"author":"Stefan Ivanovici","twitter_card":"summary_large_image","twitter_creator":"@intercom","twitter_site":"@intercom","twitter_misc":{"Written by":"Stefan Ivanovici","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#article","isPartOf":{"@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/"},"author":{"name":"Stefan Ivanovici","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/94b52998ba014b7284ff40fc63997f69"},"headline":"Cost of Serving LLMs","datePublished":"2025-09-10T15:56:18+00:00","dateModified":"2025-09-12T09:39:24+00:00","mainEntityOfPage":{"@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/"},"wordCount":739,"commentCount":0,"publisher":{"@id":"https:\/\/fin.ai\/research\/#organization"},"image":{"@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#primaryimage"},"thumbnailUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png","inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/fin.ai\/research\/cost-of-serving-llms\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/","url":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/","name":"Cost of Serving LLMs - 
\/research","isPartOf":{"@id":"https:\/\/fin.ai\/research\/#website"},"primaryImageOfPage":{"@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#primaryimage"},"image":{"@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#primaryimage"},"thumbnailUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png","datePublished":"2025-09-10T15:56:18+00:00","dateModified":"2025-09-12T09:39:24+00:00","breadcrumb":{"@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/fin.ai\/research\/cost-of-serving-llms\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#primaryimage","url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png","contentUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1.png","width":1344,"height":896},{"@type":"BreadcrumbList","@id":"https:\/\/fin.ai\/research\/cost-of-serving-llms\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/fin.ai\/research\/"},{"@type":"ListItem","position":2,"name":"Cost of Serving LLMs"}]},{"@type":"WebSite","@id":"https:\/\/fin.ai\/research\/#website","url":"https:\/\/fin.ai\/research\/","name":"Intercom.ai","description":"Insights and blogs from the AI Group building Fin at 
Intercom","publisher":{"@id":"https:\/\/fin.ai\/research\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/fin.ai\/research\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/fin.ai\/research\/#organization","name":"Intercom.ai","url":"https:\/\/fin.ai\/research\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/","url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png","contentUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png","width":1024,"height":1024,"caption":"Intercom.ai"},"image":{"@id":"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/intercom","https:\/\/www.linkedin.com\/company\/intercom"]},{"@type":"Person","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/94b52998ba014b7284ff40fc63997f69","name":"Stefan Ivanovici","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/image\/ecffa390eac24a9cee1c6fcbf52920a6","url":"https:\/\/secure.gravatar.com\/avatar\/c6ad8d42e297a94fae3c4363f97e0038?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/c6ad8d42e297a94fae3c4363f97e0038?s=96&d=mm&r=g","caption":"Stefan Ivanovici"},"description":"is a Senior Product Engineer in Intercom's AI Team. 
Most of his current work is related to fine-tuning LLMs used in Fin and other parts of Intercom product.","url":"https:\/\/fin.ai\/research\/author\/stefan-ivanovici\/"}]}},"_links":{"self":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/399","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/comments?post=399"}],"version-history":[{"count":0,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/399\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media\/156"}],"wp:attachment":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media?parent=399"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/categories?post=399"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/tags?post=399"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/coauthors?post=399"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}