{"id":359,"date":"2025-09-11T22:45:50","date_gmt":"2025-09-11T22:45:50","guid":{"rendered":"https:\/\/fin.ai\/research\/?p=359"},"modified":"2025-09-12T09:37:30","modified_gmt":"2025-09-12T09:37:30","slug":"david-vs-goliath-are-small-llms-any-good","status":"publish","type":"post","link":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/","title":{"rendered":"David vs Goliath: are small LLMs any good?"},"content":{"rendered":"\n<p class=\"has-2-xl-font-size\">Are smaller fine-tuned LLMs competent for Intercom scale tasks?<\/p>\n\n\n\n<p>Large Language Models (LLMs) are a powerful technology that has turned reasoning in natural language into a service. They&#8217;ve had a huge impact on customer support, powering agents like Fin. Fin is already delivering real value, with many customers routinely experiencing resolution rates in the high 70s, and an overall average resolution rate <a href=\"https:\/\/fin.ai\/#performance\">across all customers of upwards of 60%<\/a>.\u00a0<\/p>\n\n\n\n<p>Now that Fin is a mature product, we can start testing more ambitious ideas. One key hypothesis is that for narrow, well-scoped tasks, we might match the performance of much larger models by training smaller, more efficient ones on enough high-quality data.<\/p>\n\n\n\n<h2 id=\"fin-primer\" class=\"wp-block-heading\">Fin primer&nbsp;<\/h2>\n\n\n\n<p>A part of Fin\u2019s architecture builds on the RAG foundations to achieve an optimal experience for <em>informational<\/em> customer support use cases. This is built with components that try to understand the message exchanges with the end user and summarise the user\u2019s problem, retrieve relevant information as passages, rerank them, and then generate an answer. 
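<\/p>\n\n\n\n<p><em>To make this flow concrete, here is a minimal, illustrative sketch of such a summarise, retrieve, rerank and generate pipeline. Every component below is a toy stand-in, not Fin&#8217;s actual implementation:<\/em><\/p>\n\n\n\n

```python
# Illustrative sketch of the informational RAG flow: summarise the user's
# problem, retrieve relevant passages, rerank them, then generate an answer.
# Every component here is a toy stand-in, not Fin's actual implementation.

def summarise_issue(conversation):
    # Toy summariser: treat the last user message as the issue.
    return conversation[-1] if conversation else None

def retrieve(issue, corpus):
    # Toy retriever: keep passages sharing at least one word with the issue.
    words = set(issue.lower().split())
    return [p for p in corpus if words & set(p.lower().split())]

def rerank(issue, passages):
    # Toy reranker: order passages by word overlap with the issue.
    words = set(issue.lower().split())
    return sorted(passages, key=lambda p: -len(words & set(p.lower().split())))

def generate(issue, passages):
    # Toy generator: echo the single best passage as the "answer".
    return passages[0] if passages else None

def answer(conversation, corpus):
    issue = summarise_issue(conversation)
    if issue is None:
        return None
    return generate(issue, rerank(issue, retrieve(issue, corpus)))
```

\n\n\n\n<p>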
The diagram below is a rough representation of how this flow works.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"343\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-13-1-1024x343.png\" alt=\"\" class=\"wp-image-392\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-13-1-1024x343.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-13-1-300x101.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-13-1-768x258.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-13-1-1536x515.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-13-1-2048x687.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The goal here is to first focus on a narrow, well-scoped task that we could train a smaller LLM for: detecting and extracting the user&#8217;s issue.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Detect and extract issue summary&nbsp;<\/h3>\n\n\n\n<p>Issue detection and extraction is an important component of Fin\u2019s RAG pipeline, where a series of messages between the end user and Fin are transformed into a single answerable summary issue, which is then used by the downstream retrieval pipeline.\u00a0<\/p>\n\n\n\n<p><strong>The problem?<\/strong> Not all conversations have addressable issues. 
The old baseline setup used just one &#8220;issue detection and extraction&#8221; prompt: if there was no outstanding issue, it returned None.&nbsp;<\/p>\n\n\n\n<p>But in reality, issue detection has lots of tricky edge cases, like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If a user gave negative feedback at the end, we want to catch it as non-informational, even if there&#8217;s still an outstanding issue.<\/li>\n\n\n\n<li>Users often mix feedback, greetings, or random noise with updates to their previous request, making intent hard to spot.<\/li>\n<\/ul>\n\n\n\n<p>To deal with this, our prompt kept growing to cover the nuances of non-issues \u2013 over 80 few-shot examples, &gt;5k tokens of instructions, and 17 defined non-informational categories. A few categories of non-issue examples can be found in Table 1 below.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th>Category<\/th><th>Examples<\/th><\/tr><\/thead><tbody><tr><td>Greetings<\/td><td>\u201cHi\u201d, \u201cHello\u201d, \u201cGood morning\u201d<\/td><\/tr><tr><td>Goodbyes<\/td><td>\u201cBye\u201d, \u201cSee you\u201d, \u201cGoodbye\u201d, \u201cThat\u2019s all\u201d<\/td><\/tr><tr><td>Negative reactions (no new info)<\/td><td>\u201cNo\u201d, \u201cUseless\u201d, \u201cNot helpful\u201d, \u201cWTF\u201d<\/td><\/tr><tr><td>Acknowledgments<\/td><td>\u201cOk\u201d, \u201cGot it\u201d, \u201cUnderstood\u201d, \u201cMakes sense\u201d<\/td><\/tr><tr><td>Gratitude<\/td><td>\u201cThank you\u201d, \u201cThanks\u201d, \u201cThx\u201d, \u201cYou\u2019re the best\u201d<\/td><\/tr><tr><td>Positive reactions<\/td><td>\u201cPerfect\u201d, \u201cAwesome\u201d, \u201cGreat\u201d, \u201cCool\u201d<\/td><\/tr><tr><td>Connection checks<\/td><td>\u201cAre you there?\u201d, \u201cAre you still online?\u201d<\/td><\/tr><tr><td>Pleasantries<\/td><td>\u201cHow are you?\u201d, \u201cWhat\u2019s up?\u201d, \u201cNice to meet 
you\u201d<\/td><\/tr><tr><td>Small talk<\/td><td>\u201cNice weather\u201d, \u201cMerry Christmas!\u201d<\/td><\/tr><tr><td>Meta-commentary<\/td><td>\u201cThat\u2019s interesting\u201d, \u201cYou\u2019re fast\u201d<\/td><\/tr><tr><td>Bot identity questions<\/td><td>\u201cWho are you?\u201d, \u201cAm I talking to AI?\u201d<\/td><\/tr><tr><td>Fillers &amp; expressions<\/td><td>\u201chmmm\u201d, \u201cummm\u201d, \u201chaha\u201d, \u201clol\u201d, \u201c\ud83d\ude02\u201d<\/td><\/tr><tr><td>Testing<\/td><td>\u201ctest\u201d, \u201ctesting\u201d, \u201cping\u201d, \u201chello world\u201d<\/td><\/tr><tr><td>Gibberish<\/td><td>\u201casldkjfasldkjf\u201d, \u201coompa loompa\u201d, \u201c123\u201d, \u201caaa\u201d, \u201c\u2026\u201d<\/td><\/tr><tr><td>Thinking<\/td><td>\u201cLet me think\u201d, \u201cOne moment\u201d, \u201cbrb\u201d<\/td><\/tr><tr><td>Customer withdraws request<\/td><td>\u201cNever mind\u201d, \u201cDon\u2019t worry about it\u201d, \u201cIgnore that\u201d<\/td><\/tr><tr><td>Indicating a question without stating it<\/td><td>\u201cI have a question\u201d, \u201cWait, I have something else\u201d<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>These nuances of issue detection made this prompt a prime candidate for experimenting with custom modelling. 
We can split the problem into two independent models: one that classifies whether an interaction contains an issue, and another that extracts the issue when the first model detects one.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"425\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-12-1-1024x425.png\" alt=\"\" class=\"wp-image-391\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-12-1-1024x425.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-12-1-300x124.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-12-1-768x318.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-12-1-1536x637.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-12-1-2048x849.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>This split strategy now makes the <strong>issue extraction task on its own a narrow task<\/strong>, allowing us to experiment with fine-tuned LLMs.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do we measure success?&nbsp;<\/h3>\n\n\n\n<p>Before we talk about the model training effort, we need a clear definition of what success looks like. The following metrics are the key indicators of Fin\u2019s health and performance:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Offline metrics: <\/strong>These metrics are measured by locally replaying a sample of Fin&#8217;s production requests via the new feature:\n<ul class=\"wp-block-list\">\n<li><strong>Answer rates:<\/strong> This rate measures the fraction of times Fin was able to provide an answer for a real production query, when the new models were injected in the RAG process. Any large statistically significant drop in this number is an indicator of performance deterioration. 
However, as we have observed time and again, a small change here might not directly imply an actual degradation in production.\u00a0<\/li>\n\n\n\n<li><strong>Semantic alignment:<\/strong> The fine\u2011tuned model should generate issues whose meaning closely matches the production issues extracted by the large LLMs. We quantify this by computing the similarity (e.g., cosine similarity) between the embedding of the production issue and the embedding of the corresponding issue produced by the fine\u2011tuned model.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Online Metrics: <\/strong>These metrics are measured via an A\/B test in production:<strong>&nbsp;<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Resolution rates<\/strong>: This is the foundational metric that directly impacts Fin\u2019s bottom line. No matter how well the model behaves offline, if it significantly deteriorates this metric, it is not a success. This metric can be split into\n<ul class=\"wp-block-list\">\n<li>Hard resolutions: Resolutions where the end user acknowledges that the answer actually solved the problem<\/li>\n\n\n\n<li>Soft resolutions: Resolutions where there is no explicit acknowledgement or <em>positive<\/em> feedback from the user.&nbsp;<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>CSAT<\/strong>: Customer satisfaction (CSAT) score indicates the quality of what Fin provides.&nbsp;<\/li>\n\n\n\n<li><strong>Latency: <\/strong>Latency has been an important metric to track for our product experience. <a href=\"https:\/\/fin.ai\/research\/does-slower-seem-smarter-rethinking-latency-in-ai-agents\/\">We are constantly trying to bring this metric<\/a> down, allowing end users to experience a seamless low latency interaction [5]. 
We want to make sure that, at the very least, this number remains the same.&nbsp;<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Cost: <\/strong>Often talked about in terms of amortised cost per token generated, this metric is an important one to track, especially for custom fine-tuned models. Comparable online performance at 2x the cost is not a success.&nbsp;<\/li>\n<\/ul>\n\n\n\n<h2 id=\"training-an-issue-classifier-model\" class=\"wp-block-heading\">Training an Issue classifier model&nbsp;<\/h2>\n\n\n\n<p>For our issue classifier, we started by curating training data from our original LLM-based system, which was designed to both detect and extract issues in conversations. Here\u2019s what the new setup does:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We encode an input with ModernBERT<\/li>\n\n\n\n<li>A simple linear layer with sigmoid predicts the binary label: informational or non-informational<\/li>\n\n\n\n<li>The model is trained with binary cross-entropy loss<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"445\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-11-2-1024x445.png\" alt=\"\" class=\"wp-image-390\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-11-2-1024x445.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-11-2-300x130.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-11-2-768x334.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-11-2-1536x667.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-11-2-2048x890.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2412.13663\">ModernBERT<\/a> is a newer flavor of encoder-only transformer based models, and it outperforms almost all BERT-like models on retrieval and classification tasks [4]. 
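<\/p>\n\n\n\n<p><em>The classifier head described above, a linear layer plus sigmoid trained with binary cross-entropy, can be sketched as follows. The random vector stands in for ModernBERT&#8217;s pooled sentence embedding; this is an illustration, not our training code:<\/em><\/p>\n\n\n\n

```python
import math
import random

# Sketch of the issue classifier head: a single linear layer + sigmoid on top
# of an encoder embedding, trained with binary cross-entropy. The random
# `embedding` is a stand-in for the pooled output a ModernBERT encoder
# would produce; this is purely illustrative.

random.seed(0)
HIDDEN = 768                                              # typical encoder hidden size
embedding = [random.gauss(0, 1) for _ in range(HIDDEN)]   # fake encoder output
weights = [random.gauss(0, 0.02) for _ in range(HIDDEN)]  # linear layer weights
bias = 0.0

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(emb) -> float:
    # Linear layer + sigmoid: probability the message is informational.
    return sigmoid(sum(e * w for e, w in zip(emb, weights)) + bias)

def bce_loss(p: float, y: int) -> float:
    # Binary cross-entropy for one example with label y in {0, 1}.
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

p = predict(embedding)
loss = bce_loss(p, 1)  # label 1 = informational
```

\n\n\n\n<p>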
As you can see in our blogs on <a href=\"https:\/\/fin.ai\/research\/finetuning-retrieval-for-fin\/\">retrieval<\/a>, <a href=\"https:\/\/fin.ai\/research\/how-we-built-a-world-class-reranker-for-fin\/\">reranker<\/a>, <a href=\"https:\/\/fin.ai\/research\/was-that-helpful-understanding-user-feedback-in-customer-support-ai-agents\/\">parsing feedback<\/a> and <a href=\"https:\/\/fin.ai\/research\/to-escalate-or-not-to-escalate-that-is-the-question\/\">escalation detection<\/a>, ModernBERT works really well for routing and classification tasks once you fine-tune it on the right data.<br>For the issue classification task, we trained the model on 1M examples and ModernBERT achieved a remarkable <strong>0.995 AUC score<\/strong>. When ModernBERT&#8217;s results didn&#8217;t match the ground truth, it was mostly the teacher&#8217;s mistakes, not the student model&#8217;s.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"377\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/09\/Screenshot_transparent_allwhite_002024-1-1024x377.png\" alt=\"\" class=\"wp-image-479\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/09\/Screenshot_transparent_allwhite_002024-1-1024x377.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/09\/Screenshot_transparent_allwhite_002024-1-300x111.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/09\/Screenshot_transparent_allwhite_002024-1-768x283.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/09\/Screenshot_transparent_allwhite_002024-1-1536x566.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/09\/Screenshot_transparent_allwhite_002024-1-1320x486.png 1320w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/09\/Screenshot_transparent_allwhite_002024-1.png 1639w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 id=\"fine-tuning-the-issue-extraction-model\" 
class=\"wp-block-heading\">Fine-tuning the issue extraction model<\/h2>\n\n\n\n<p>Fine-tuning a generative language model means taking an open-source model \u2013 one that is already trained on trillions of tokens from the internet and achieves a baseline level of performance on a diverse set of tasks \u2013 and changing its weights slightly to optimise it for a particular task.&nbsp;<\/p>\n\n\n\n<p>We use a specific way of fine-tuning called Low-Rank Adaptation (LoRA) [1][2], which is an extremely parameter-efficient way of fine-tuning language models. LoRA freezes the large model\u2019s original weights and learns only two much smaller, low-<a href=\"https:\/\/en.wikipedia.org\/wiki\/Rank_(linear_algebra)\">rank<\/a> matrices per targeted layer. This setup slashes trainable parameters by orders of magnitude while preserving quality.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"430\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-9-1-1024x430.png\" alt=\"\" class=\"wp-image-386\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-9-1-1024x430.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-9-1-300x126.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-9-1-768x323.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-9-1-1536x645.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/image-9-1.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center has-2-xs-font-size\"><em>The blue blocks in the figure above visually describe what we actually train instead of the model weights (W). For each layer of the model, we inject <strong>two low-rank matrices A and B, whose product can be added to W once the training is complete. 
<\/strong>At train time, we optimize the weights of these two matrices instead of W itself.&nbsp;<\/em><\/p>\n\n\n\n<p>A good introduction to LoRA can be found <a href=\"https:\/\/magazine.sebastianraschka.com\/p\/practical-tips-for-finetuning-llms?utm_source=chatgpt.com\">here<\/a>. These LoRA adapters are like Lego bricks that can be attached to the original un-tuned model to achieve optimised performance on a specific task.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Data&nbsp;<\/h3>\n\n\n\n<p><strong>We curate data from customers who have agreed to the use of their data for training.&nbsp;<\/strong><\/p>\n\n\n\n<p>The data is curated using the following conditions:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Conversation must have happened within the past 2 months<\/li>\n\n\n\n<li>Customers with an account in the US, with conversation locale set as \u201cEnglish\u201d<sup data-fn=\"f3915218-0ab2-405e-865a-3cc985d8702f\" class=\"fn\"><a id=\"f3915218-0ab2-405e-865a-3cc985d8702f-link\" href=\"#f3915218-0ab2-405e-865a-3cc985d8702f\">1<\/a><\/sup><\/li>\n\n\n\n<li>The conversation must have had an issue according to the older issue detection and extraction prompt.<\/li>\n<\/ul>\n\n\n\n<p>The data is cleaned for obvious hygiene issues, and then anonymised by redacting any mention of emails, addresses, account numbers, phone numbers, names, places, organisations etc.<\/p>\n\n\n\n<p>The resulting data contains 60 thousand training samples and 10 thousand validation samples.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Experiments&nbsp;<\/h3>\n\n\n\n<p>Before arriving at a final A\/B testable model, we tested several variants of open-source models, starting from a lightweight Gemma 8b and Qwen3 8b, and finally getting a respectable result from the Qwen3 14b variant. 
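<\/p>\n\n\n\n<p><em>The LoRA update described above can be sketched numerically as follows. The shapes are toy-sized for illustration; this is not our actual training code:<\/em><\/p>\n\n\n\n

```python
import random

# Toy numeric sketch of LoRA: freeze the original weight matrix W and learn
# two low-rank matrices, A (r x d_in) and B (d_out x r). The effective weight
# is W + B @ A (in practice scaled by alpha / r). Shapes here are tiny, purely
# for illustration; real layers are orders of magnitude larger.

random.seed(0)
d_in, d_out, r = 8, 8, 2  # rank r << d_in is what gives the parameter savings

def mat(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = mat(d_out, d_in)   # frozen pretrained weights (never updated)
A = mat(r, d_in)       # trainable low-rank factor
B = mat(d_out, r)      # trainable low-rank factor

delta = matmul(B, A)   # low-rank update, same shape as W
W_eff = [[W[i][j] + delta[i][j] for j in range(d_in)] for i in range(d_out)]

full_params = d_out * d_in          # trainable params without LoRA
lora_params = r * d_in + d_out * r  # trainable params with LoRA; the gap
                                    # grows dramatically at real model sizes
```

\n\n\n\n<p>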
Some offline testing results are shown below:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Model Name<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Semantic Alignment<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Answer Rate<\/strong><\/th><\/tr><\/thead><tbody><tr><td>OpenAI: GPT 4.1&nbsp;(baseline)<\/td><td class=\"has-text-align-center\" data-align=\"center\">N\/A<\/td><td class=\"has-text-align-center\" data-align=\"center\">63.3%<\/td><\/tr><tr><td>Gemma 8b<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.850<\/td><td class=\"has-text-align-center\" data-align=\"center\">51.0%<\/td><\/tr><tr><td>Qwen 3 8b<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.900<\/td><td class=\"has-text-align-center\" data-align=\"center\">55.0%<\/td><\/tr><tr><td>Qwen 3 14b (only hard resolutions)<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.930<\/td><td class=\"has-text-align-center\" data-align=\"center\">36.4%<\/td><\/tr><tr><td>Qwen 3 14b<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>0.938<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>63.4%<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The fine-tuned Qwen3 14b model seemed the most competent candidate of all, performing on par with our baseline on answer rates. It is worth noting that we trained another variant of Qwen3 14b on just hard resolutions. The results were interesting to say the least: despite learning to produce highly semantically aligned issues, the end-to-end answer rate performance was very poor. Upon closer examination, it seemed like the model only learned to produce an issue summary if the issue was extremely clear from the conversation, and refrained from producing any tokens when the conversation was ambiguous. 
This odd behaviour shows the importance of data curation in such projects.\u00a0<\/p>\n\n\n\n<h2 id=\"results\" class=\"wp-block-heading\">Results<\/h2>\n\n\n\n<p>The A\/B tests were done in two phases, since we now have two tuned models active in tandem instead of the incumbent single LLM call to the large model provider.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Issue Detector<\/h3>\n\n\n\n<p>We ran this A\/B test before the issue extractor one, collecting enough data for a statistically significant readout.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-1 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Metric Name<\/strong><\/th><th><strong>Difference in Treatment<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Answer rate<\/td><td>-0.5 percentage points (pp)&nbsp;&nbsp;<\/td><\/tr><tr><td>P50 latency<\/td><td>-100 ms<\/td><\/tr><tr><td>CSAT<\/td><td>0<\/td><\/tr><tr><td>Cost&nbsp;<\/td><td>-5%<\/td><\/tr><\/tbody><\/table><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p class=\"has-text-align-left\">The results of the issue detector model were promising, with a slight decrease in answer rates but almost no impact on other online metrics. Moreover, since we are no longer using an LLM for the more nuanced task of detecting whether there is an addressable issue, this also results in a 5% reduction in cost.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Issue Extractor<\/h3>\n\n\n\n<p>Once the issue detector was shipped to production, we ran an A\/B test with the winning candidate model from the offline tests for a week. 
This was enough time to collect data for statistically significant results.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Metric Name<\/strong><\/th><th><strong>Difference in Treatment<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Answer rate<\/td><td>-0.1pp&nbsp;<\/td><\/tr><tr><td>P50 latency<\/td><td>+100 ms<\/td><\/tr><tr><td>CSAT<\/td><td>0<\/td><\/tr><tr><td>Cost&nbsp;<\/td><td>-12.5%<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The overall answer rate dropped by 0.1pp. However, we saw no evidence of negative impact on other online metrics, except a slight 100 ms increase in P50 end-to-end latency. The biggest win here was the 12.5% relative reduction in cost per transaction.<\/p>\n\n\n\n<h2 id=\"discussion-and-conclusion\" class=\"wp-block-heading\">Discussion and Conclusion<\/h2>\n\n\n\n<p>The fine-tuned models delivered a substantial reduction in costs, while being competitive with state-of-the-art models for this particular task of issue detection and extraction. We are seeing three distinct qualitative impacts on Fin\u2019s performance:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The Issue summarizer model can now focus only on summarizing issues, making the prompt much shorter. This decoupling of detection from extraction also stabilizes the prompt, as we no longer need to keep adding examples of non-issues.<\/li>\n\n\n\n<li>The new issue detection model is much more precise, removing inauthentic soft resolutions. It handles edge cases much better and hallucinates less in simple cases.<\/li>\n\n\n\n<li>The issue detector gives a probability for an issue, so we can tune exactly how many informational vs non-informational queries we want, just by tweaking the threshold. 
You can&#8217;t get this kind of control with vendor-hosted LLMs, like the ones from OpenAI or Anthropic.<\/li>\n<\/ul>\n\n\n\n<p>While the newer approach with a smaller fine-tuned 14B LLM is significantly cheaper per transaction, there might be more gains to be had in terms of impact on resolutions with further iterations. There are currently two running hypotheses to explore.<\/p>\n\n\n\n<p><strong>Impact of anonymisation: <\/strong>Protecting customer trust is paramount for us. Since generative models generate tokens, training on customer data means taking the utmost care to anonymise PII. To that end, in this first attempt we took an extra-conservative approach by redacting every PII entity. There is a chance that this approach has negatively impacted the performance of the fine-tuned LLMs in the wild, because we are redacting important context at training time. We plan to improve on this approach and build a secure yet performant anonymisation strategy, with contextual replacement of PII instead of just redaction.<\/p>\n\n\n\n<p><strong>Impact of model size: <\/strong>The goal here was to find an optimally sized model that is light enough to minimise training\/inference cost\/infra, but large enough to be able to adapt to the task\u2019s complexity. The 14B model was the <em>first <\/em>feasible model that we found to be competent in offline tests, and hence progressed towards an A\/B test. However, research does suggest that a model&#8217;s competency to solve complex tasks scales with model parameter size [6]. 
Therefore, we think it is definitely worth exploring larger models for such tasks.<\/p>\n\n\n\n<p>In conclusion, this exercise provides strong evidence for deploying fine-tuned models for Intercom scale tasks.&nbsp;<br><\/p>\n\n\n\n<h2 id=\"citations\" class=\"wp-block-heading\">Citations&nbsp;<\/h2>\n\n\n\n<p>[1] <a href=\"https:\/\/arxiv.org\/pdf\/2106.09685\">https:\/\/arxiv.org\/pdf\/2106.09685<\/a><\/p>\n\n\n\n<p>[2] <a href=\"https:\/\/magazine.sebastianraschka.com\/p\/practical-tips-for-finetuning-llms?utm_source=chatgpt.com\">https:\/\/magazine.sebastianraschka.com\/p\/practical-tips-for-finetuning-llms<\/a><\/p>\n\n\n\n<p>[3] <a href=\"https:\/\/en.wikipedia.org\/wiki\/Rank_(linear_algebra)\">https:\/\/en.wikipedia.org\/wiki\/Rank_(linear_algebra)<\/a>&nbsp;<\/p>\n\n\n\n<p>[4] <a href=\"https:\/\/arxiv.org\/pdf\/2412.13663\">https:\/\/arxiv.org\/pdf\/2412.13663<\/a>&nbsp;<\/p>\n\n\n\n<p>[5] <a href=\"https:\/\/fin.ai\/research\/does-slower-seem-smarter-rethinking-latency-in-ai-agents\/\">https:\/\/fin.ai\/research\/does-slower-seem-smarter-rethinking-latency-in-ai-agents\/<\/a>&nbsp;<\/p>\n\n\n\n<p>[6] <a href=\"https:\/\/arxiv.org\/pdf\/2001.08361\">https:\/\/arxiv.org\/pdf\/2001.08361<\/a><\/p>\n\n\n\n<h2 id=\"appendix-are-fine-tuned-models-learning-a-new-skill\" class=\"wp-block-heading\">Appendix:&nbsp;Are fine-tuned models learning a new skill?<\/h2>\n\n\n\n<p>While the offline and online performance of these tuned language models does suggest that there is a lot of value in going through the fine-tuning process, one might ask: what is an unequivocal sign that the fine-tuning process is improving the model&#8217;s performance at a specific task, compared against the original off-the-shelf base model? 
After all, the original off-the-shelf base model is also trained on trillions of tokens from the internet, and may contain the intrinsic intelligence to solve the task out of the box.&nbsp;<\/p>\n\n\n\n<p>The most obvious way is to compare the offline metrics before and after fine-tuning:&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Model Name<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Semantic Alignment<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Answer Rate<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Qwen 3 14B base<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.780<\/td><td class=\"has-text-align-center\" data-align=\"center\">18.0%<\/td><\/tr><tr><td>Qwen 3 14B fine-tuned<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.938<\/td><td class=\"has-text-align-center\" data-align=\"center\">63.4%<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>We see that both average semantic alignment and answer rates take a massive hit when using the off-the-shelf base model. 
In particular, the answer rate drop shows that the off-the-shelf model lacks the competency to extract usable queries for Fin\u2019s RAG pipeline.&nbsp;<br><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"789\" height=\"590\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/perplexity.png\" alt=\"\" class=\"wp-image-376\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/perplexity.png 789w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/perplexity-300x224.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/07\/perplexity-768x574.png 768w\" sizes=\"auto, (max-width: 789px) 100vw, 789px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center has-2-xs-font-size\"><em>Average perplexity computed on the generated tokens for the issue extraction task.<\/em> <em>Lower perplexity implies that the model is more sure of the tokens it generates for the same input context.<\/em> <\/p>\n\n\n\n\n\n\n<p>Another definitive sign of learning is a metric that is directly linked to the loss the model optimizes at training time. This metric is called perplexity.&nbsp;<\/p>\n\n\n\n<p>All LLMs are trying to predict the next token, given all the tokens they have seen up to that point. They do this by predicting a vector of probabilities over all the tokens in their vocabulary, and then choosing the next token based on those probabilities. These models learn by optimising those probabilities using a specific loss function called the cross-entropy loss. 
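<\/p>\n\n\n\n<p><em>A minimal sketch of computing perplexity, the exponentiated average negative log-probability of the generated tokens, from per-token probabilities. The probability lists are made-up numbers purely for illustration:<\/em><\/p>\n\n\n\n

```python
import math

# Sketch of perplexity: exponentiate the average negative log-probability
# (i.e., the average cross-entropy) the model assigned to the tokens it
# generated. The lists below are made-up stand-ins for p(t_i | t_<i).

def perplexity(token_probs):
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n  # average cross-entropy
    return math.exp(avg_nll)

confident_probs = [0.9, 0.8, 0.95, 0.85]  # a model sure of its own tokens
unsure_probs = [0.2, 0.1, 0.3, 0.25]      # a model surprised by its own output
```

\n\n\n\n<p>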
Perplexity is the exponentiated average cross-entropy, computed over all the generated tokens given the input context (in this case, the anonymised chat history).&nbsp;<\/p>\n\n\n\n<p>$$<br>\\text{Perplexity}(T)=\\exp\\left(-\\frac{1}{N}<br>\\sum_{i=1}^{N}\\log p \\bigl(t_i \\mid t_{&lt;i}\\bigr)\\right)<br>$$<\/p>\n\n\n\n<p>The perplexity metric quantifies how surprised the model is to see the next token t<sub>i<\/sub>, given that it has seen all the previous tokens t<sub>&lt;i<\/sub>. When we evaluate this metric on the generated tokens for the off-the-shelf Qwen 3 14B base model and for the LoRA fine-tuned model, we see that the fine-tuned model shows substantially lower perplexity on the generated tokens. Both results indeed confirm that fine-tuning helps models acquire competency in specific tasks.&nbsp;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n<ol class=\"wp-block-footnotes\"><li id=\"f3915218-0ab2-405e-865a-3cc985d8702f\">English is the most represented language with Fin, comprising about 80% of our traffic. Limiting to the English language further narrows the problem, and controls for performance issues due to imbalanced language representation. We plan to explore multi-lingual support in the future <a href=\"#f3915218-0ab2-405e-865a-3cc985d8702f-link\" aria-label=\"Jump to footnote reference 1\">\u21a9\ufe0e<\/a><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>Are smaller fine-tuned LLMs competent for Intercom scale tasks? Large Language Models (LLMs) are a powerful technology that has turned reasoning in natural language into a service. 
They&#8217;ve had a huge impact on customer support, powering&hellip;<\/p>\n","protected":false},"author":12,"featured_media":170,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"[{\"id\":\"f3915218-0ab2-405e-865a-3cc985d8702f\",\"content\":\"English is the most represented language with Fin, comprising about 80% of our traffic. Limiting to English language further narrows the problem, and controls for performance issues due to imbalanced language representation. We plan to explore multi-lingual in the future\"}]"},"categories":[11,13],"tags":[],"coauthors":[4,24],"class_list":["post-359","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-benchmarking-testing","category-llm-evaluation"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v24.6 (Yoast SEO v24.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>David vs Goliath: are small LLMs any good? 
- \/research<\/title>\n<meta name=\"description\" content=\"Are smaller fine-tuned LLMs competent for Intercom scale tasks?\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"David vs Goliath: are small LLMs any good?\" \/>\n<meta property=\"og:description\" content=\"Are smaller fine-tuned LLMs competent for Intercom scale tasks?\" \/>\n<meta property=\"og:url\" content=\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/\" \/>\n<meta property=\"og:site_name\" content=\"\/research\" \/>\n<meta property=\"article:published_time\" content=\"2025-09-11T22:45:50+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-09-12T09:37:30+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1344\" \/>\n\t<meta property=\"og:image:height\" content=\"896\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Sagar Joglekar, Ramil Yarullin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@intercom\" \/>\n<meta name=\"twitter:site\" content=\"@intercom\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sagar Joglekar, Ramil Yarullin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/\"},\"author\":{\"name\":\"Sagar Joglekar\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/991e7364de666f4aac22d10ca56e03a5\"},\"headline\":\"David vs Goliath: are small LLMs any good?\",\"datePublished\":\"2025-09-11T22:45:50+00:00\",\"dateModified\":\"2025-09-12T09:37:30+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/\"},\"wordCount\":2709,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\",\"articleSection\":[\"Benchmarking &amp; Testing\",\"LLM Evaluation\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/\",\"url\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/\",\"name\":\"David vs Goliath: are small LLMs any good? 
- \/research\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\",\"datePublished\":\"2025-09-11T22:45:50+00:00\",\"dateModified\":\"2025-09-12T09:37:30+00:00\",\"description\":\"Are smaller fine-tuned LLMs competent for Intercom scale tasks?\",\"breadcrumb\":{\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#primaryimage\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\",\"width\":1344,\"height\":896},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/fin.ai\/research\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"David vs Goliath: are small LLMs any good?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/fin.ai\/research\/#website\",\"url\":\"https:\/\/fin.ai\/research\/\",\"name\":\"Intercom.ai\",\"description\":\"Insights and blogs from the AI Group building Fin at 
Intercom\",\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/fin.ai\/research\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/fin.ai\/research\/#organization\",\"name\":\"Intercom.ai\",\"url\":\"https:\/\/fin.ai\/research\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"width\":1024,\"height\":1024,\"caption\":\"Intercom.ai\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/intercom\",\"https:\/\/www.linkedin.com\/company\/intercom\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/991e7364de666f4aac22d10ca56e03a5\",\"name\":\"Sagar Joglekar\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/image\/3f1e00be2ba4904e967626dfad7c94b4\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1471bc501d8c6547f9f37144c38974b9?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1471bc501d8c6547f9f37144c38974b9?s=96&d=mm&r=g\",\"caption\":\"Sagar Joglekar\"},\"description\":\"is a Senior Machine Learning Scientist at Intercom with over 10 years of experience in applied research, data science, and software engineering.\",\"sameAs\":[\"https:\/\/sagarjoglekar.com\/\"],\"url\":\"https:\/\/fin.ai\/research\/author\/sagarjoglekar\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"David vs Goliath: are small LLMs any good? - \/research","description":"Are smaller fine-tuned LLMs competent for Intercom scale tasks?","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/","og_locale":"en_US","og_type":"article","og_title":"David vs Goliath: are small LLMs any good?","og_description":"Are smaller fine-tuned LLMs competent for Intercom scale tasks?","og_url":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/","og_site_name":"\/research","article_published_time":"2025-09-11T22:45:50+00:00","article_modified_time":"2025-09-12T09:37:30+00:00","og_image":[{"width":1344,"height":896,"url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png","type":"image\/png"}],"author":"Sagar Joglekar, Ramil Yarullin","twitter_card":"summary_large_image","twitter_creator":"@intercom","twitter_site":"@intercom","twitter_misc":{"Written by":"Sagar Joglekar, Ramil Yarullin","Est. 
reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#article","isPartOf":{"@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/"},"author":{"name":"Sagar Joglekar","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/991e7364de666f4aac22d10ca56e03a5"},"headline":"David vs Goliath: are small LLMs any good?","datePublished":"2025-09-11T22:45:50+00:00","dateModified":"2025-09-12T09:37:30+00:00","mainEntityOfPage":{"@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/"},"wordCount":2709,"commentCount":0,"publisher":{"@id":"https:\/\/fin.ai\/research\/#organization"},"image":{"@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#primaryimage"},"thumbnailUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png","articleSection":["Benchmarking &amp; Testing","LLM Evaluation"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/","url":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/","name":"David vs Goliath: are small LLMs any good? 
- \/research","isPartOf":{"@id":"https:\/\/fin.ai\/research\/#website"},"primaryImageOfPage":{"@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#primaryimage"},"image":{"@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#primaryimage"},"thumbnailUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png","datePublished":"2025-09-11T22:45:50+00:00","dateModified":"2025-09-12T09:37:30+00:00","description":"Are smaller fine-tuned LLMs competent for Intercom scale tasks?","breadcrumb":{"@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#primaryimage","url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png","contentUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png","width":1344,"height":896},{"@type":"BreadcrumbList","@id":"https:\/\/fin.ai\/research\/david-vs-goliath-are-small-llms-any-good\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/fin.ai\/research\/"},{"@type":"ListItem","position":2,"name":"David vs Goliath: are small LLMs any good?"}]},{"@type":"WebSite","@id":"https:\/\/fin.ai\/research\/#website","url":"https:\/\/fin.ai\/research\/","name":"Intercom.ai","description":"Insights and blogs from the AI Group building Fin at 
Intercom","publisher":{"@id":"https:\/\/fin.ai\/research\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/fin.ai\/research\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/fin.ai\/research\/#organization","name":"Intercom.ai","url":"https:\/\/fin.ai\/research\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/","url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png","contentUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png","width":1024,"height":1024,"caption":"Intercom.ai"},"image":{"@id":"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/intercom","https:\/\/www.linkedin.com\/company\/intercom"]},{"@type":"Person","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/991e7364de666f4aac22d10ca56e03a5","name":"Sagar Joglekar","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/image\/3f1e00be2ba4904e967626dfad7c94b4","url":"https:\/\/secure.gravatar.com\/avatar\/1471bc501d8c6547f9f37144c38974b9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1471bc501d8c6547f9f37144c38974b9?s=96&d=mm&r=g","caption":"Sagar Joglekar"},"description":"is a Senior Machine Learning Scientist at Intercom with over 10 years of experience in applied research, data science, and software 
engineering.","sameAs":["https:\/\/sagarjoglekar.com\/"],"url":"https:\/\/fin.ai\/research\/author\/sagarjoglekar\/"}]}},"_links":{"self":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/359","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/comments?post=359"}],"version-history":[{"count":0,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/359\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media\/170"}],"wp:attachment":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media?parent=359"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/categories?post=359"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/tags?post=359"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/coauthors?post=359"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}