Cost of Serving LLMs

Stefan Ivanovici
2025.09.10

TLDR: we explored AWS hardware options (and serving engines), and found that self-hosting LLMs can be significantly more cost-effective than using commercial APIs.

Introduction

Fin is an advanced customer support AI agent powered by a combination of large language models (LLMs) and retrieval-augmented generation (RAG) techniques. This hybrid engine enables Fin to deliver accurate, context-aware responses. However, Fin often relies on generalist LLMs provided by vendors such as OpenAI and Anthropic—solutions that can be both expensive and slow due to the size and complexity of the models.

We wanted to explore whether certain tasks currently handled by these large external models could instead be managed by smaller, fine-tuned LLMs that we host ourselves. Our goal was to evaluate the feasibility and cost-effectiveness of serving these models on our own AWS infrastructure.

This analysis involved:

  • Exploring AWS hardware options and serving engines
  • Benchmarking various open-weight models
  • Estimating cost differences between self-hosting and API usage

Cost Comparison: Hosted vs. Provider

Comparing the cost of self-hosting LLMs versus using provider APIs is not straightforward. The pricing models differ fundamentally:

  • Self-hosting: You pay for infrastructure (e.g., GPU instances) over time.
  • Provider APIs: You pay per token processed, with separate pricing for prompt, completion, and cached tokens.

In AWS, costs vary significantly depending on whether instances are reserved in advance or acquired on demand—reservation can be 2–3× cheaper. To fairly compare costs, we estimated the number of reserved instances required to serve our existing traffic reliably.

Instead of using “requests per second” (common for traditional web services), we measured tokens per second (TPS), as token usage can vary greatly per request in LLM services.

We used two key data points:

  • Maximum sustainable token throughput per instance per model
  • Peak token usage per minute from historical Fin traffic

This allowed us to estimate the number of nodes or GPUs needed per feature and model, and ultimately the cost to serve each model.
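
To make that sizing step concrete, here is a minimal sketch of the calculation. The numbers are illustrative rather than our production figures, and the 80% utilisation headroom is an assumption added for the example, not a value from our benchmarks.

```python
import math

def estimate_monthly_cost(
    peak_tokens_per_minute: float,  # peak token usage from historical traffic
    max_tps_per_instance: float,    # max sustainable tokens/sec for this model on this instance
    instance_cost_per_hour: float,  # reserved-instance hourly price
    headroom: float = 0.8,          # assumed: keep instances at ~80% of max throughput
) -> tuple[int, float]:
    """Return (instances needed, monthly cost) for one model/feature pair."""
    peak_tps = peak_tokens_per_minute / 60
    usable_tps_per_instance = max_tps_per_instance * headroom
    instances = math.ceil(peak_tps / usable_tps_per_instance)
    monthly_cost = instances * instance_cost_per_hour * 24 * 30
    return instances, monthly_cost

# Illustrative numbers only, not the figures from our benchmarks.
instances, cost = estimate_monthly_cost(
    peak_tokens_per_minute=2_000_000,
    max_tps_per_instance=34_000,
    instance_cost_per_hour=2.37,
)
print(f"{instances} instance(s), ~${cost:,.0f}/month")
```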

Benchmarking Setup

We benchmarked a variety of open-weight LLMs across multiple AWS hardware configurations. The models included:

  • Qwen 3: 14B, 30B, 235B
  • Gemma 3: 4B, 27B
  • DeepSeek Llama: 8B, 70B (distilled)

We tested these models using real Fin prompts, organised into three datasets categorised by prompt complexity and target model size.

Hardware tested:

For simplicity and flexibility, we ran the benchmarks on EC2 instances:

  • g6e (L40s GPUs)
  • p4d/p4de (A100) 
  • p5/p5e (H100/H200)

Inference engines tested:

Many inference engines are available today, but we focused on the most popular ones: vLLM, SGLang, and TensorRT-LLM. However, because TensorRT-LLM did not support the Qwen 3 and Gemma 3 models at the time, we went forward only with vLLM and SGLang. Both engines showed comparable performance across configurations using default settings.

Latency constraints:

  • Time to First Token (TTFT): ≤ 250ms
  • Inter-Token Latency (ITL): ≤ 30ms

These targets were based on Fin’s production latency requirements. Our goal was to maximise TPS without exceeding these latency thresholds.
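
For reference, the sketch below shows one way to measure TTFT and ITL against an OpenAI-compatible streaming endpoint, such as the servers vLLM and SGLang can expose. The URL, API key and model name are placeholders, and treating each streamed chunk as one token is an approximation.

```python
import time

from openai import OpenAI  # pip install openai

# Both vLLM and SGLang can expose an OpenAI-compatible HTTP server.
# Endpoint, key and model below are placeholders for whatever the engine serves.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_latency(prompt: str, model: str = "Qwen/Qwen3-14B") -> tuple[float, float]:
    """Return (TTFT, mean ITL) in seconds for a single streamed request."""
    start = time.perf_counter()
    arrival_times = []
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=256,
    )
    for chunk in stream:
        # Skip chunks with no generated text (e.g. the initial role-only delta).
        if chunk.choices and chunk.choices[0].delta.content:
            arrival_times.append(time.perf_counter())
    ttft = arrival_times[0] - start
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

ttft, itl = measure_latency("Summarise this customer conversation: ...")
print(f"TTFT: {ttft * 1000:.0f} ms, ITL: {itl * 1000:.1f} ms")
```

A real benchmark run issues many concurrent requests and increases load until one of the latency thresholds is breached; the snippet above only measures a single request.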

Results

We compared our internal serving cost estimates against popular provider models:

  • GPT-4.1 mini
  • GPT-4.1
  • Anthropic Sonnet 3.7

We calculated monthly token usage per dataset (prompt, output, and cached tokens) to estimate provider costs. We then benchmarked each model across all hardware configurations using both vLLM and SGLang.
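
As an illustration of how such a comparison comes together, the sketch below combines per-million-token API prices with monthly token counts; every price and volume shown is a placeholder, not an actual provider rate or our real traffic.

```python
def provider_monthly_cost(
    prompt_tokens: float,
    cached_tokens: float,
    output_tokens: float,
    price_per_m_prompt: float,
    price_per_m_cached: float,
    price_per_m_output: float,
) -> float:
    """Monthly API cost given token counts and per-million-token prices."""
    return (
        prompt_tokens / 1e6 * price_per_m_prompt
        + cached_tokens / 1e6 * price_per_m_cached
        + output_tokens / 1e6 * price_per_m_output
    )

# Placeholder prices and volumes -- substitute the provider's current rate card
# and your own monthly token counts.
api_cost = provider_monthly_cost(
    prompt_tokens=5e9, cached_tokens=2e9, output_tokens=1e9,
    price_per_m_prompt=2.00, price_per_m_cached=0.50, price_per_m_output=8.00,
)
self_host_cost = 12_000  # e.g. the reserved-instance estimate from the earlier sketch
print(f"Cost ratio (self-hosted / provider): {self_host_cost / api_cost:.2f}")
```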

Cost Ratio Comparison

Each value is the ratio of our estimated self-hosting cost to the corresponding provider cost; values below 1 mean self-hosting is cheaper.

Model                Cost vs GPT-4.1   Cost vs GPT-4.1 mini   Cost vs Sonnet 3.7
Gemma 3 4B           0.04              0.20                   0.01
DeepSeek Llama 8B    0.05              0.27                   0.01
Qwen 3 14B           0.05              0.27                   0.01
Gemma 3 27B          0.34              1.71                   0.08
Qwen 3 30B           0.42              2.12                   0.10
DeepSeek Llama 70B   1.70              8.49                   1.10
Qwen 3 235B          2.17              10.83                  1.40

Key insight: Models of 14B parameters and below were significantly cheaper to serve than GPT-4.1 mini, while mid-sized models like Gemma 3 27B and Qwen 3 30B were more affordable than GPT-4.1.

Hardware Cost Efficiency (Qwen 3 14B)

Hardware    Cost/hr   Max TPS    TPS per $/hr
1×L40s      $2.37     34,000     14,345.99
4×L40s      $10.48    36,000     3,435.11
1×A100      $1.91     70,000     36,649.21
1×H100      $3.90     135,000    34,615.38

Observations:

  • g6e (L40s): Multi-GPU setups lacked NVLink and relied on PCIe, making scaling inefficient.
  • p4/p5 (A100/H100): NVLink interconnect bandwidth enabled efficient tensor parallelism, boosting both performance and cost-efficiency.
  • A100 80GB: Delivered better throughput and cost-efficiency than the L40s.
  • H100/H200: Though expensive, delivered high throughput and favourable cost/TPS ratios.
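
For completeness, the TPS-per-dollar column in the table above is just the measured throughput divided by the hourly price; the short script below reproduces it (values copied from the table, purely as an illustration).

```python
# Benchmark results for Qwen 3 14B, copied from the table above.
qwen3_14b = {
    "1xL40s": {"cost_per_hr": 2.37, "max_tps": 34_000},
    "4xL40s": {"cost_per_hr": 10.48, "max_tps": 36_000},
    "1xA100": {"cost_per_hr": 1.91, "max_tps": 70_000},
    "1xH100": {"cost_per_hr": 3.90, "max_tps": 135_000},
}

# Throughput per dollar-hour: higher means more tokens for the same spend.
ratios = {hw: r["max_tps"] / r["cost_per_hr"] for hw, r in qwen3_14b.items()}
for hw, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{hw}: {ratio:,.0f} TPS per $/hr")
```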

Conclusion

Our benchmarks showed that self-hosting smaller and medium-sized LLMs can be significantly more cost-effective than relying on commercial APIs—especially with models under 30B parameters. To serve larger models efficiently, further work is needed to optimise engine performance. While EC2 provided flexibility for benchmarking, it’s not ideal for production. We plan to transition to EKS (Elastic Kubernetes Service) and potentially SageMaker HyperPod for live model serving.

About the author

Stefan Ivanovici is a Senior Product Engineer on Intercom's AI Team. Most of his current work involves fine-tuning the LLMs used in Fin and other parts of the Intercom product.
