TL;DR: we explored AWS hardware options and serving engines, and found that self-hosting LLMs can be significantly more cost-effective than using commercial APIs.
Introduction
Fin is an advanced customer support AI agent powered by a combination of large language models (LLMs) and retrieval-augmented generation (RAG) techniques. This hybrid engine enables Fin to deliver accurate, context-aware responses. However, Fin often relies on generalist LLMs provided by vendors such as OpenAI and Anthropic—solutions that can be both expensive and slow due to the size and complexity of the models.
We wanted to explore whether certain tasks currently handled by these large external models could instead be managed by smaller, fine-tuned LLMs that we host ourselves. Our goal was to evaluate the feasibility and cost-effectiveness of serving these models on our own AWS infrastructure.
This analysis involved:
- Exploring AWS hardware options and serving engines
- Benchmarking various open-weight models
- Estimating cost differences between self-hosting and API usage
Cost Comparison: Hosted vs. Provider
Comparing the cost of self-hosting LLMs versus using provider APIs is not straightforward. The pricing models differ fundamentally:
- Self-hosting: You pay for infrastructure (e.g., GPU instances) over time.
- Provider APIs: You pay per token processed, with separate pricing for prompt, completion, and cached tokens.
In AWS, costs vary significantly depending on whether instances are reserved in advance or acquired on demand—reservation can be 2–3× cheaper. To fairly compare costs, we estimated the number of reserved instances required to serve our existing traffic reliably.
Instead of using “requests per second” (common for traditional web services), we measured tokens per second (TPS), as token usage can vary greatly per request in LLM services.
We used two key data points:
- Maximum sustainable token throughput per instance per model
- Peak token usage per minute from historical Fin traffic
This allowed us to estimate the number of nodes or GPUs needed per feature and model, and ultimately the cost to serve each model.
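As a rough illustration, the sizing arithmetic looks like the sketch below. The traffic figure, headroom factor and hourly rate are placeholder assumptions for illustration, not Fin's actual numbers or our real capacity model.

```python
import math

# Placeholder inputs: real values come from the benchmarks and from Fin's
# historical traffic, not from this snippet.
peak_tokens_per_minute = 3_600_000   # peak observed token usage per minute for a feature
max_tps_per_instance = 70_000        # max sustainable tokens/sec per instance (benchmarked)
headroom = 0.7                       # assumed safety margin: run instances at <=70% of max TPS
reserved_cost_per_hour = 1.91        # example reserved-instance rate in USD/hr

required_tps = peak_tokens_per_minute / 60
instances = math.ceil(required_tps / (max_tps_per_instance * headroom))
monthly_cost = instances * reserved_cost_per_hour * 24 * 30

print(f"instances needed: {instances}, est. monthly cost: ${monthly_cost:,.0f}")
```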
Benchmarking Setup
We benchmarked a variety of open-weight LLMs across multiple AWS hardware configurations. The models included:
- Qwen 3: 14B, 30B, 235B
- Gemma 3: 4B, 27B
- DeepSeek Llama: 8B, 70B (distilled)
We tested these models on real Fin prompts, organised into three datasets categorised by prompt complexity and model size.
Hardware tested:
For simplicity and flexibility, we ran the benchmarks on EC2 instances:
- g6e (L40s GPUs)
- p4d/p4de (A100)
- p5/p5e (H100/H200)
Inference engines tested:
Many inference engines are available today, but we focused on the most popular ones: vLLM, SGLang and TensorRT-LLM. Because TensorRT-LLM did not support the Qwen 3 and Gemma 3 models at the time, we proceeded with vLLM and SGLang only. Both engines showed comparable performance across configurations using default settings.
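To give a flavour of how per-instance throughput can be measured, here is a minimal sketch using vLLM's offline Python API. The model name, prompt and batch size are illustrative placeholders; this is not the benchmark harness we used, and it only counts output tokens.

```python
import time
from vllm import LLM, SamplingParams

# Illustrative model and prompt set; swap in the model and dataset under test.
llm = LLM(model="Qwen/Qwen3-14B", tensor_parallel_size=1)
prompts = ["Summarise the customer's question: ..."] * 512  # batch of Fin-style prompts
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests to estimate output-token throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:,.0f} output tokens/sec")
```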
Latency constraints:
- Time to First Token (TTFT): ≤ 250ms
- Inter-Token Latency (ITL): ≤ 30ms
These targets were based on Fin’s production latency requirements. Our goal was to maximise TPS without exceeding these latency thresholds.
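Both vLLM and SGLang expose an OpenAI-compatible streaming endpoint, so TTFT and ITL can be approximated client-side along the lines of the sketch below. The endpoint URL, model name and prompt are placeholders, and streamed chunks only approximate individual tokens.

```python
import time
from openai import OpenAI

# Placeholder endpoint and model; point this at a locally served vLLM/SGLang instance.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
chunk_times = []
stream = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_times.append(time.perf_counter())

# TTFT = time to first streamed chunk; ITL = mean gap between subsequent chunks.
ttft = chunk_times[0] - start
itl = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {itl * 1000:.1f} ms")
```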
Results
We compared our internal serving cost estimates against popular provider models:
- GPT-4.1 mini
- GPT-4.1
- Anthropic Sonnet 3.7
We calculated monthly token usage per dataset (prompt, output, cached) to estimate provider costs. We then benchmarked each model across all hardware configurations using both vLLM and SGLang.
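The provider-side estimate reduces to multiplying monthly token volumes by per-million-token prices. The sketch below shows the shape of that calculation; the price table and volumes are placeholder examples, not our actual figures.

```python
# Example USD prices per 1M tokens: (prompt, cached prompt, completion).
# Placeholder values; use the provider's current price sheet.
PRICES = {
    "gpt-4.1":      (2.00, 0.50, 8.00),
    "gpt-4.1-mini": (0.40, 0.10, 1.60),
}

def monthly_provider_cost(model: str, prompt_m: float, cached_m: float, completion_m: float) -> float:
    """Monthly cost in USD given token volumes in millions of tokens."""
    p_in, p_cached, p_out = PRICES[model]
    return prompt_m * p_in + cached_m * p_cached + completion_m * p_out

# Hypothetical monthly volumes and a hypothetical self-hosted footprint.
api_cost = monthly_provider_cost("gpt-4.1", prompt_m=900, cached_m=300, completion_m=120)
self_hosted_cost = 4 * 1.91 * 24 * 30  # e.g. four reserved instances at $1.91/hr
print(f"cost ratio (self-hosted / API): {self_hosted_cost / api_cost:.2f}")
```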
Cost Ratio Comparison
Model | Cost vs GPT-4.1 | Cost vs GPT-4.1 mini | Cost vs Sonnet 3.7 |
---|---|---|---|
Gemma 3 4B | 0.04 | 0.20 | 0.01 |
DeepSeek Llama 8B | 0.05 | 0.27 | 0.01 |
Qwen 3 14B | 0.05 | 0.27 | 0.01 |
Gemma 3 27B | 0.34 | 1.71 | 0.08 |
Qwen 3 30B | 0.42 | 2.12 | 0.10 |
DeepSeek Llama 70B | 1.70 | 8.49 | 1.10 |
Qwen 3 235B | 2.17 | 10.83 | 1.40 |
Key insight: Smaller models (14B and below) were significantly cheaper to serve than GPT-4.1 mini, while mid-sized models such as Gemma 3 27B and Qwen 3 30B were still cheaper than GPT-4.1.
Hardware Cost Efficiency (Qwen 3 14B)
Hardware | Cost/hr (USD) | Max TPS (tokens/sec) | TPS per $/hr |
---|---|---|---|
1×L40s | $2.37 | 34,000 | 14,345.99 |
4×L40s | $10.48 | 36,000 | 3,435.11 |
1×A100 | $1.91 | 70,000 | 36,649.21 |
1×H100 | $3.90 | 135,000 | 34,615.38 |
Observations:
- g6e (L40s): Multi-GPU setups lacked NVLink and relied on PCIe, making scaling inefficient.
- p4/p5 (A100/H100): NVLink interconnects provided the bandwidth needed for efficient tensor parallelism, boosting both performance and cost-efficiency.
- A100 80GB: Outperformed the L40s on both raw throughput and cost-efficiency.
- H100/H200: Though expensive, delivered high throughput and favourable cost/TPS ratios.
Conclusion
Our benchmarks showed that self-hosting small and medium-sized LLMs can be significantly more cost-effective than relying on commercial APIs, especially for models under 30B parameters. Serving larger models efficiently will require further work to optimise engine performance. While EC2 provided the flexibility we needed for benchmarking, it is not ideal for production; we plan to transition to EKS (Elastic Kubernetes Service), and potentially SageMaker HyperPod, for live model serving.