TL;DR: we explored AWS hardware options and serving engines, and found that self-hosting LLMs can be significantly more cost-effective than using commercial APIs.
Introduction
Fin is an advanced customer support AI agent powered by a combination of large language models (LLMs) and retrieval-augmented generation (RAG) techniques. This hybrid engine enables Fin to deliver accurate, context-aware responses. However, Fin often relies on generalist LLMs provided by vendors such as OpenAI and Anthropic—solutions that can be both expensive and slow due to the size and complexity of the models.
We wanted to explore whether certain tasks currently handled by these large external models could instead be managed by smaller, fine-tuned LLMs that we host ourselves. Our goal was to evaluate the feasibility and cost-effectiveness of serving these models on our own AWS infrastructure.
This analysis involved:
- Exploring AWS hardware options and serving engines
- Benchmarking various open-weight models
- Estimating cost differences between self-hosting and API usage
Cost Comparison: Hosted vs. Provider
Comparing the cost of self-hosting LLMs versus using provider APIs is not straightforward. The pricing models differ fundamentally:
- Self-hosting: You pay for infrastructure (e.g., GPU instances) over time.
- Provider APIs: You pay per token processed, with separate pricing for prompt, completion, and cached tokens.
In AWS, costs vary significantly depending on whether instances are reserved in advance or acquired on demand—reservation can be 2–3× cheaper. To fairly compare costs, we estimated the number of reserved instances required to serve our existing traffic reliably.
Instead of using “requests per second” (common for traditional web services), we measured tokens per second (TPS), as token usage can vary greatly per request in LLM services.
We used two key data points:
- Maximum sustainable token throughput per instance per model
- Peak token usage per minute from historical Fin traffic
This allowed us to estimate the number of nodes or GPUs needed per feature and model, and ultimately the cost to serve each model.
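As a rough illustration, the sizing arithmetic looks like the sketch below. The traffic figure, headroom factor and hourly rate are placeholder assumptions for illustration, not Fin's actual numbers or our real capacity model.

```python
import math

# Placeholder inputs: real values come from the benchmarks and from Fin's
# historical traffic, not from this snippet.
peak_tokens_per_minute = 3_600_000   # peak observed token usage per minute for a feature
max_tps_per_instance = 70_000        # max sustainable tokens/sec per instance (benchmarked)
headroom = 0.7                       # assumed safety margin: run instances at <=70% of max TPS
reserved_cost_per_hour = 1.91        # example reserved-instance rate in USD/hr

required_tps = peak_tokens_per_minute / 60
instances = math.ceil(required_tps / (max_tps_per_instance * headroom))
monthly_cost = instances * reserved_cost_per_hour * 24 * 30

print(f"instances needed: {instances}, est. monthly cost: ${monthly_cost:,.0f}")
```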
Benchmarking Setup
We benchmarked a variety of open-weight LLMs across multiple AWS hardware configurations. The models included:
- Qwen 3: 14B, 30B, 235B
- Gemma 3: 4B, 27B
- DeepSeek Llama: 8B, 70B (distilled)
We tested these models on real Fin prompts, organised into three datasets categorised by prompt complexity and model size.
Hardware tested:
For simplicity and flexibility, we ran the benchmarks on EC2 instances:
- g6e (L40s GPUs)
- p4d/p4de (A100)
- p5/p5e (H100/H200)
Inference engines tested:
Many inference engines are available today, but we focused on the most popular ones: vLLM, SGLang and TensorRT-LLM. Because TensorRT-LLM did not support the Qwen 3 and Gemma 3 models at the time, we proceeded with vLLM and SGLang only. Both engines showed comparable performance across configurations using default settings.
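To give a flavour of how per-instance throughput can be measured, here is a minimal sketch using vLLM's offline Python API. The model name, prompt and batch size are illustrative placeholders; this is not the benchmark harness we used, and it only counts output tokens.

```python
import time
from vllm import LLM, SamplingParams

# Illustrative model and prompt set; swap in the model and dataset under test.
llm = LLM(model="Qwen/Qwen3-14B", tensor_parallel_size=1)
prompts = ["Summarise the customer's question: ..."] * 512  # batch of Fin-style prompts
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests to estimate output-token throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:,.0f} output tokens/sec")
```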
Latency constraints:
- Time to First Token (TTFT): ≤ 250ms
- Inter-Token Latency (ITL): ≤ 30ms
These targets were based on Fin’s production latency requirements. Our goal was to maximise TPS without exceeding these latency thresholds.
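Both vLLM and SGLang expose an OpenAI-compatible streaming endpoint, so TTFT and ITL can be approximated client-side along the lines of the sketch below. The endpoint URL, model name and prompt are placeholders, and streamed chunks only approximate individual tokens.

```python
import time
from openai import OpenAI

# Placeholder endpoint and model; point this at a locally served vLLM/SGLang instance.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
chunk_times = []
stream = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_times.append(time.perf_counter())

# TTFT = time to first streamed chunk; ITL = mean gap between subsequent chunks.
ttft = chunk_times[0] - start
itl = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {itl * 1000:.1f} ms")
```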
Results
We compared our internal serving cost estimates against popular provider models:
- GPT-4.1 mini
- GPT-4.1
- Anthropic Sonnet 3.7
We calculated monthly token usage per dataset (prompt, output, cached) to estimate provider costs. We then benchmarked each model across all hardware configurations using both vLLM and SGLang.
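The provider-side estimate reduces to multiplying monthly token volumes by per-million-token prices. The sketch below shows the shape of that calculation; the price table and volumes are placeholder examples, not our actual figures.

```python
# Example USD prices per 1M tokens: (prompt, cached prompt, completion).
# Placeholder values; use the provider's current price sheet.
PRICES = {
    "gpt-4.1":      (2.00, 0.50, 8.00),
    "gpt-4.1-mini": (0.40, 0.10, 1.60),
}

def monthly_provider_cost(model: str, prompt_m: float, cached_m: float, completion_m: float) -> float:
    """Monthly cost in USD given token volumes in millions of tokens."""
    p_in, p_cached, p_out = PRICES[model]
    return prompt_m * p_in + cached_m * p_cached + completion_m * p_out

# Hypothetical monthly volumes and a hypothetical self-hosted footprint.
api_cost = monthly_provider_cost("gpt-4.1", prompt_m=900, cached_m=300, completion_m=120)
self_hosted_cost = 4 * 1.91 * 24 * 30  # e.g. four reserved instances at $1.91/hr
print(f"cost ratio (self-hosted / API): {self_hosted_cost / api_cost:.2f}")
```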
Cost Ratio Comparison
Model | Cost vs GPT-4.1 | Cost vs GPT-4.1 mini | Cost vs Sonnet 3.7 |
---|---|---|---|
Gemma 3 4B | 0.04 | 0.20 | 0.01 |
DeepSeek Llama 8B | 0.05 | 0.27 | 0.01 |
Qwen 3 14B | 0.05 | 0.27 | 0.01 |
Gemma 3 27B | 0.34 | 1.71 | 0.08 |
Qwen 3 30B | 0.42 | 2.12 | 0.10 |
DeepSeek Llama 70B | 1.70 | 8.49 | 1.10 |
Qwen 3 235B | 2.17 | 10.83 | 1.40 |
Key insight: Smaller models (14B and below) were significantly cheaper to serve than GPT-4.1 mini, while mid-sized models such as Gemma 3 27B and Qwen 3 30B were still cheaper than GPT-4.1.
Hardware Cost Efficiency (Qwen 3 14B)
Hardware | Cost/hr (USD) | Max TPS (tokens/sec) | TPS per $/hr |
---|---|---|---|
1×L40s | $2.37 | 34,000 | 14,345.99 |
4×L40s | $10.48 | 36,000 | 3,435.11 |
1×A100 | $1.91 | 70,000 | 36,649.21 |
1×H100 | $3.90 | 135,000 | 34,615.38 |
Observations:
- g6e (L40s): Multi-GPU setups lacked NVLink and relied on PCIe, making scaling inefficient.
- p4/p5 (A100/H100): NVLink interconnects provided the bandwidth needed for efficient tensor parallelism, boosting both performance and cost-efficiency.
- A100 80GB: Outperformed the L40s on both raw throughput and cost-efficiency.
- H100/H200: Though expensive, delivered high throughput and favourable cost/TPS ratios.
Conclusion
Our benchmarks showed that self-hosting small and medium-sized LLMs can be significantly more cost-effective than relying on commercial APIs, especially for models under 30B parameters. Serving larger models efficiently will require further work to optimise engine performance. While EC2 provided the flexibility we needed for benchmarking, it is not ideal for production; we plan to transition to EKS (Elastic Kubernetes Service), and potentially SageMaker HyperPod, for live model serving.