Vector databases have exploded in the past two years. There are both open-source and managed options, including Pinecone, Milvus, Qdrant, and Weaviate, which often claim greater scale, flexibility, and speed than traditional search platforms.
When we set out to design our AI retrieval systems, we began with first principles, not vendor promises. Our research showed that Elasticsearch struck the right balance of performance, cost, operability, and long-term maintainability – at least for our needs, and given our experience operating it at scale.
In this post, we share our rationale and two years of operational results from running Fin, the market’s most advanced AI agent.
Initial System: In-Memory Retrieval
In early versions, each customer’s content and embeddings lived on S3. For each inference request, we:
- Retrieved the customer’s S3 object
- Loaded all vectors into memory
- Ran a brute-force KNN search
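To make that flow concrete, here is a minimal sketch of the design in Python. The bucket name, key layout, serialization format, and helper names are assumptions for the example, not our actual code.

```python
# Illustrative sketch of the original S3 + in-memory brute-force KNN flow.
# Bucket name, key layout, and the .npz serialization are assumptions.
import io

import boto3
import numpy as np

s3 = boto3.client("s3")


def load_customer_vectors(customer_id: str) -> tuple[np.ndarray, list[str]]:
    """Download and deserialize all of a customer's embeddings from S3."""
    obj = s3.get_object(Bucket="example-embeddings", Key=f"{customer_id}/vectors.npz")
    data = np.load(io.BytesIO(obj["Body"].read()), allow_pickle=True)
    return data["embeddings"], list(data["passage_ids"])


def brute_force_knn(query: np.ndarray, embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact KNN via cosine similarity over every stored vector."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                      # cosine similarity against all vectors
    return np.argsort(-scores)[:k]      # indices of the top-k passages


# Per request: load everything, then search in memory.
embeddings, passage_ids = load_customer_vectors("customer-123")
query_vector = np.random.rand(768).astype(np.float32)  # dummy query embedding
top_passages = [passage_ids[i] for i in brute_force_knn(query_vector, embeddings)]
```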
This design was a good starting point:
- The “database” could handle any number of requests or spikes
- Iteration was simple
- Our scientists could experiment easily, leveraging familiar tools like notebooks and pandas
It was also battle-tested to some degree, having served our previous ‘Resolution Bot’ product (which used much smaller datasets) for years.
But as we added support for new content types and began generating content from conversations, the number of embeddings per customer soared. Soon, S3 download and deserialization times began to dominate request latency: for our largest customers, we saw spikes of 15 seconds just to load the vectors, before the search or LLM inference had even started.
Requirements & Constraints
Prior to committing to a particular technology, we defined concrete constraints:
- Scale: Target of 100M+ embeddings of 768 dimensions, accounting for accelerating growth
- Cost: Consider infrastructure and ongoing operational burden
- Filtering support: Must be able to filter retrieval (by language, content type, audience, etc.) to support our platform and permissions
Other nice-to-haves included:
- Run Less Software: We want to avoid undifferentiated heavy lifting
- Full text search: We might want to move to hybrid search in the future
- Predictability: Known scaling properties, failure modes, and monitoring
Looking back, even our optimistic scale estimate of 100 million embeddings was much too low!
Surveying the Landscape
We evaluated Pinecone, Milvus, Qdrant, Weaviate, and Elasticsearch. All had the basic features we required, so cost initially seemed like the main differentiator. At the time, we had about 20 million embeddings; we projected costs for 5x that scale—100 million embeddings.
Several promising open-source solutions offered theoretical top-end performance, but with potentially increased operability risks. The managed services tended to be more expensive: Pinecone and Milvus, for example, were estimated at roughly 2x the cost of the open-source alternatives.
In the end, the primary differentiator for us was operational familiarity and proven reliability. Elasticsearch, already core to several business-critical Intercom systems, presented clear advantages here.
Benchmarking: Vector Search in Elasticsearch
We began to look at standard benchmarks:
- Elasticsearch nightly vector search benchmarks (2M embeddings, 768D): search latency between 100ms (nightly-so_vector-script-score-query-match-all-latency) and 200ms (nightly-so_vector-script-score-query-acceptedAnswerId-filter71%-latency).
- ANN-Benchmarks: Elasticsearch sat in the center of the pack, neither the fastest nor the slowest.
We then verified these against our in-house benchmarks on production-like datasets, which gave similar results:
| Vectors Searched | Exact KNN Latency | Approximate KNN Latency |
|---|---|---|
| 20,000 | ~20ms | < 10ms |
| 300,000 | ~100ms | < 15ms |
Overall, this performance was good enough for us. For an AI agent, the bottleneck is generally the LLM’s latency, which is measured in seconds. Even if search latency dropped to zero, we would only save ~200ms, which would be imperceptible to the user in most cases.
Decision Criteria: Why We Chose Elasticsearch
Beyond raw benchmark numbers, the following factored heavily in the final decision:
Low Infrastructure Cost
Accounting just for the infrastructure, we estimated that Elasticsearch would cost us the least to run, and that the gap would widen as we scaled up, especially compared to the managed offerings.
Initially, it would be even cheaper to get started, since we planned to run on a shared, large-scale Elasticsearch cluster that Intercom already used for multiple use cases. We could then migrate to separate infrastructure when necessary.
Low Onboarding Cost
Since we would not be running a new distributed system, we didn’t need new runbooks and could avoid common scaling/maintenance issues and failure modes.
We could also reuse most of the tooling around running a database: disaster recovery, snapshots, replication.
Subject-Matter Expertise
Elasticsearch is one of the “core technologies” at Intercom. We have experience running it and subject-matter experts within the company who are familiar with how ES scales. They can proactively manage cluster health and tune capacity or performance using levers such as the number of shards in an index, the size of the data nodes, and so on.
Filterable Hybrid Search
The ability to compose vector search with structured filters using the existing ES query DSL is a requirement most vector-first databases do not easily fulfill. With Elasticsearch, we can combine structured filters and vector search with full-text search to improve the relevance of results in the future.
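As a rough illustration of what that composition could look like, here is a hedged sketch of a single query that applies structured filters, an optional full-text clause, and exact vector scoring. The index name, field names, and score weighting are assumptions for the example, not our production query.

```python
# Sketch: structured filters + full-text matching + exact vector scoring
# in one Elasticsearch query. Index and field names are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query_vector = [0.1] * 768  # embedding of the user's question (placeholder)

response = es.search(
    index="content-embeddings",
    size=10,
    query={
        "script_score": {
            "query": {
                "bool": {
                    # Structured filters: cheap, cacheable, applied first.
                    "filter": [
                        {"term": {"locale": "en"}},
                        {"term": {"audience": "customers"}},
                    ],
                    # Optional full-text clause for hybrid relevance.
                    "should": [{"match": {"body": "how do I reset my password"}}],
                }
            },
            # Blend BM25 (_score) with cosine similarity; the weight is illustrative.
            "script": {
                "source": "_score + 2.0 * (cosineSimilarity(params.qv, 'embedding') + 1.0)",
                "params": {"qv": query_vector},
            },
        }
    },
)
```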
Possibility to move to Approximate KNN in the Future
Since exact KNN performed adequately for us, we could start with it and avoid having to worry about recall, which simplified the migration. In the future, we could move to ANN for faster results.
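For context, switching to ANN in Elasticsearch is largely a mapping and query-shape change: the dense_vector field is indexed (building an HNSW structure) and queried via the knn search option instead of script_score. The sketch below shows what that could look like; the index name, field names, and parameters are illustrative assumptions, not our configuration.

```python
# Sketch of a possible move to approximate KNN (Elasticsearch 8.x).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Indexed dense_vector: ES builds an HNSW structure for approximate search.
es.indices.create(
    index="content-embeddings-ann",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
            }
        }
    },
)

# Approximate search via the top-level knn option, with the same filters.
response = es.search(
    index="content-embeddings-ann",
    knn={
        "field": "embedding",
        "query_vector": [0.1] * 768,   # placeholder query embedding
        "k": 10,
        "num_candidates": 100,
        "filter": [{"term": {"locale": "en"}}],
    },
)
```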
Production Results
Two years later, our scale exceeded even our most ambitious forecasts by 3x—a welcome problem.
| Dimension | Estimate | Current Scale |
|---|---|---|
| Number of Embeddings | 100 Million (768D) | 300 Million (1024D) |
| Cost per month | $4-8k (unblended) | ~$12k (unblended), ~$6k actual (with reservations) |
| Indexing (requests/minute) | – | Stable: ~800, Peak: ~15k |
| Search (requests/minute) | – | Stable: ~5k, Peak: ~14k |
| Search Latency | ~100ms – 200ms | Avg: ~30ms, P90: ~50ms, P99: ~200ms |
Note: The actual number of embeddings we store is ~600 Million, as we are continually experimenting with different embedding models and chunking strategies. The cost above covers this full setup: 600 Million embeddings, replicated once, with raw data and metadata. Since we don’t currently build an ANN index, which usually requires as much memory/disk space as the embeddings themselves, we use 300 Million as the number for a fair cost comparison with other vendors.
This setup costs us at least 3x less than other vendors (comparing just the unblended cost for fairness). For example, Qdrant’s estimate is ~$30k/month (not accounting for the capacity needed for experimentation), and Pinecone would be even costlier.
Specifics of our Architecture
- Architecture
- Data nodes: 9 x i4g.4xlarge
- Client/Master nodes: 3 of each, c6g.xlarge
- Index
- Embeddings are partitioned into 15 indexes, each with 30 primary shards and a replica.
- Refresh Interval: 2 seconds
- The embeddings field is a dense_vector with index: false, so no ANN structure is built (see the sketch after this list).
- Ingestion
- Ingest on every create/update.
- The average request processes 3 pieces of content (articles, files, etc.), which are chunked and bulk-ingested.
- In the future, we can buffer content updates for a few seconds before processing them to improve ingestion efficiency if needed.
- Querying
- Exact vector search using script_score with the vector-field functions (e.g. cosineSimilarity).
- Queries use filters for locale, content visibility for a given user, etc.
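Putting that configuration together, the sketch below shows roughly how one of the partition indexes could be created and written to with the Python client. The index name, field names, and document shape are simplified assumptions, not our production code.

```python
# Rough sketch of one of the 15 partition indexes described above.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="embeddings-partition-00",
    settings={
        "number_of_shards": 30,       # 30 primary shards per partition index
        "number_of_replicas": 1,      # one replica
        "refresh_interval": "2s",     # newly ingested chunks searchable in ~2s
    },
    mappings={
        "properties": {
            # index: false => no ANN structure; exact KNN runs via script_score.
            "embedding": {"type": "dense_vector", "dims": 1024, "index": False},
            "locale": {"type": "keyword"},
            "content_type": {"type": "keyword"},
            "customer_id": {"type": "keyword"},
            "chunk_text": {"type": "text"},
        }
    },
)

# On create/update, a piece of content is chunked and bulk-ingested.
chunks = [
    {
        "_index": "embeddings-partition-00",
        "_source": {
            "customer_id": "customer-123",
            "locale": "en",
            "content_type": "article",
            "chunk_text": "example chunk text",
            "embedding": [0.1] * 1024,  # placeholder embedding
        },
    },
]
helpers.bulk(es, chunks)
```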
Business Impact
The migration from S3/in-memory to Elasticsearch drove customer-visible response time improvements:
| Segment | Average Improvement | P95 Improvement |
|---|---|---|
| All Customers | ~25% | ~15% |
| Large Customers | ~56% | ~51% |

The system has also absorbed 10x increases in customer data volume and query traffic without architectural changes, cluster downtime, or significant operational incidents. Most importantly, all operational practices (tuning, upgrades, snapshotting, disaster recovery, and scaling) were already well understood from prior ES experience.
Conclusion: Practical Takeaways
When picking tools or infrastructure—whether it’s for vector search, databases, or anything else—these principles worked well for us:
- Leverage what your team already knows. Familiar tools let you move faster and reduce onboarding time, incidents, and painful outages. If your team already has expertise in a given technology, that’s a huge advantage.
- Expect trade-offs. Cutting-edge technology (like vector databases) might promise slightly lower latency or higher throughput, but it comes with new operational risks, documentation gaps, and complex migrations. Make sure those trade-offs are worth it for your business.
- Don’t underestimate the costs outside infrastructure. The time spent learning, debugging, and supporting a new system often outweighs the dollar savings or benchmark wins.
Our experience is not a call to avoid new technology that’s better suited to your situation, but reassurance that a “boring” choice is not necessarily the wrong one.