Vector databases have exploded in the past two years. There are both open-source and managed options, including Pinecone, Milvus, Qdrant, and Weaviate, which often claim greater scale, flexibility, and speed than traditional search platforms.
When we set out to design our AI retrieval systems, we began with first principles, not vendor promises. Our research showed that Elasticsearch struck the right balance of performance, cost, operability, and long-term maintainability – at least for our needs, and given our experience operating it at scale.
In this post, we share our rationale and two years of operational results from running Fin, the market’s most advanced AI agent.
Initial System: In-Memory Retrieval
In early versions, each customer’s content and embeddings lived on S3. For each inference request, we:
- Retrieved the customer’s S3 object
- Loaded all vectors into memory
- Ran a brute-force KNN search
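To make that flow concrete, here is a minimal sketch of the design in Python. The bucket name, key layout, serialization format, and helper names are assumptions for the example, not our actual code.

```python
# Illustrative sketch of the original S3 + in-memory brute-force KNN flow.
# Bucket name, key layout, and the .npz serialization are assumptions.
import io

import boto3
import numpy as np

s3 = boto3.client("s3")


def load_customer_vectors(customer_id: str) -> tuple[np.ndarray, list[str]]:
    """Download and deserialize all of a customer's embeddings from S3."""
    obj = s3.get_object(Bucket="example-embeddings", Key=f"{customer_id}/vectors.npz")
    data = np.load(io.BytesIO(obj["Body"].read()), allow_pickle=True)
    return data["embeddings"], list(data["passage_ids"])


def brute_force_knn(query: np.ndarray, embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact KNN via cosine similarity over every stored vector."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                      # cosine similarity against all vectors
    return np.argsort(-scores)[:k]      # indices of the top-k passages


# Per request: load everything, then search in memory.
embeddings, passage_ids = load_customer_vectors("customer-123")
query_vector = np.random.rand(768).astype(np.float32)  # dummy query embedding
top_passages = [passage_ids[i] for i in brute_force_knn(query_vector, embeddings)]
```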
This design was a good starting point:
- The “database” could handle any number of requests or spikes
- Iteration was simple
- Our scientists could experiment easily, leveraging familiar tools like notebooks and pandas
It was also battle-tested to some degree, having served our previous ‘Resolution Bot’ product (which used much smaller datasets) for years.
But as we added support for new content types and began generating content from conversations, the number of embeddings per customer soared. Soon, S3 download and deserialization times began to dominate request latency: for our largest customers, we saw spikes of 15 seconds just to load the vectors, before the search or LLM inference had even started.
Requirements & Constraints
Prior to committing to a particular technology, we defined concrete constraints:
- Scale: Target of 100M+ embeddings of 768 dimensions, accounting for accelerating growth
- Cost: Consider infrastructure and ongoing operational burden
- Filtering support: Must be able to filter retrieval (by language, content type, audience, etc.) to support our platform and permissions
Other nice-to-haves included:
- Run Less Software: We want to avoid undifferentiated heavy lifting
- Full text search: We might want to move to hybrid search in the future
- Predictability: Known scaling properties, failure modes, and monitoring
Looking back, even our optimistic scale estimate of 100 million embeddings was much too low!
Surveying the Landscape
We evaluated Pinecone, Milvus, Qdrant, Weaviate, and Elasticsearch. All had the basic features we required, so cost initially seemed like the main differentiator. At the time, we had about 20 million embeddings; we projected costs for 5x that scale—100 million embeddings.
Several promising open-source solutions offered theoretical top-end performance, but with potentially increased operability risks. The managed services tended to be more expensive: Pinecone and Milvus, for example, were estimated at roughly 2x the cost of the open-source alternatives.
In the end, the primary differentiator for us was operational familiarity and proven reliability. Elasticsearch, already core to several business-critical Intercom systems, presented clear advantages here.
Benchmarking: Vector Search in Elasticsearch
We began to look at standard benchmarks:
- Elasticsearch nightly vector search benchmarks (2M embeddings, 768D): search latency between 100ms (nightly-so_vector-script-score-query-match-all-latency) and 200ms (nightly-so_vector-script-score-query-acceptedAnswerId-filter71%-latency).
- ANN-Benchmarks: Elasticsearch sat in the center of the pack, neither the fastest nor the slowest.
We then verified these against our in-house benchmarks on production-like datasets, which gave similar results:
| Vectors Searched | Exact KNN Latency | Approximate KNN Latency |
|---|---|---|
| 20,000 | ~20ms | < 10ms |
| 300,000 | ~100ms | < 15ms |
Overall, this performance was good enough for us. For an AI agent, the bottleneck is generally the LLM’s latency, which is measured in seconds. Even if search latency dropped to zero, we would only save ~200ms, which would be imperceptible to the user in most cases.
Decision Criteria: Why We Chose Elasticsearch
Beyond raw benchmark numbers, the following factored heavily in the final decision:
Low Infrastructure Cost
Accounting just for the infrastructure, we estimated that Elasticsearch would cost us the least to run, and that the gap would widen as we scaled up, especially compared to the managed offerings.
Initially, it would be even cheaper to get started, since we planned to run on a shared, large-scale Elasticsearch cluster that Intercom already used for multiple use cases. We could then migrate to separate infrastructure when necessary.
Low Onboarding Cost
Since we would not be running a new distributed system, we didn’t need new runbooks and could avoid common scaling/maintenance issues and failure modes.
We could also reuse most of the tooling around running a database: disaster recovery, snapshots, replication.
Subject-Matter Expertise
Elasticsearch is one of the “core technologies” at Intercom. We have experience running it and subject-matter experts within the company who are familiar with how ES scales. They can proactively manage cluster health and tune capacity or performance using levers such as the number of shards in an index, the size of the data nodes, and so on.
Filterable Hybrid Search
The ability to compose vector search with structured filters using the existing ES query DSL is a requirement most vector-first databases do not easily fulfill. With Elasticsearch, we can combine structured filters and vector search with full-text search to improve the relevance of results in the future.
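As a rough illustration of what that composition could look like, here is a hedged sketch of a single query that applies structured filters, an optional full-text clause, and exact vector scoring. The index name, field names, and score weighting are assumptions for the example, not our production query.

```python
# Sketch: structured filters + full-text matching + exact vector scoring
# in one Elasticsearch query. Index and field names are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query_vector = [0.1] * 768  # embedding of the user's question (placeholder)

response = es.search(
    index="content-embeddings",
    size=10,
    query={
        "script_score": {
            "query": {
                "bool": {
                    # Structured filters: cheap, cacheable, applied first.
                    "filter": [
                        {"term": {"locale": "en"}},
                        {"term": {"audience": "customers"}},
                    ],
                    # Optional full-text clause for hybrid relevance.
                    "should": [{"match": {"body": "how do I reset my password"}}],
                }
            },
            # Blend BM25 (_score) with cosine similarity; the weight is illustrative.
            "script": {
                "source": "_score + 2.0 * (cosineSimilarity(params.qv, 'embedding') + 1.0)",
                "params": {"qv": query_vector},
            },
        }
    },
)
```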
Possibility to move to Approximate KNN in the Future
Since exact KNN performed adequately for us, we could start with it and avoid having to worry about recall, which simplified the migration. In the future, we could move to ANN for faster results.
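For context, switching to ANN in Elasticsearch is largely a mapping and query-shape change: the dense_vector field is indexed (building an HNSW structure) and queried via the knn search option instead of script_score. The sketch below shows what that could look like; the index name, field names, and parameters are illustrative assumptions, not our configuration.

```python
# Sketch of a possible move to approximate KNN (Elasticsearch 8.x).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Indexed dense_vector: ES builds an HNSW structure for approximate search.
es.indices.create(
    index="content-embeddings-ann",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
            }
        }
    },
)

# Approximate search via the top-level knn option, with the same filters.
response = es.search(
    index="content-embeddings-ann",
    knn={
        "field": "embedding",
        "query_vector": [0.1] * 768,   # placeholder query embedding
        "k": 10,
        "num_candidates": 100,
        "filter": [{"term": {"locale": "en"}}],
    },
)
```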
Production Results
Two years later, our scale exceeded even our most ambitious forecasts by 3x—a welcome problem.
| Dimension | Estimate | Current Scale |
|---|---|---|
| Number of Embeddings | 100 Million (768D) | 300 Million (1024D) |
| Cost per month | $4-8k (unblended) | ~$12k (unblended), ~$6k actual (with reservations) |
| Indexing (requests/minute) | – | Stable: ~800, Peak: ~15k |
| Search (requests/minute) | – | Stable: ~5k, Peak: ~14k |
| Search Latency | ~100ms – 200ms | Avg: ~30ms, P90: ~50ms, P99: ~200ms |
Note: The actual number of embeddings we store is ~600 Million, as we are continually experimenting with different embedding models and chunking strategies. The cost above covers this full setup: 600 Million embeddings, replicated once, with raw data and metadata. Since we don’t currently build an ANN index, which usually requires as much memory/disk space as the embeddings themselves, we use 300 Million as the number for a fair cost comparison with other vendors.
This setup costs us at least 3x less than other vendors (comparing just the unblended cost for fairness). For example, Qdrant’s estimate is ~$30k/month (not accounting for the capacity needed for experimentation), and Pinecone would be even costlier.
Specifics of our Architecture
- Architecture
- Data nodes: 9 x i4g.4xlarge
- Client/Master nodes: 3 of each, c6g.xlarge
- Index
- Embeddings are partitioned into 15 indexes, each with 30 primary shards and a replica.
- Refresh Interval: 2 seconds
- The embeddings field is a dense_vector with index: false, so no ANN structure is built (see the sketch after this list).
- Ingestion
- Ingest on every create/update.
- The average request processes 3 pieces of content (articles, files, etc.), which are chunked and bulk-ingested.
- In the future, we can buffer content updates for a few seconds before processing them to improve ingestion efficiency if needed.
- Querying
- Exact vector search using script_score with the vector-field functions (e.g. cosineSimilarity).
- Queries use filters for locale, content visibility for a given user, etc.
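Putting that configuration together, the sketch below shows roughly how one of the partition indexes could be created and written to with the Python client. The index name, field names, and document shape are simplified assumptions, not our production code.

```python
# Rough sketch of one of the 15 partition indexes described above.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="embeddings-partition-00",
    settings={
        "number_of_shards": 30,       # 30 primary shards per partition index
        "number_of_replicas": 1,      # one replica
        "refresh_interval": "2s",     # newly ingested chunks searchable in ~2s
    },
    mappings={
        "properties": {
            # index: false => no ANN structure; exact KNN runs via script_score.
            "embedding": {"type": "dense_vector", "dims": 1024, "index": False},
            "locale": {"type": "keyword"},
            "content_type": {"type": "keyword"},
            "customer_id": {"type": "keyword"},
            "chunk_text": {"type": "text"},
        }
    },
)

# On create/update, a piece of content is chunked and bulk-ingested.
chunks = [
    {
        "_index": "embeddings-partition-00",
        "_source": {
            "customer_id": "customer-123",
            "locale": "en",
            "content_type": "article",
            "chunk_text": "example chunk text",
            "embedding": [0.1] * 1024,  # placeholder embedding
        },
    },
]
helpers.bulk(es, chunks)
```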
Business Impact
The migration from S3/in-memory to Elasticsearch drove customer-visible response time improvements:
| Segment | Average Improvement | P95 Improvement |
|---|---|---|
| All Customers | ~25% | ~15% |
| Large Customers | ~56% | ~51% |

The system has also absorbed 10x increases in customer data volume and query traffic without architectural changes, cluster downtime, or significant operational incidents. Most importantly, all operational practices (tuning, upgrades, snapshotting, disaster recovery, and scaling) were already well understood from prior ES experience.
Conclusion: Practical Takeaways
When picking tools or infrastructure—whether it’s for vector search, databases, or anything else—these principles worked well for us:
- Leverage what your team already knows. Familiar tools let you move faster and reduce onboarding time, incidents, and painful outages. If your team already has expertise in a given technology, that’s a huge advantage.
- Expect trade-offs. Cutting-edge technology (like vector databases) might promise slightly lower latency or higher throughput, but it comes with new operational risks, documentation gaps, and complex migrations. Make sure those trade-offs are worth it for your business.
- Don’t underestimate the costs outside infrastructure. The time spent learning, debugging, and supporting a new system often outweighs the dollar savings or benchmark wins.
Our experience is not a call to avoid new technology that’s better suited to your situation, but reassurance that a “boring” choice is not necessarily the wrong one.