Insights and blogs from the AI Group building Fin at Intercom
Low-Rank Key Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity
In autoregressive decoding, each token requires repeatedly reading the KV cache from memory, and this cost scales linearly with sequence length, layer count, and head count. This post introduces Low-Rank Key-Value (LRKV) attention, a drop-in modification to multi-head attention (MHA) that reduces KV cache size by 45–53% compared with standard MHA, while achieving lower test loss across model scales (128M → 6.3B), faster convergence in training steps, and stronger downstream performance after supervised midtraining.
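The memory cost the abstract describes can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative arithmetic only, not the LRKV method from the post: it computes standard MHA KV cache size from the scaling factors named above (layers, heads, sequence length), and a hypothetical low-rank variant where the cached per-head K/V factors have rank `r` instead of the full head dimension (the rank and cache layout are assumptions).

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, dtype_bytes=2):
    # Standard MHA: one K and one V vector per head, per layer, per token.
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes

def low_rank_kv_cache_bytes(n_layers, n_heads, rank, seq_len, dtype_bytes=2):
    # Hypothetical low-rank cache: store rank-r K/V factors per head instead
    # of full head_dim vectors (layout is an assumption, not from the post).
    return 2 * n_layers * n_heads * rank * seq_len * dtype_bytes

# Example: 32 layers, 32 heads of dim 128, 8k context, fp16 (2 bytes).
full = kv_cache_bytes(32, 32, 128, 8192)
low = low_rank_kv_cache_bytes(32, 32, 64, 8192)  # rank = head_dim / 2
print(f"full: {full / 2**30:.1f} GiB, "
      f"low-rank: {low / 2**30:.1f} GiB, "
      f"saving: {1 - low / full:.0%}")
# → full: 4.0 GiB, low-rank: 2.0 GiB, saving: 50%
```

At rank = head_dim / 2 the cache halves, which lands in the 45–53% savings range the post reports; the actual mechanism and rank choices are detailed in the article itself.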
Articles
Podcast EP2: Shipping reliable AI actions
2025.09.19
Podcast EP1: Closing the loop
2025.09.18
How We Built a World-Class Reranker for Fin
2025.09.11
Finetuning Retrieval for Fin
2025.09.11
David vs Goliath: are small LLMs any good?
2025.09.11
Building out Intercom’s AI infra
2025.09.11
Cost of Serving LLMs
2025.09.10
Think Fast: Reasoning at 3ms a Token
2025.07.18
Do you really need a Vector Search Database?
2025.04.29