Insights and blogs from the AI Group building Fin at Intercom

Low-Rank Key-Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity

In autoregressive decoding, each new token requires re-reading the entire KV cache from memory, and this cost scales linearly with sequence length, layer count, and head count. This post introduces Low-Rank Key-Value (LRKV) attention, a drop-in modification to multi-head attention that reduces KV cache size by 45–53% relative to standard MHA, while achieving lower test loss across model scales (128M to 6.3B parameters), faster convergence in training steps, and stronger downstream performance after supervised midtraining.
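This teaser does not spell out the LRKV parameterization, so the following is only an illustrative sketch of the general idea of low-rank KV compression: instead of caching full per-head keys and values, cache a single shared low-rank latent per token and expand it back to per-head K and V at read time. All names (`W_kv`, `W_uk`, `W_uv`) and the rank choice are hypothetical, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): rank-64 shared latent vs 2 * 4 heads * 16 dims.
d_model, n_heads, d_head, r, T = 64, 4, 16, 64, 10

# Query projection per head; a shared down-projection producing the cached
# latent; per-head up-projections that reconstruct K and V from the latent.
W_q  = rng.standard_normal((n_heads, d_model, d_head)) / np.sqrt(d_model)
W_kv = rng.standard_normal((d_model, r)) / np.sqrt(d_model)
W_uk = rng.standard_normal((n_heads, r, d_head)) / np.sqrt(r)
W_uv = rng.standard_normal((n_heads, r, d_head)) / np.sqrt(r)

x = rng.standard_normal((T, d_model))        # context token activations
cache = x @ W_kv                             # (T, r): the only KV state kept

# One decode step, attending from the last token over the cached context.
q = np.einsum("d,hde->he", x[-1], W_q)       # (H, d_head)
k = np.einsum("tr,hre->hte", cache, W_uk)    # (H, T, d_head), rebuilt on read
v = np.einsum("tr,hre->hte", cache, W_uv)
scores = np.einsum("he,hte->ht", q, k) / np.sqrt(d_head)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)      # softmax over context positions
out = np.einsum("ht,hte->he", attn, v)       # (H, d_head) attention output

full_cache = 2 * n_heads * d_head            # floats per token, standard MHA
print(out.shape, 1 - r / full_cache)         # (4, 16) 0.5 -> 50% cache saved
```

With these toy sizes the shared latent halves the cached floats per token, in the same ballpark as the 45–53% reduction the post reports; the actual mechanism behind LRKV's savings and its head-diversity properties are detailed in the full article.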
