Insights and blogs from the AI Group building Fin at Intercom

Low-Rank Key-Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity

In autoregressive decoding, each new token requires re-reading the entire KV cache from memory, and this cost scales linearly with sequence length, layer count, and head count. This post introduces Low-Rank Key-Value (LRKV) attention, a drop-in modification to multi-head attention that reduces KV cache size by 45–53% relative to standard MHA, while achieving lower test loss across model scales (128M to 6.3B parameters), faster convergence in training steps, and stronger downstream performance after supervised midtraining.
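This teaser does not spell out the LRKV parameterization, so the following is only an illustrative sketch of the general idea of low-rank KV compression: instead of caching full per-head keys and values, cache a single shared low-rank latent per token and expand it back to per-head K and V at read time. All names (`W_kv`, `W_uk`, `W_uv`) and the rank choice are hypothetical, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): rank-64 shared latent vs 2 * 4 heads * 16 dims.
d_model, n_heads, d_head, r, T = 64, 4, 16, 64, 10

# Query projection per head; a shared down-projection producing the cached
# latent; per-head up-projections that reconstruct K and V from the latent.
W_q  = rng.standard_normal((n_heads, d_model, d_head)) / np.sqrt(d_model)
W_kv = rng.standard_normal((d_model, r)) / np.sqrt(d_model)
W_uk = rng.standard_normal((n_heads, r, d_head)) / np.sqrt(r)
W_uv = rng.standard_normal((n_heads, r, d_head)) / np.sqrt(r)

x = rng.standard_normal((T, d_model))        # context token activations
cache = x @ W_kv                             # (T, r): the only KV state kept

# One decode step, attending from the last token over the cached context.
q = np.einsum("d,hde->he", x[-1], W_q)       # (H, d_head)
k = np.einsum("tr,hre->hte", cache, W_uk)    # (H, T, d_head), rebuilt on read
v = np.einsum("tr,hre->hte", cache, W_uv)
scores = np.einsum("he,hte->ht", q, k) / np.sqrt(d_head)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)      # softmax over context positions
out = np.einsum("ht,hte->he", attn, v)       # (H, d_head) attention output

full_cache = 2 * n_heads * d_head            # floats per token, standard MHA
print(out.shape, 1 - r / full_cache)         # (4, 16) 0.5 -> 50% cache saved
```

With these toy sizes the shared latent halves the cached floats per token, in the same ballpark as the 45–53% reduction the post reports; the actual mechanism behind LRKV's savings and its head-diversity properties are detailed in the full article.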
