{"id":575,"date":"2026-04-09T17:04:00","date_gmt":"2026-04-09T17:04:00","guid":{"rendered":"https:\/\/fin.ai\/research\/?p=575"},"modified":"2026-04-09T17:18:26","modified_gmt":"2026-04-09T17:18:26","slug":"low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity","status":"publish","type":"post","link":"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/","title":{"rendered":"Low-Rank Key Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity"},"content":{"rendered":"\n\n\n\n\n<h2 id=\"introduction\" class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>Large Transformer inference is increasingly <strong>memory-bandwidth bound rather than compute-bound<\/strong>. In autoregressive decoding, each token requires repeatedly reading the KV cache from memory, and this cost scales linearly with sequence length, layers, and head count. In long-context settings, the KV cache can rival, or exceed, the model\u2019s parameter memory, making memory movement, not FLOPs, the dominant bottleneck.<\/p>\n\n\n\n<p>This post introduces <strong>Low-Rank Key-Value (LRKV) attention<\/strong>, a drop-in modification to multi-head attention that reduces KV cache size by <strong>45\u201353%<\/strong> vs standard MHA, while achieving <strong>lower test loss<\/strong> across model scales (128M \u2192 6.3B), faster convergence in training steps, and stronger downstream performance after supervised midtraining.<\/p>\n\n\n\n<p>The key idea is that attention heads are not independent. There\u2019s structured redundancy across heads &#8211; yet fully sharing keys\/values (like in MQA\/GQA) can constrain expressivity. 
LRKV instead exploits redundancy using a <strong>shared full-rank KV basis<\/strong> plus <strong>head-specific low-rank residuals<\/strong>, yielding a <strong>continuous knob<\/strong> between complete sharing and full per-head independence.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"486\" src=\"https:\/\/s47652.pcdn.co\/research\/wp-content\/uploads\/2026\/04\/Code_Generated_Image.jpeg\" alt=\"\" class=\"wp-image-582\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/Code_Generated_Image.jpeg 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/Code_Generated_Image-300x142.jpeg 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/Code_Generated_Image-768x365.jpeg 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Comparison of attention mechanisms. Standard MHA uses H independent K\/V projections, one per head (high KV cache). MQA\/GQA share K\/V (low cache, reduced head-specific detail). 
<strong>LRKV combines a shared full-rank projection with head-specific low-rank residuals, achieving cache cost 2L(d<sub>h<\/sub> + Hr) while preserving head diversity.<\/strong><\/figcaption><\/figure>\n\n\n\n<h2 id=\"why-this-matters-the-kv-cache-bottleneck\" class=\"wp-block-heading\">Why this matters: <strong>the KV cache bottleneck<\/strong><\/h2>\n\n\n\n<p>In autoregressive decoding, each layer caches keys and values for all previously generated tokens.<\/p>\n\n\n\n<p>For a sequence length <strong>L<\/strong>, number of heads <strong>H<\/strong>, per-head dimension <strong>d<sub>h<\/sub><\/strong>, and hidden dimension <strong>d<\/strong> = <strong>Hd<sub>h<\/sub><\/strong>, standard multi-head attention (MHA) caches, for each head, keys and values $$\\mathbf{K}_h, \\mathbf{V}_h \\in \\mathbb{R}^{L \\times d_h}$$<\/p>\n\n\n\n<p>So the KV cache memory per layer scales as:<\/p>\n\n\n\n<p>$$M_{\\text{standard}} = 2 L H d_h = 2 L d$$<\/p>\n\n\n\n<p>Existing methods such as MQA (Multi-Query Attention) and GQA (Grouped-Query Attention) reduce cache size by sharing K\/V across heads (or groups). 
This often improves throughput, but it forces heads to look through the same K\/V representations, reducing representational diversity.<\/p>\n\n\n\n<p>Empirically and theoretically, we know that heads specialize: different heads represent complementary syntactic and semantic patterns, and so fully sharing K\/V across attention heads can degrade capabilities such as code generation and structured reasoning.<\/p>\n\n\n\n<p>At the same time, we know that head specialization is not fully independent: recent analyses show high correlation and overlapping subspaces across head projections, so the redundancy is structured rather than random.<\/p>\n\n\n\n<p>This motivates the central question:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>Can we compress the KV cache by exploiting cross-head redundancy without compromising head specialization?<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p>This brings us to our main contribution, <em>Low-Rank Key Value Attention (LRKV)<\/em>.<\/p>\n\n\n\n<h2 id=\"background\" class=\"wp-block-heading\">Background<\/h2>\n\n\n\n<p>Before discussing our proposed LRKV attention mechanism, we give a concise refresher on the well-established baselines that have been the mainstay mechanisms used in Transformers. If you are already familiar with Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped Query Attention (GQA) and Multi-Latent Attention (MLA), you can skip this section.<\/p>\n\n\n\n<p>For <strong>Multi-Head Attention<\/strong>, each attention head maintains its own independent key and value projection matrices. 
Standard attention uses per-head key and value projections:<\/p>\n\n\n\n<p>$$\\mathbf{K}_h = \\mathbf{X} \\mathbf{W}_h^K, \\quad \\mathbf{V}_h = \\mathbf{X} \\mathbf{W}_h^V \\quad \\text{where} \\quad \\mathbf{W}_h^{K,V} \\in \\mathbb{R}^{d \\times d_h}$$<\/p>\n\n\n\n<p>The full projection matrix is formed by concatenating H independent weight blocks side by side, i.e. there is no parameter sharing whatsoever between heads. This gives each head complete freedom to learn specialised key\/value representations, making MHA the most expressive configuration available. However, this expressiveness comes at a steep cost: the KV cache scales linearly with the number of heads (\\(2Hd_h\\) values stored per token), making it the most memory-intensive option during long-context inference.<\/p>\n\n\n\n<p>All other attention variants can be understood as different strategies for reducing this cost while preserving as much of MHA&#8217;s expressiveness as possible.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"976\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mha_diagram-2-1024x976.png\" alt=\"\" class=\"wp-image-618\" style=\"width:459px;height:auto\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mha_diagram-2-1024x976.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mha_diagram-2-300x286.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mha_diagram-2-768x732.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mha_diagram-2-1536x1464.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mha_diagram-2-2048x1953.png 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mha_diagram-2-1320x1259.png 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Multi-Query Attention<\/strong> takes the most aggressive approach to KV cache 
reduction: all heads share a single key and value projection. Each head still computes its own query, preserving some capacity for diverse attention patterns. However, every head attends over the exact same keys and values. The KV cache shrinks from \\(2Hd_h\\) to just \\(2d_h\\) per token &#8211; an \\(H\\)-fold reduction in cache size that is transformative for long-context inference where cache memory is the primary bottleneck.<\/p>\n\n\n\n<p>But the trade-off is severe: because all heads attend over identical keys and values, the only way they can differentiate is through their query projections. In practice, heads tend to converge on similar attention patterns, reducing the model&#8217;s ability to capture diverse linguistic phenomena simultaneously. In fact, in our paper (linked at the end), we show quantitatively that, because K and V are identical across heads, the query parameters are forced to diversify more than in less restrictive mechanisms.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"981\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mqa_diagram-2-1024x981.png\" alt=\"\" class=\"wp-image-619\" style=\"width:501px;height:auto\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mqa_diagram-2-1024x981.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mqa_diagram-2-300x287.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mqa_diagram-2-768x735.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mqa_diagram-2-1536x1471.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mqa_diagram-2-2048x1961.png 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mqa_diagram-2-1320x1264.png 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Grouped-Query Attention<\/strong> offers a middle ground between MHA and MQA. 
Attention heads are divided into G groups, and all heads within the same group share a single key and value projection. In the figure below, \\(G = 2\\): heads 1 &amp; 2 share one projection while heads 3 &amp; 4 share another. This allows groups to specialize in different roles. For instance, one group might focus on local syntax while another captures long-range dependencies. The KV cache reduces from \\(2Hd_h\\) to \\(2Gd_h\\) per token. In practice, G is typically set to H\/4 or H\/8, yielding a 4\u20138\u00d7 cache reduction.<\/p>\n\n\n\n<p>The key limitation of this approach is that sharing boundaries are fixed at architecture design time and cannot adapt to the data. Heads assigned to the same group must share representations regardless of whether their learned roles are compatible. This is a coarse-grained constraint that limits flexibility compared to methods that can adapt per-head.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"981\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/gqa_diagram-2-1024x981.png\" alt=\"\" class=\"wp-image-620\" style=\"width:487px;height:auto\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/gqa_diagram-2-1024x981.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/gqa_diagram-2-300x287.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/gqa_diagram-2-768x735.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/gqa_diagram-2-1536x1471.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/gqa_diagram-2-2048x1961.png 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/gqa_diagram-2-1320x1264.png 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Introduced in DeepSeek-V2, <strong>Multi-Latent Attention<\/strong> takes a fundamentally different approach to KV cache 
compression. Rather than sharing projections between heads (like MQA\/GQA), MLA compresses all key\/value information into a single low-dimensional latent vector per token. During inference, only this compact latent is stored in the KV cache. When attention is computed, the full per-head keys and values are reconstructed on the fly via learned up-projection matrices. This decouples the cache cost from the number of heads entirely.<\/p>\n\n\n\n<p>This architecture works in two stages. First, a down-projection compresses each token into a latent of dimension \\(d_c\\) (much smaller than \\(d\\)). Second, separate up-projections reconstruct per-head keys and values from this shared latent. The effective key projection &#8211; the down-projection composed with the up-projections &#8211; is a full \\(d \\times Hd_h\\) matrix, but its rank is at most \\(d_c\\). This resembles MHA&#8217;s full projection but lives in a lower-dimensional subspace.<\/p>\n\n\n\n<p>There&#8217;s a fundamental trade-off here: heads can specialise (unlike MQA\/GQA) &#8211; but they must reconstruct their keys and values from a shared compressed representation. Any per-head information that does not survive this bottleneck is permanently lost at inference time. 
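<\/p>\n\n\n\n<p>To make the cache arithmetic of these baselines concrete, here is a minimal sketch; the dimensions are illustrative assumptions, not values from this post, and the per-token counts follow directly from the formulas above:<\/p>\n\n\n\n

```python
# Per-token, per-layer KV cache entries for the baseline attention variants.
# All dimensions below are illustrative assumptions, not values from the post.
H   = 32    # attention heads
d_h = 128   # per-head dimension (d = H * d_h = 4096)
G   = 8     # GQA groups
d_c = 512   # MLA latent dimension

cache_entries = {
    "MHA": 2 * H * d_h,  # independent K and V for every head
    "MQA": 2 * d_h,      # one shared K and one shared V
    "GQA": 2 * G * d_h,  # one K and V per group of heads
    "MLA": d_c,          # a single compressed latent per token
}
for name, n in cache_entries.items():
    print(f"{name}: {n} values per token ({n / cache_entries['MHA']:.1%} of MHA)")
```

\n\n\n\n<p>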
We can compare this with LRKV, where each head&#8217;s low-rank correction is baked directly into the weight matrix &#8211; getting around the information bottleneck during inference.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"577\" height=\"1024\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mla_diagram-11-577x1024.png\" alt=\"\" class=\"wp-image-624\" style=\"width:506px;height:auto\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mla_diagram-11-577x1024.png 577w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mla_diagram-11-169x300.png 169w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mla_diagram-11-768x1362.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mla_diagram-11-866x1536.png 866w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mla_diagram-11-1155x2048.png 1155w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/mla_diagram-11-1320x2342.png 1320w\" sizes=\"auto, (max-width: 577px) 100vw, 577px\" \/><\/figure>\n\n\n\n<p>Given this background on related attention mechanisms,  we now move to explaining LRKV. 
<\/p>\n\n\n\n<h2 id=\"lrkv-parameterization-shared-full-rank-base-per-head-low-rank-residual\" class=\"wp-block-heading\"><strong>LRKV parameterization: shared full-rank base + per-head low-rank residual<\/strong><\/h2>\n\n\n\n<p>LRKV factorizes each head\u2019s key and value projection into a shared full-rank base plus a low-rank residual:<\/p>\n\n\n\n<p>$$\\mathbf{W}_h^K = \\mathbf{W}_{\\text{shared}}^K + \\mathbf{U}_h^K (\\mathbf{B}_h^K)^\\top, \\quad \\mathbf{W}_h^V = \\mathbf{W}_{\\text{shared}}^V + \\mathbf{U}_h^V (\\mathbf{B}_h^V)^\\top$$<\/p>\n\n\n\n<p>Here <strong>W<sub>shared<\/sub><\/strong> is dense, full-rank, and shared across all heads in the layer, with per-head factors<\/p>\n\n\n\n<p>$$\\mathbf{U}_h^{K,V} \\in \\mathbb{R}^{d \\times r}, \\quad<br>\\mathbf{B}_h^{K,V} \\in \\mathbb{R}^{d_h \\times r}$$<\/p>\n\n\n\n<p>Finally, \\(r \\ll d_h\\) is the residual rank.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"502\" height=\"1024\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/lrkv_diagram-5-502x1024.png\" alt=\"\" class=\"wp-image-630\" style=\"width:497px;height:auto\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/lrkv_diagram-5-502x1024.png 502w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/lrkv_diagram-5-147x300.png 147w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/lrkv_diagram-5-768x1567.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/lrkv_diagram-5-753x1536.png 753w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/lrkv_diagram-5-1004x2048.png 1004w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/lrkv_diagram-5-1320x2693.png 1320w\" sizes=\"auto, (max-width: 502px) 100vw, 502px\" \/><\/figure>\n\n\n\n<p>Our interpretation is that the shared base learns a <strong>global KV basis<\/strong> for the layer, and each head then learns a small low-rank deviation from that basis. 
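<\/p>\n\n\n\n<p>A minimal numpy sketch of this parameterization, with illustrative (hypothetical) sizes, showing that each head's projection is the shared base plus a rank-\\(r\\) correction, and that \\(r = 0\\) collapses every head onto the shared projection:<\/p>\n\n\n\n

```python
import numpy as np

# LRKV key-projection parameterization (illustrative sizes, not our configs).
d, d_h, H, r = 256, 32, 8, 4      # hidden dim, head dim, heads, residual rank
rng = np.random.default_rng(0)

W_shared = rng.standard_normal((d, d_h))   # dense, full-rank, shared base
U = rng.standard_normal((H, d, r))         # per-head factors U_h
B = rng.standard_normal((H, d_h, r))       # per-head factors B_h

# Effective per-head projections: shared base plus a rank-r residual.
W_heads = W_shared + U @ B.transpose(0, 2, 1)    # shape (H, d, d_h)
assert W_heads.shape == (H, d, d_h)

# Each head deviates from the shared base by a matrix of rank at most r.
assert np.linalg.matrix_rank(W_heads[0] - W_shared) == r

# r = 0 recovers full K sharing within the layer (the MQA-style limit).
W_r0 = W_shared + np.zeros((H, d, 0)) @ np.zeros((H, 0, d_h))
assert np.allclose(W_r0, W_shared)
```

\n\n\n\n<p>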
This gives a <em>continuous interpolation<\/em>: \\(r = 0\\) reduces to full sharing of K\/V within a layer (MQA-style limit) and increasing \\(r\\) increases head-specific capacity, moving toward MHA.<\/p>\n\n\n\n<p>A subtle but important design choice: LRKV applies the factorization only to <strong>K and V<\/strong> (the cached parts). Queries remain unconstrained (per-head), preserving attention expressivity while targeting the true inference bottleneck.<\/p>\n\n\n\n<h2 id=\"kv-caching-in-lrkv\" class=\"wp-block-heading\"><strong>KV caching in LRKV<\/strong><\/h2>\n\n\n\n<p>At inference time, we want to avoid caching <strong>K<sub>h<\/sub><\/strong> and <strong>V<sub>h<\/sub><\/strong> for every head and every token. LRKV caches shared features once per layer:<\/p>\n\n\n\n<p>$$ \\mathbf{K}_{\\text{shared}} = \\mathbf{X} W_{\\text{shared}}^K \\in \\mathbb{R}^{L \\times d_h}, \\quad \\mathbf{V}_{\\text{shared}} = \\mathbf{X} W_{\\text{shared}}^V \\in \\mathbb{R}^{L \\times d_h} $$<\/p>\n\n\n\n<p>Per-head latents:<\/p>\n\n\n\n<p>$$\\mathbf{R}_h^K = \\mathbf{X} \\mathbf{U}_h^K \\in \\mathbb{R}^{L \\times r}, \\quad<br>\\mathbf{R}_h^V = \\mathbf{X} \\mathbf{U}_h^V \\in \\mathbb{R}^{L \\times r}$$<\/p>\n\n\n\n<p>Then the implied per-head features are:<\/p>\n\n\n\n<p>$$ \\mathbf{K}_h = \\mathbf{K}_{\\text{shared}} + \\mathbf{R}_h^K (\\mathbf{B}_h^K)^\\top, \\quad \\mathbf{V}_h = \\mathbf{V}_{\\text{shared}} + \\mathbf{R}_h^V (\\mathbf{B}_h^V)^\\top $$<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p>Crucially, <strong>B<\/strong><sub>h<\/sub> are parameters, not cached per token. 
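<\/p>\n\n\n\n<p>As a quick numerical check of this bookkeeping (toy shapes, numpy standing in for the real kernels), the keys implied by the cached tensors match a direct projection with the full factorized weights:<\/p>\n\n\n\n

```python
import numpy as np

# Check: cached shared features plus per-head latents reproduce the
# per-head keys exactly (toy sizes, illustrative only).
L, d, d_h, r = 10, 64, 16, 4
rng = np.random.default_rng(1)

X = rng.standard_normal((L, d))            # token representations
W_shared = rng.standard_normal((d, d_h))   # shared base (a parameter)
U_h = rng.standard_normal((d, r))          # one head's down factor
B_h = rng.standard_normal((d_h, r))        # one head's up factor (a parameter)

K_shared = X @ W_shared    # cached once per layer, shape (L, d_h)
R_h = X @ U_h              # cached per head, shape (L, r)

# Keys implied by the cache equal a direct projection with the
# factorized weight W_h = W_shared + U_h B_h^T.
K_h_cache = K_shared + R_h @ B_h.T
K_h_direct = X @ (W_shared + U_h @ B_h.T)
assert np.allclose(K_h_cache, K_h_direct)
```

\n\n\n\n<p>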
So the cache memory becomes:<\/p>\n\n\n\n<p>$$M_{\\text{LRKV}} = \\underbrace{2 L d_h}_{\\text{shared } K,V} + \\underbrace{2 L H r}_{\\text{per-head latents}} = 2L(d_h + Hr)$$<\/p>\n\n\n\n<p>Relative to standard MHA:<\/p>\n\n\n\n<p>$$\\frac{M_{\\text{LRKV}}}{M_{\\text{standard}}} = \\frac{d_h + Hr}{H d_h} = \\frac{1}{H} + \\frac{r}{d_h}.$$<\/p>\n\n\n\n<p>This is the cleanest engineering knob LRKV provides: for fixed H and d<sub>h<\/sub>, the residual rank r trades cache size for per-head flexibility.<\/p>\n\n\n\n<h2 id=\"exact-attention-without-explicitly-reconstructing-full-k-v-tensors\" class=\"wp-block-heading\"><strong>Exact attention without explicitly reconstructing full K\/V tensors<\/strong><\/h2>\n\n\n\n<p>Naively reconstructing the full per-head keys and values for every cached token would erase the practical gain. Instead, LRKV computes logits and outputs exactly via associativity.<\/p>\n\n\n\n<p>For a decoding step with query <strong>q<sub>h<\/sub><\/strong>, the attention logits are:<\/p>\n\n\n\n<p>$$\\mathbf{q}_h \\mathbf{K}_h^\\top=\\mathbf{q}_h\\mathbf{K}_{\\text{shared}}^\\top+(\\mathbf{q}_h \\mathbf{B}_h^K)(\\mathbf{R}_h^K)^\\top, \\quad \\mathbf{q}_h \\in \\mathbb{R}^{1 \\times d_h}$$<\/p>\n\n\n\n<p>For attention weights <strong>a<sub>h<\/sub><\/strong>, the value aggregation is:<\/p>\n\n\n\n<p>$$\\mathbf{a}_h \\mathbf{V}_h=\\mathbf{a}_h \\mathbf{V}_{\\text{shared}}+(\\mathbf{a}_h \\mathbf{R}_h^V)(\\mathbf{B}_h^V)^\\top, \\quad \\mathbf{a}_h \\in \\mathbb{R}^{1 \\times L}$$<\/p>\n\n\n\n<p>These expressions compute the same outputs as full reconstruction, but operate on smaller cached tensors:<br>$$\\mathbf{K}_{\\text{shared}}, \\mathbf{V}_{\\text{shared}} \\in \\mathbb{R}^{L \\times d_h} \\quad \\text{and} \\quad \\mathbf{R}_h^K, \\mathbf{R}_h^V \\in \\mathbb{R}^{L \\times r}.$$<\/p>\n\n\n\n<p>That is precisely where the memory win is realized.<\/p>\n\n\n\n<h2 id=\"what-does-lrkv-cost-in-compute\" class=\"wp-block-heading\"><strong>What does LRKV cost in 
compute?<\/strong><\/h2>\n\n\n\n<p>During decoding, standard MHA\u2019s dominant per-head cost scales as:<\/p>\n\n\n\n<p>$$O(L d_h), \\quad \\text{while LRKV adds } O(L r + r d_h).$$<\/p>\n\n\n\n<p>For long contexts: $$L \\gg 1 \\quad \\Rightarrow \\quad O(L r) \\text{ dominates, giving overhead } \\sim \\frac{r}{d_h}.$$<\/p>\n\n\n\n<p>In modern inference, we\u2019re often <strong>memory bandwidth bound<\/strong>, so reducing the bytes moved can dominate a modest FLOP increase. LRKV reduces total bytes read from cache proportionally to its cache reduction: it reads <strong>two shared tensors<\/strong> plus <strong>small per-head latents<\/strong> instead of full per-head K\/V.<\/p>\n\n\n\n<h2 id=\"results\" class=\"wp-block-heading\"><strong>Results<\/strong><\/h2>\n\n\n\n<p>With the design space now fully mapped, from MHA&#8217;s full independence, through MQA and GQA&#8217;s discrete sharing strategies, to MLA&#8217;s latent bottleneck and our proposed LRKV&#8217;s continuous low-rank interpolation, the natural question is: <strong>how do these architectural choices actually play out in practice?<\/strong> We evaluate all five methods under identical pretraining and midtraining conditions across three model scales (128M, 2.5B, and 6.3B&nbsp;parameters), measuring both pretraining loss and downstream task performance on five diverse benchmarks. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Cross-scale pretraining curves<\/strong><\/h3>\n\n\n\n<p>Our results are very encouraging: LRKV reaches each baseline&#8217;s final validation performance 18-30% faster, averaging 23.6% training compute savings across all baselines while achieving better final performance. 
Critically, this reveals an asymmetric advantage: LRKV reaches any baseline&#8217;s performance target early in training, but no baseline reaches LRKV&#8217;s final performance (0.719 BPB) even after the full token budget.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"233\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_cross_scale_ce_loss_beautiful-1024x233.jpg\" alt=\"\" class=\"wp-image-583\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_cross_scale_ce_loss_beautiful-1024x233.jpg 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_cross_scale_ce_loss_beautiful-300x68.jpg 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_cross_scale_ce_loss_beautiful-768x175.jpg 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_cross_scale_ce_loss_beautiful-1536x350.jpg 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_cross_scale_ce_loss_beautiful-2048x466.jpg 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_cross_scale_ce_loss_beautiful-1320x300.jpg 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Cross-scale pretraining curves (128M, 1.2B, 2.5B, 6.3B). Test cross-entropy loss vs training tokens (and compute). 
LRKV is consistently competitive, achieving the lowest test loss at multiple scales.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"433\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_training_efficiency_analysis-1-1024x433.jpg\" alt=\"\" class=\"wp-image-590\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_training_efficiency_analysis-1-1024x433.jpg 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_training_efficiency_analysis-1-300x127.jpg 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_training_efficiency_analysis-1-768x325.jpg 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_training_efficiency_analysis-1-1536x650.jpg 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_training_efficiency_analysis-1-2048x867.jpg 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_training_efficiency_analysis-1-1320x559.jpg 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">LRKV achieves superior training efficiency alongside best performance (2.5B scale).<br><strong>Memory vs Performance<\/strong> (left): Test BPB versus KV cache percentage for all methods. LRKV achieves optimal trade-off with lowest BPB at 48.4% cache usage (2.5B scale).<strong> Training Efficiency Advantage<\/strong> (right): LRKV reaches each baseline&#8217;s final test loss, quantifying training compute savings. 
LRKV reaches all baselines&#8217; performance earlier.<\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Across 128M \u2192 6.3B, LRKV reaches <strong>lower test loss<\/strong> than MHA, MQA\/GQA, and MLA, while using <strong>45\u201353%<\/strong> of MHA KV cache.<\/li>\n\n\n\n<li>LRKV reaches equivalent baseline quality <strong>18\u201325% faster<\/strong> (in training steps), i.e., better sample efficiency.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Training efficiency &amp; memory\/performance tradeoff <\/h3>\n\n\n\n<p>The residual rank <strong><em>r <\/em><\/strong>controls the tradeoff. The ablation shows monotonic improvement with larger rank, and a strong Pareto frontier relative to other KV-efficient methods.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"431\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_ablation_cache_and_curves-1024x431.jpg\" alt=\"\" class=\"wp-image-584\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_ablation_cache_and_curves-1024x431.jpg 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_ablation_cache_and_curves-300x126.jpg 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_ablation_cache_and_curves-768x323.jpg 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_ablation_cache_and_curves-1536x646.jpg 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_ablation_cache_and_curves-2048x861.jpg 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_ablation_cache_and_curves-1320x555.jpg 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">LRKV rank ablations: final test loss vs cache size, and training curves by rank. 
LRKV dominates the memory\/performance tradeoff space across ranks.<\/figcaption><\/figure>\n\n\n\n<p>LRKV appears to be not merely \u201ca compression trick\u201d, but a bias that improves optimization and\/or effective capacity under the same token budget. Empirically, we see a consistent \u201cuseful rank\u201d regime of \\(r \\approx 0.36\\text{\u2013}0.43 \\times d_h\\): the threshold at which LRKV matches or exceeds MHA while still delivering ~50% cache reduction.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Long-context pretraining<\/strong><\/h3>\n\n\n\n<p>At longer sequence lengths, the benefits of KV-efficient attention become more pronounced. The figure below shows that LRKV not only maintains its advantage over standard MHA, but actually widens the gap in the long-context regime. At 8K context, LRKV achieves lower test loss while using roughly half the KV cache, outperforming both MHA and other KV-efficient baselines such as MQA, GQA, and MLA. This suggests that the low-rank decomposition is not merely compressing redundant structure, but acting as an effective inductive bias for long-range modeling. 
As context length increases and KV cache pressure becomes the dominant bottleneck, LRKV\u2019s combination of memory efficiency and preserved head diversity translates directly into improved modeling performance.<strong><br><\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"671\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_512m_8k_context_ce_loss_beautiful-1024x671.jpg\" alt=\"\" class=\"wp-image-588\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_512m_8k_context_ce_loss_beautiful-1024x671.jpg 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_512m_8k_context_ce_loss_beautiful-300x196.jpg 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_512m_8k_context_ce_loss_beautiful-768x503.jpg 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_512m_8k_context_ce_loss_beautiful-1536x1006.jpg 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_512m_8k_context_ce_loss_beautiful-2048x1341.jpg 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_512m_8k_context_ce_loss_beautiful-1320x864.jpg 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">512M model trained at 8k context. LRKV widens the advantage over baselines in long-context settings, where KV cache pressure is highest.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Downstream Task Performance <\/strong><\/h3>\n\n\n\n<p>A reasonable question is whether these gains turn into better downstream task performance. 
On a standardized evaluation after supervised midtraining, LRKV achieved the highest combined score across ARC, MMLU, GSM8K, and HumanEval, confirming that its improved pretraining efficiency translates into better downstream performance.<\/p>\n\n\n\n<p>The figure below shows LRKV consistently achieves the highest combined accuracy at every scale &#8211; 18.9% (128M), 37.9% (2.5B), and 40.2% (6.3B) &#8211; demonstrating that the pretraining gains afforded by LRKV transfer reliably to downstream capabilities. At the 2.5B and 6.3B scales, LRKV leads on four of five benchmarks, with particularly strong margins on knowledge-intensive tasks: at 6.3B it surpasses the next-best method by +2.3 points on ARC-Easy, +4.4 on ARC-Challenge, and +1.8 on MMLU. Notably, MQA suffers a pronounced collapse on HumanEval at all three scales (2.4%, 3.7%, 4.3%). Crucially, the gap between LRKV and competing methods widens with scale, rising from +0.9 points over MLA at 128M to +3.3 at 6.3B, reinforcing the conclusion that LRKV&#8217;s architectural advantages compound as model capacity increases.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"618\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/midtraining_results_v2-1-1024x618.png\" alt=\"\" class=\"wp-image-596\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/midtraining_results_v2-1-1024x618.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/midtraining_results_v2-1-300x181.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/midtraining_results_v2-1-768x463.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/midtraining_results_v2-1-1536x926.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/midtraining_results_v2-1-2048x1235.png 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/midtraining_results_v2-1-1320x796.png 1320w\" sizes=\"auto, 
(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Why does LRKV preserve head diversity (and why that\u2019s non-trivial)?<\/strong><\/h3>\n\n\n\n<p>A common failure mode of aggressive KV sharing is that heads lose the ability to represent distinct interactions. LRKV\u2019s claim is that we can reduce the KV cache <em>without<\/em> degrading diversity, by preserving a shared basis plus low-rank head-specific deviations. We measure head diversity using gauge-invariant similarity metrics derived from bilinear forms:<\/p>\n\n\n\n<p>$$\\mathbf{A}_h = \\mathbf{W}_h^Q (\\mathbf{W}_h^K)^\\top$$<\/p>\n\n\n\n<p>We then compare architectures via pairwise similarity matrices, and measure effective rank via the entropy of the eigenvalue spectrum.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"186\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_similarity_heatmaps_2.5B-1024x186.jpg\" alt=\"\" class=\"wp-image-591\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_similarity_heatmaps_2.5B-1024x186.jpg 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_similarity_heatmaps_2.5B-300x55.jpg 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_similarity_heatmaps_2.5B-768x140.jpg 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_similarity_heatmaps_2.5B-1536x280.jpg 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_similarity_heatmaps_2.5B-2048x373.jpg 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_similarity_heatmaps_2.5B-1320x240.jpg 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Pairwise gauge-invariant head similarity. 
LRKV\u2019s structure is nearly indistinguishable from MHA (off-diagonal similarity remains low), consistent with preserved specialization.<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Effective rank across scales<\/strong><\/h4>\n\n\n\n<p>With sufficient rank (r = 64), LRKV exhibits patterns very similar to Standard MHA, achieving 98.3% effective rank at the 2.5B scale versus 98.9% for Standard MHA. In contrast, MQA achieves only 86.2% and GQA 95.4%.<\/p>\n\n\n\n<p><em>Interpreting uncentered vs. PCA-based effective rank.<br><\/em>The distinction between uncentered and PCA-based effective rank reveals LRKV&#8217;s factorization structure. Uncentered analysis measures total variance including the shared mean direction, the global structure captured by <strong>W<\/strong><sub>shared<\/sub>. PCA-based analysis centers the Gram matrix, isolating variance around this mean and measuring true head independence. The modest 4.8% gap indicates LRKV achieves diversity primarily through genuine per-head specialization rather than merely perturbing a dominant shared structure. 
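A minimal sketch of the metric itself, in both the uncentered and the PCA-style centered variants: effective rank is the exponential of the entropy of the normalized eigenvalue spectrum of the heads' Gram matrix. The matrices below are random stand-ins (one diverse set, one with a strong shared mean direction), chosen only to illustrate the compensation effect discussed in the surrounding text; they are not measurements from the models above.

```python
import numpy as np

def effective_rank(X, center=False):
    """exp(entropy) of the normalized eigenvalue spectrum of X's Gram matrix."""
    if center:
        X = X - X.mean(axis=0, keepdims=True)  # isolate variance around the mean
    G = X @ X.T
    lam = np.clip(np.linalg.eigvalsh(G), 0.0, None)
    p = lam / lam.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
H, D = 8, 256
heads = rng.standard_normal((H, D))            # flattened per-head forms, one row each
shared = rng.standard_normal(D)
mqa_like = 5.0 * shared + 0.5 * rng.standard_normal((H, D))  # strong shared mean

# Diverse heads score high either way; a dominant shared direction depresses
# the uncentered value far more than the centered (PCA-style) one.
print(effective_rank(heads), effective_rank(mqa_like), effective_rank(mqa_like, center=True))
```

Reporting effective rank as a percentage, as above, normalizes this quantity by the head count H.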
For comparison, MQA shows dramatic improvement from uncentered (86.2%) to centered (91.0%), a compensation effect where forced KV sharing creates a strong mean direction, but heads recover diversity by aggressively diversifying query projections around this baseline.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"380\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_effective_rank_combined-1024x380.jpg\" alt=\"\" class=\"wp-image-592\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_effective_rank_combined-1024x380.jpg 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_effective_rank_combined-300x111.jpg 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_effective_rank_combined-768x285.jpg 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_effective_rank_combined-1536x571.jpg 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_effective_rank_combined-2048x761.jpg 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/04\/updated_effective_rank_combined-1320x490.jpg 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">LRKV preserves head diversity across scales. 
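The factorization that underlies these diversity results, per-head keys as a shared full-rank basis plus a low-rank residual (K_h = K_shared + R_h^K (B_h^K)^T), can be sanity-checked numerically. The sketch below uses hypothetical shapes and random tensors and verifies that attention scores can be computed without ever materializing the per-head K_h.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_h, r = 10, 16, 6                     # hypothetical: seq len, head dim, residual rank

K_shared = rng.standard_normal((L, d_h))  # cached once, shared by all heads
R_K = rng.standard_normal((L, r))         # cached per head (r << d_h)
B_K = rng.standard_normal((d_h, r))       # weight parameter, not cached
q = rng.standard_normal(d_h)              # one head's query for one token

# Reference: reconstruct K_h = K_shared + R_K B_K^T, then score.
full = q @ (K_shared + R_K @ B_K.T).T

# Fused form: q K_h^T = q K_shared^T + (q B_K)(R_K)^T, no K_h materialized.
fused = q @ K_shared.T + (q @ B_K) @ R_K.T

assert np.allclose(full, fused)
```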
Gauge-invariant effective rank shows LRKV with sufficient rank matches Standard MHA at 128M, while at 2.5B LRKV achieves 98.3% vs 98.9% for MHA with only 48.4% of the KV cache.<\/figcaption><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 id=\"implementation-notes\" class=\"wp-block-heading\"><strong>Implementation notes<\/strong><\/h2>\n\n\n\n<p><strong>Cache layout<br><\/strong>Store:<\/p>\n\n\n\n<p>$$\\mathbf{K}_{\\text{shared}}, \\mathbf{V}_{\\text{shared}} \\in \\mathbb{R}^{L \\times d_h}, \\quad\\mathbf{R}_h^K, \\mathbf{R}_h^V \\in \\mathbb{R}^{L \\times r} \\ \\ (h = 1, \\dots, H)$$<\/p>\n\n\n\n<p>while <strong>B<sub>h<\/sub><sup>K<\/sup><\/strong>, <strong>B<sub>h<\/sub><sup>V<\/sup><\/strong> are parameters stored in the weights, not the KV cache.<\/p>\n\n\n\n<p><strong>Kernel fusion<br><\/strong>The associativity identities used in LRKV are:<\/p>\n\n\n\n<p>$$\\mathbf{q}_h \\mathbf{K}_h^\\top = \\mathbf{q}_h \\mathbf{K}_{\\text{shared}}^\\top + (\\mathbf{q}_h \\mathbf{B}_h^K)(\\mathbf{R}_h^K)^\\top, \\quad<br>\\mathbf{a}_h \\mathbf{V}_h = \\mathbf{a}_h \\mathbf{V}_{\\text{shared}} + (\\mathbf{a}_h \\mathbf{R}_h^V)(\\mathbf{B}_h^V)^\\top$$<\/p>\n\n\n\n<p>These allow exact attention computation to be embedded in fused kernels (e.g. 
FlashAttention) without reconstructing full <strong>K<sub>h<\/sub><\/strong> and <strong>V<sub>h<\/sub><\/strong> tensors.<\/p>\n\n\n\n<p><strong>Choosing rank r<\/strong><\/p>\n\n\n\n<p>Use the KV-cache memory ratio relative to Standard MHA:<\/p>\n\n\n\n<p>$$\\text{ratio} = \\frac{1}{H} + \\frac{r}{d_h}, \\quad r \\approx (0.36\\text{\u2013}0.43)\\,d_h$$<\/p>\n\n\n\n<p>Here 1\/H is the cost of the shared basis amortized over heads, and r\/d_h is the per-head residual cost. Choose r to match your cache target, then validate quality.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 id=\"takeaways\" class=\"wp-block-heading\"><strong>Takeaways<\/strong><\/h2>\n\n\n\n<p>LRKV is a structural change to attention that:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Targets the true production bottleneck<\/strong> (KV cache memory + bandwidth) <\/li>\n\n\n\n<li>Exploits <strong>structured redundancy across heads<\/strong> <\/li>\n\n\n\n<li>Provides a <strong>smooth knob<\/strong> (rank r) between MQA-like sharing and MHA-like independence<\/li>\n\n\n\n<li>Empirically delivers a strictly better quality\/efficiency frontier than common baselines, in experiments across a wide range of scales.<\/li>\n<\/ol>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>For full details and additional experiments, see the paper: <a href=\"https:\/\/arxiv.org\/abs\/2601.11471\" target=\"_blank\" rel=\"noreferrer noopener\">Low-Rank Key-Value Attention (arXiv)<\/a>.<\/p>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>In autoregressive decoding, each token requires repeatedly reading the KV cache from memory, and this cost scales linearly with sequence length, layers, and head count. 
This post introduces Low-Rank Key-Value (LRKV) attention, a drop-in modification to multi-head attention that reduces KV cache size by 45\u201353% vs standard MHA, while achieving lower test loss across model scales (128M \u2192 6.3B), faster convergence in training steps, and stronger downstream performance after supervised midtraining.<\/p>\n","protected":false},"author":52,"featured_media":170,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"coauthors":[36],"class_list":["post-575","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v24.6 (Yoast SEO v24.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Low-Rank Key Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity - \/research<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Low-Rank Key Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity\" \/>\n<meta property=\"og:description\" content=\"In autoregressive decoding, each token requires repeatedly reading the KV cache from memory, and this cost scales linearly with sequence length, layers, and head count. 
This post introduces Low-Rank Key-Value (LRKV) attention, a drop-in modification to multi-head attention that reduces KV cache size by 45\u201353% vs standard MHA, while achieving lower test loss across model scales (128M \u2192 6.3B), faster convergence in training steps, and stronger downstream performance after supervised midtraining.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/\" \/>\n<meta property=\"og:site_name\" content=\"\/research\" \/>\n<meta property=\"article:published_time\" content=\"2026-04-09T17:04:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-04-09T17:18:26+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1344\" \/>\n\t<meta property=\"og:image:height\" content=\"896\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"James O&#039;Neill\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@intercom\" \/>\n<meta name=\"twitter:site\" content=\"@intercom\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"James O&#039;Neill\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"15 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/\"},\"author\":{\"name\":\"James O'Neill\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/4cf4b5436180effdfbf807d57bc1d1f2\"},\"headline\":\"Low-Rank Key Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity\",\"datePublished\":\"2026-04-09T17:04:00+00:00\",\"dateModified\":\"2026-04-09T17:18:26+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/\"},\"wordCount\":3048,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/\",\"url\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/\",\"name\":\"Low-Rank Key Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity - 
\/research\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\",\"datePublished\":\"2026-04-09T17:04:00+00:00\",\"dateModified\":\"2026-04-09T17:18:26+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/#primaryimage\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-2-1.png\",\"width\":1344,\"height\":896},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/fin.ai\/research\/low-rank-key-value-attention-reducing-kv-cache-memory-and-maintaining-head-diversity\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/fin.ai\/research\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Low-Rank Key Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/fin.ai\/research\/#website\",\"url\":\"https:\/\/fin.ai\/research\/\",\"name\":\"Intercom.ai\",\"description\":\"Insights and blogs from the AI Group building Fin at 
Intercom\",\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/fin.ai\/research\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/fin.ai\/research\/#organization\",\"name\":\"Intercom.ai\",\"url\":\"https:\/\/fin.ai\/research\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"width\":1024,\"height\":1024,\"caption\":\"Intercom.ai\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/intercom\",\"https:\/\/www.linkedin.com\/company\/intercom\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/4cf4b5436180effdfbf807d57bc1d1f2\",\"name\":\"James O'Neill\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/image\/4de2ec3ffa5b66b0526c4a192e3fd54b\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/42137610025c1e5eb9da0c4452038607?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/42137610025c1e5eb9da0c4452038607?s=96&d=mm&r=g\",\"caption\":\"James O'Neill\"},\"description\":\"I am a Staff Machine Learning Researcher currently focusing on large scale distributed pretraining of LLMs and research pertaining to attention, mixture of experts and state space models.\",\"url\":\"https:\/\/fin.ai\/research\/author\/james-oneill\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","_links":{"self":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/575","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/users\/52"}],"replies":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/comments?post=575"}],"version-history":[{"count":0,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/575\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media\/170"}],"wp:attachment":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media?parent=575"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/categories?post=575"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/tags?post=575"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/coauthors?post=575"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}