{"id":653,"date":"2026-05-21T14:47:27","date_gmt":"2026-05-21T14:47:27","guid":{"rendered":"https:\/\/fin.ai\/research\/?p=653"},"modified":"2026-05-21T14:56:49","modified_gmt":"2026-05-21T14:56:49","slug":"what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space","status":"publish","type":"post","link":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/","title":{"rendered":"What Does It Mean for a Model to &#8216;Think&#8217;? Reasoning, Recursion, and the Operator Design Space"},"content":{"rendered":"\n\n\n\n<h2 id=\"the-looped-transformer-moment\" class=\"wp-block-heading\">The Looped Transformer Moment<\/h2>\n\n\n\n<p>Looped transformers are having a moment &#8211; possibly fuelled by\u00a0<a href=\"https:\/\/x.com\/ChrisHayduk\/status\/2042711699413926262\">rumours that Claude Mythos is built on a looped transformer architecture<\/a>! We&#8217;ve since seen open-source projects such as &#8216;<a href=\"https:\/\/x.com\/KyeGomezB\/status\/2045659150340723107\">OpenMythos<\/a>&#8216;, and\u00a0<a href=\"https:\/\/x.com\/issei_sato\/status\/2050186040414175574\">researchers at the University of Tokyo formally comparing<\/a>\u00a0looped transformers to chain-of-thought reasoning. 
<\/p>\n\n\n\n<p>While this is all still speculation, it would be consistent with Mythos\u2019s jump on GraphWalks BFS 265K\u20131M (80% vs Opus 4.6\u2019s 38.7%) \u2014 exactly the kind of recursive graph-traversal task where iterative latent refinement should shine.<\/p>\n\n\n\n<p>Whether there&#8217;s truth to the rumours or not, the underlying idea of Looped Transformers is both interesting and deceptively simple: take a transformer, loop it over its own representations, and watch it reason.<\/p>\n\n\n\n<p>The lineage behind this idea runs from Universal Transformers (<a href=\"https:\/\/arxiv.org\/abs\/1807.03819\">Dehghani et al., 2019<\/a>) through the recent theoretical analysis of Looped Transformers (<a href=\"https:\/\/arxiv.org\/abs\/2502.17416\">Saunshi et al., 2025<\/a>), depth-recurrent architectures (<a href=\"https:\/\/arxiv.org\/abs\/2502.05171\">Geiping et al., 2025<\/a>), and the TRM\/HRM family of models that my co-authors and I have been working on. The headline result that sparked much of this recent buzz was a model with only 7 million parameters, looped repeatedly over latent states, achieving 44.6% on ARC-AGI-1 \u2014 dramatically outperforming comparably-sized models and notably surpassing many commercial LLM APIs on this benchmark.<\/p>\n\n\n\n<p>Our recent\u00a0<a href=\"https:\/\/openreview.net\/pdf?id=AEz0zbLuzg\">ICLR 2026 workshop paper<\/a>\u00a0also sits squarely in this space \u2014 we replaced the Transformer blocks in TRM with Mamba-2 hybrids and found something interesting:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The hybrid model is more likely to find the correct answer than the pure Transformer recursive model, <\/li>\n\n\n\n<li>&#8230;but the pure Transformer is more confident at placing the correct answer in first position. <\/li>\n<\/ul>\n\n\n\n<p>I will get to those results later in this post, but first, I want to step back and think about reasoning with LLMs at a higher level. 
Before we can answer questions like why looped transformers have a reasoning edge, whether we should be looping all LLMs, or what the final form of a reasoning model looks like, we need to understand what reasoning actually\u00a0<em>is<\/em>\u00a0in the context of these models.<\/p>\n\n\n\n<p>What is the source of reasoning? Is it the visible chain-of-thought \u2014 the quality of the intermediate steps, the right decomposition \u2014 or is it the iterative process itself, providing enough forward passes for the model to refine its answer regardless of what the tokens say? Does reasoning come from the trace or from the recursion?<\/p>\n\n\n\n<p><strong>These questions are not academic<\/strong> &#8211; they determine whether the path forward is better prompting, better training data for chain-of-thought, or fundamentally different architectures that think in latent space. Getting the framing right matters as we invest in any particular direction.<\/p>\n\n\n\n<h2 id=\"reasoning-as-conditional-distribution-shaping\" class=\"wp-block-heading\">Reasoning as Conditional Distribution Shaping<\/h2>\n\n\n\n<p>An autoregressive transformer models a conditional distribution over token sequences. Given an input&nbsp;\\(x\\), it assigns probability to a continuation by predicting one token at a time. This is the standard next-token prediction view, and it is correct but incomplete.<\/p>\n\n\n\n<p>The problem is that reasoning is not well captured at the token level. A final answer is a semantic object that can be expressed by many different token strings. \u201c42\u201d, \u201cthe result is 42\u201d, and \u201cforty-two\u201d are all the same answer. The right abstraction is therefore not the token distribution but the induced distribution over semantic outcomes:<\/p>\n\n\n\n<p>\\(P(A = a | x) = \\sum_{y \\in Y(a)} P(y|x)\\)<\/p>\n\n\n\n<p>where&nbsp;\\(Y(a)\\) is the set of all token sequences that express answer \\(a\\). 
Reasoning, from this perspective, is the process of shifting probability mass among semantic outcomes \u2014 not among surface strings.<\/p>\n\n\n\n<p>This reframing matters because a model might assign low probability to a terse answer but high probability to a full explanation that ends in that answer. The model\u2019s reasoning behaviour is better studied as a distribution over possible solution trajectories and final outcomes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Reasoning is not convergence<\/h3>\n\n\n\n<p>It is tempting to describe autoregressive reasoning as repeated self-refinement: the model keeps processing the same problem until the answer distribution converges. But this is not quite right.<\/p>\n\n\n\n<p>Every generated token becomes part of the next condition. After the model emits a token, the context changes, and the next distribution is conditioned on a new state. Autoregressive reasoning is better described as&nbsp;<strong>path-dependent conditional refinement<\/strong>&nbsp;rather than convergence. The model is not \u201cgetting more confident by re-reading the prompt.\u201d It is constructing a new context at each step, and its predictions are shaped by the path it has taken.<\/p>\n\n\n\n<p>This process may sharpen the answer distribution: the entropy of the answer given the reasoning trace is often lower than the entropy given only the prompt. But entropy reduction is not the same as correctness. If the model generates an incorrect intermediate step, later predictions condition on that error and may become confidently wrong.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Three levels of computation<\/h3>\n\n\n\n<p>A transformer can reason at several different levels:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Within a forward pass<\/strong>: a single pass through the layer stack can resolve references, classify intent, retrieve memorised facts, or perform simple inferences. 
Some computations complete entirely inside this one pass.<\/li>\n\n\n\n<li><strong>Across tokens (chain-of-thought)<\/strong>: when the model generates intermediate tokens before answering, each new token triggers another forward pass conditioned on an expanded context. This buys additional serial computation depth.<\/li>\n\n\n\n<li><strong>Context-conditioned<\/strong>: the user may provide facts, examples, or constraints. Conditioning on genuinely new information can reduce uncertainty in ways that self-generated tokens cannot.<\/li>\n<\/ul>\n\n\n\n<p>The key distinction is that self-generated reasoning usually adds computation and structure, while externally provided context can add genuinely new information. Both sharpen the answer distribution, but generally through different mechanisms.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The latent-variable view<\/h3>\n\n\n\n<p>We think the most useful abstraction is to treat reasoning as inference over latent solution paths. Let\u00a0\\(x\\) be the problem,\u00a0\\(z\\) a latent reasoning process (a proof, a plan, an algorithm), and\u00a0\\(a\\)\u00a0the final answer. Then:<\/p>\n\n\n\n<p>\\(P(a|x) = \\sum_{z} P(a|x,z)P(z|x)\\)<\/p>\n\n\n\n<p>A direct-answer model tries to marginalise over all possible reasoning paths implicitly \u2014 in a single forward pass. A chain-of-thought model <em>usually<\/em> generates an explicit approximation to one such path and then predicts the answer conditioned on that.<\/p>\n\n\n\n<p>This explains both why chain-of-thought helps and why it can fail. Estimating\u00a0\\(P(a|x,r)\\)\u00a0for a specific reasoning trace \\(r\\)\u00a0is often easier than estimating\u00a0\\(P(a|x)\\)\u00a0directly, because\u00a0\\(r\\) introduces intermediate variables and constraints. 
But if the sampled path\u00a0\\(r\\)\u00a0is wrong or biased, conditioning on it pushes the model toward an incorrect answer \u2014 it can become more confident because the context is more specific, rather than because it is more true.<\/p>\n\n\n\n<p>As an enhancement, self-consistency methods exploit this view: sample multiple candidate paths and aggregate over the answers. In probabilistic terms, this is a crude approximation to marginalising over multiple possible\u00a0\\(z\\)&#8217;s rather than trusting one sampled trajectory.<\/p>\n\n\n\n<h2 id=\"what-is-chain-of-thought-actually-for\" class=\"wp-block-heading\">What Is Chain-of-Thought Actually For?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">CoT as additional computation<\/h3>\n\n\n\n<p>The most obvious benefit of chain-of-thought is computational. A decoder-only transformer has bounded computation per generated token \u2014 one forward pass through the layer stack. If the model answers immediately, it must map the prompt to the answer in that single pass. But if it generates\u00a0\\(T\\) intermediate tokens first, it gets\u00a0\\(T\\) additional forward passes, each conditioned on an expanded context. Chain-of-thought thus buys serial depth at inference time.<\/p>\n\n\n\n<p><a href=\"https:\/\/arxiv.org\/abs\/2310.07923\">Merrill and Sabharwal (2024)<\/a>\u00a0made this precise in \u201cThe Expressive Power of Transformers with Chain of Thought.\u201d They proved that allowing a transformer to generate intermediate tokens fundamentally extends its computational power, with the expressiveness scaling with the number of intermediate steps. A polynomial number of CoT steps enables transformers to solve exactly the class of polynomial-time problems \u2014 something a bounded-depth transformer without intermediate tokens cannot do. 
The implication is clear: even if the intermediate tokens contained no semantic information at all, the extra forward passes alone could expand what the model can compute.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Does the reasoning trace actually matter?<\/h3>\n\n\n\n<p>If computational depth is the key benefit, a natural question follows: does the&nbsp;<em>content<\/em>&nbsp;of the reasoning trace play any role, or is it just a vehicle for additional forward passes?<\/p>\n\n\n\n<p>The \u201cdot by dot\u201d result makes this sharper.\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2404.15758\">Pfau, Merrill, and Bowman (2024)<\/a>\u00a0showed that meaningless filler tokens \u2014 repeated dots \u2014 can help transformers solve certain algorithmic tasks that they cannot solve without intermediate tokens. Their conclusion is that additional tokens can provide computational benefits independent of the semantic content of those tokens. The content was empty, but the computation was useful.<\/p>\n\n\n\n<p>There is further evidence from the faithfulness literature.\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2307.13702\">Lanham et al. (2023)<\/a>\u00a0systematically measured how much models actually condition on their own chain-of-thought when predicting final answers. Their findings are revealing: models show large variation across tasks in how strongly they rely on their CoT, sometimes heavily conditioning on it, sometimes largely ignoring it. Most telling, as models become larger and more capable, they tend to produce\u00a0<em>less<\/em>\u00a0faithful reasoning \u2014 suggesting that stronger models may generate explanatory text while following a predetermined answer pathway. 
Such a chain-of-thought begins to look more like a post-hoc rationalisation than a causal driver of the answer.<\/p>\n\n\n\n<p>Taken together, these results build a compelling case that the internal recursion \u2014 the additional forward passes \u2014 contributes significantly to reasoning competence, potentially more so than the semantic content of the reasoning trace itself.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">But that is not the full picture<\/h3>\n\n\n\n<p>However, it would be premature to conclude that the reasoning trace is merely a vehicle for computation. There are important confounders that complicate this narrative.<\/p>\n\n\n\n<p><strong>The scratchpad as external memory.<\/strong>&nbsp;A transformer\u2019s hidden activations during one forward pass are transient. Without a scratchpad, the model must compress all relevant intermediate computations into the current residual stream. With a scratchpad, it can write intermediate results into the token sequence, and those tokens become part of the context \u2014 in implementation terms, they produce key-value cache entries that later tokens can attend to. The scratchpad creates an addressable external memory.<\/p>\n\n\n\n<p>Consider arithmetic:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">37 + 48<br>7 + 8 = 15, write 5 carry 1.<br>3 + 4 + 1 = 8.<br>Answer: 85.<\/pre>\n\n\n\n<p>The token \u201ccarry 1\u201d is not just an explanation. It is a stored variable. Later computation attends to it. By pinning intermediate variables into token space, the model does not need to maintain all intermediate state implicitly in a single compressed hidden representation. It can externalise state into the context and retrieve it later through attention. 
This is a genuine benefit of the&nbsp;<em>content<\/em>&nbsp;of the trace, not just its length.<\/p>\n\n\n\n<p><strong>Commitment and discretisation.<\/strong>&nbsp;There is another subtle role: the scratchpad turns soft latent possibilities into explicit symbolic commitments. Before writing an intermediate result, the model may have a distributed representation containing several possible latent paths. Once it writes \u201cThe carry is 1,\u201d future predictions are conditioned on that discrete commitment. The model has selected a path.<\/p>\n\n\n\n<p>This can help \u2014 it reduces the space of downstream possibilities and makes the remaining computation simpler. But it can also harm. If the model commits to the wrong intermediate result, subsequent tokens build a coherent but incorrect continuation. The reasoning thus becomes confidently wrong because the commitment was wrong.<\/p>\n\n\n\n<p>This trade-off between commitment and flexibility points to something important. Research on latent reasoning suggests that continuous latent representations can maintain multiple possible reasoning paths simultaneously rather than committing to one.\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2412.06769\">Hao et al. (2024)<\/a>\u00a0showed in COCONUT that continuous thought vectors can encode multiple alternative next steps at once, enabling the model to perform what is essentially\u00a0<strong>breadth-first search<\/strong>\u00a0over the solution space \u2014 exploring many possibilities in parallel rather than committing to a single depth-first path as explicit CoT does. 
This idea has been further developed at ICLR 2026 by work on\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2505.23648\">continuous chain-of-thought (\\(CoT^2\\))<\/a>, which establishes theoretically how continuously-valued tokens enable models to track multiple discrete traces in parallel within a single inference pass.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Scratchpad tokens are not necessarily faithful reports of reasoning, but they can be active components of reasoning.<\/p>\n<\/blockquote>\n\n\n\n<p>So the picture is genuinely complex. If the reasoning trace is neither just a vehicle for forward passes nor a faithful transcript of internal computation, then the truth sits somewhere uncomfortably in between. Understanding exactly where it sits may be crucial for designing better reasoning architectures, and it is one of our future research directions.<\/p>\n\n\n\n<p>In the meantime, it is useful to think of explicit CoT and latent reasoning as two ends of a spectrum.<\/p>\n\n\n\n<p>On one end, chain-of-thought externalises everything: intermediate steps are visible tokens that provide computation, memory, and discrete commitments \u2014 but at the cost of path commitment and potentially unfaithful traces. On the other end, latent reasoning internalises everything: iterative refinement happens entirely in hidden representation space, with no visible tokens, no discrete commitments, and the ability to maintain multiple reasoning paths simultaneously.<\/p>\n\n\n\n<p>Neither extreme is necessarily the right answer. The optimal form of reasoning may sit somewhere in between \u2014 perhaps combining the computational depth of latent iteration with selective externalisation of key intermediate state. Our goal is to work out where that balance lies.<\/p>\n\n\n\n<p>For now, studying the latent extreme is instructive. 
Looped transformers sit at that latent extreme.<\/p>\n\n\n\n<h2 id=\"from-scratchpad-to-latent-loop\" class=\"wp-block-heading\">From Scratchpad to Latent Loop<\/h2>\n\n\n\n<p>The Tiny Recursive Model (TRM) is a concrete instantiation of this idea. Instead of generating explicit reasoning tokens, TRM maintains two latent state vectors (\\(z_H\\) and \\(z_L\\)) and iterates them through a learned update function. The \u201cscratchpad\u201d is the hidden state itself, updated through 3 outer cycles and 4\u20136 inner cycles. No tokens are emitted during reasoning \u2014 the refinement happens entirely in representation space.<\/p>\n\n\n\n<p>The scale context is striking. TRM has roughly 7 <em>million<\/em> parameters. For reference, GPT-2 Small has 117 million. This is not a large model thinking hard; it is a tiny model thinking repeatedly. And yet it outperforms models orders of magnitude larger on ARC-AGI, a benchmark specifically designed to test abstract reasoning and generalisation.<\/p>\n\n\n\n<p>The recursive process \u2014 not scale \u2014 appears to be the key ingredient.<\/p>\n\n\n\n<p>This raises a natural question. If the power comes from the loop, does the specific operator being looped matter? TRM uses standard Transformer blocks: attention layers for cross-position communication, MLPs for per-position computation. But attention is not the only option.<\/p>\n\n\n\n<p>Mamba-2, a state space model, processes sequences through a recurrent state update:<\/p>\n\n\n\n<p>\\(h_t = a_t h_{t-1} + B_t x_t\\)<\/p>\n\n\n\n<p>where the parameters \\(a_t\\) and \\(B_t\\) are input-dependent, allowing the model to selectively propagate or forget information. This recurrence is itself a form of iterative refinement. There is a conceptual elegance to putting an inherently iterative operator inside an iterative loop \u2014 recurrence within recurrence.<\/p>\n\n\n\n<p>The practical question is whether Mamba-2 can enter the design space of operators for recursive reasoning without degrading capability. 
We tested this directly.<\/p>\n\n\n\n<h2 id=\"swapping-the-operator-results-on-arc-agi\" class=\"wp-block-heading\">Swapping the Operator: Results on ARC-AGI<\/h2>\n\n\n\n<p>In our\u00a0<a href=\"https:\/\/openreview.net\/pdf?id=AEz0zbLuzg\">workshop paper<\/a>, published at the Latent &amp; Implicit Thinking Workshop at ICLR 2026, we replaced the Transformer blocks in TRM with a Mamba-2 + Attention hybrid operator. We maintained parameter parity as the key constraint. The original TRM-attn has 6.83M parameters, and our hybrid TR-mamba2attn has 6.86M. We kept the same recursion schedule, same state representation, and same evaluation protocol.<\/p>\n\n\n\n<p>The headline result on ARC-AGI-1 is as follows:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-center\" data-align=\"center\">K<\/th><th class=\"has-text-align-center\" data-align=\"center\">TRM-attn<\/th><th class=\"has-text-align-center\" data-align=\"center\">TR-mamba2attn<\/th><th class=\"has-text-align-center\" data-align=\"center\">Delta<\/th><\/tr><\/thead><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\">1<\/td><td class=\"has-text-align-center\" data-align=\"center\">40.75<\/td><td class=\"has-text-align-center\" data-align=\"center\">40.50<\/td><td class=\"has-text-align-center\" data-align=\"center\">-0.25<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">2<\/td><td class=\"has-text-align-center\" data-align=\"center\">43.88<\/td><td class=\"has-text-align-center\" data-align=\"center\">45.88<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>+2.00<\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">5<\/td><td class=\"has-text-align-center\" data-align=\"center\">49.25<\/td><td class=\"has-text-align-center\" data-align=\"center\">51.88<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>+2.63<\/strong><\/td><\/tr><tr><td 
class=\"has-text-align-center\" data-align=\"center\">10<\/td><td class=\"has-text-align-center\" data-align=\"center\">52.13<\/td><td class=\"has-text-align-center\" data-align=\"center\">54.50<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>+2.37<\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">100<\/td><td class=\"has-text-align-center\" data-align=\"center\">60.50<\/td><td class=\"has-text-align-center\" data-align=\"center\">65.25<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>+4.75<\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">1000<\/td><td class=\"has-text-align-center\" data-align=\"center\">65.50<\/td><td class=\"has-text-align-center\" data-align=\"center\">69.75<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>+4.25<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The hybrid improves pass@2 (the official ARC-AGI metric) by 2.0 percentage points, and the advantage grows at higher K values, peaking at +4.75 percentage points at pass@100. 
Meanwhile, pass@1 is near-parity.<\/p>\n\n\n\n<p>This pattern is consistent throughout training \u2014 it is not a late-training artefact:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"681\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-1-1024x681.png\" alt=\"\" class=\"wp-image-658\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-1-1024x681.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-1-300x200.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-1-768x511.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-1-1536x1022.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-1-2048x1363.png 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-1-1320x878.png 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Coverage vs selection<\/h3>\n\n\n\n<p>The pass@K pattern also reveals a potential\u00a0<strong>coverage vs selection<\/strong>\u00a0trade-off. 
The hybrid generates the correct solution within its candidate set more often (better coverage), but both models rank the correct solution first at similar rates (similar selection).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"737\" src=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-2-1024x737.png\" alt=\"\" class=\"wp-image-659\" srcset=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-2-1024x737.png 1024w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-2-300x216.png 300w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-2-768x553.png 768w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-2-1536x1106.png 1536w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-2-2048x1475.png 2048w, https:\/\/fin.ai\/research\/wp-content\/uploads\/2026\/05\/image-2-1320x951.png 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The hybrid generates 27% more unique candidates per ARC-AGI puzzle (339.5 vs 266.6) with higher vote entropy (5.39 vs 4.56 bits). Conversely, TRM-attn concentrates 41.1% of votes on its top-1 candidate (vs 32.9% for the hybrid) with a larger top-1 margin (32.3% vs 24.0%).<\/p>\n\n\n\n<p>Mamba-2\u2019s sequential processing appears to contribute different solution trajectories during augmentation, increasing the diversity of the candidate pool without degrading the quality of the best prediction. The hybrid appears to explore more, while the Transformer commits.<\/p>\n\n\n\n<p>This would map directly onto the conceptual framework: the hybrid is perhaps stronger at\u00a0<strong>search over solution paths<\/strong>, while the pure Transformer excels at\u00a0<strong>discrete commitment<\/strong>.<\/p>\n\n\n\n<p>The results on other benchmarks add nuance. 
On Sudoku (small 9&#215;9 grids), dense all-to-all mixing via MLP-t blocks outperformed both attention and hybrid models, suggesting constraint satisfaction benefits from a different communication pattern. On Maze (large 30&#215;30 grids), the hybrid achieved 80.6% vs 60.8% for TRM-attn, though training instability makes these results preliminary. It seems that different operators suit different tasks.<\/p>\n\n\n\n<h2 id=\"what-this-tells-us\" class=\"wp-block-heading\">What This Tells Us<\/h2>\n\n\n\n<p>The coverage vs selection trade-off is not just a quirk of one benchmark. It reflects something deeper about how different operators shape the distribution over latent reasoning paths.<\/p>\n\n\n\n<p>Return to the latent-variable view: \\(P(a|x) = \\sum_z P(a|x,z)P(z|x)\\). Different operators induce different distributions over&nbsp;\\(z\\). A broader&nbsp;\\(P(z|x)\\)&nbsp;means more diverse reasoning paths, which helps when you can marginalise over them \u2014 self-consistency, best-of-N, pass@K evaluation. A sharper&nbsp;\\(P(z|x)\\)&nbsp;helps when you need to commit to one path \u2014 greedy decoding, pass@1.<\/p>\n\n\n\n<p>The practical implication might be that the best operator for recursive reasoning may depend on your inference-time compute budget. If you can sample many candidates, diversity wins. If you get one shot, decisiveness wins.<\/p>\n\n\n\n<p>This is reinforced by a difficulty-stratified analysis from the paper. On hard puzzles (where neither model reliably produces the correct answer), the hybrid gains +4.9 percentage points at pass@5 over TRM-attn \u2014 its flatter vote distribution avoids concentrating on a single dominant-but-wrong candidate. On easy puzzles, the pattern reverses: TRM-attn\u2019s sharper concentration more reliably promotes an already-dominant correct answer to the top rank.<\/p>\n\n\n\n<p>Perhaps most telling is that at pass@5, the two models solve partially disjoint puzzle sets \u2014 31 hybrid-only vs 23 TRM-attn-only. 
Different mixing strategies contribute complementary strengths. On our current understanding, the operators are not interchangeable; they appear to be complementary.<\/p>\n\n\n\n<p>For the full experimental details, architecture diagrams, and analysis, see the paper:&nbsp;<a href=\"https:\/\/openreview.net\/pdf?id=AEz0zbLuzg\">Tiny Recursive Reasoning with Mamba-2 Attention Hybrid<\/a>, published at the Latent &amp; Implicit Thinking Workshop at ICLR 2026.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Looped Transformers, thoughts on how LLMs Reason, and an overview of a recent paper we wrote.<\/p>\n","protected":false},"author":55,"featured_media":106,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"coauthors":[39],"class_list":["post-653","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v24.6 (Yoast SEO v24.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>What Does It Mean for a Model to &#039;Think&#039;? Reasoning, Recursion, and the Operator Design Space - \/research<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"What Does It Mean for a Model to &#039;Think&#039;? 
Reasoning, Recursion, and the Operator Design Space\" \/>\n<meta property=\"og:description\" content=\"Looped Transformers, thoughts on how LLMs Reason, and an overview of a recent paper we wrote.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/\" \/>\n<meta property=\"og:site_name\" content=\"\/research\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-21T14:47:27+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-21T14:56:49+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1024x683.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"683\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Wenlong Wang\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@intercom\" \/>\n<meta name=\"twitter:site\" content=\"@intercom\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Wenlong Wang\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"14 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/\"},\"author\":{\"name\":\"Wenlong Wang\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/6ddc8ae35e0073319240fb1b6b5eeaa4\"},\"headline\":\"What Does It Mean for a Model to &#8216;Think&#8217;? Reasoning, Recursion, and the Operator Design Space\",\"datePublished\":\"2026-05-21T14:47:27+00:00\",\"dateModified\":\"2026-05-21T14:56:49+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/\"},\"wordCount\":2986,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16.png\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/\",\"url\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/\",\"name\":\"What Does It Mean for a Model to 'Think'? 
Reasoning, Recursion, and the Operator Design Space - \/research\",\"isPartOf\":{\"@id\":\"https:\/\/fin.ai\/research\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16.png\",\"datePublished\":\"2026-05-21T14:47:27+00:00\",\"dateModified\":\"2026-05-21T14:56:49+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#primaryimage\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16.png\",\"width\":1920,\"height\":1280},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/fin.ai\/research\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"What Does It Mean for a Model to &#8216;Think&#8217;? 
Reasoning, Recursion, and the Operator Design Space\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/fin.ai\/research\/#website\",\"url\":\"https:\/\/fin.ai\/research\/\",\"name\":\"Intercom.ai\",\"description\":\"Insights and blogs from the AI Group building Fin\",\"publisher\":{\"@id\":\"https:\/\/fin.ai\/research\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/fin.ai\/research\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/fin.ai\/research\/#organization\",\"name\":\"Intercom.ai\",\"url\":\"https:\/\/fin.ai\/research\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"contentUrl\":\"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png\",\"width\":1024,\"height\":1024,\"caption\":\"Intercom.ai\"},\"image\":{\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/intercom\",\"https:\/\/www.linkedin.com\/company\/intercom\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/6ddc8ae35e0073319240fb1b6b5eeaa4\",\"name\":\"Wenlong Wang\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/fin.ai\/research\/#\/schema\/person\/image\/a5223ba1fe3710938f6ea3c260c324bb\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/58e608c7a84d8790975bc7b5266f199d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/58e608c7a84d8790975bc7b5266f199d?s=96&d=mm&r=g\",\"caption\":\"Wenlong Wang\"},\"description\":\"is a machine learning researcher whose current research focuses on latent reasoning and looped 
transformers.\",\"url\":\"https:\/\/fin.ai\/research\/author\/wenlong-wang\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"What Does It Mean for a Model to 'Think'? Reasoning, Recursion, and the Operator Design Space - \/research","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/","og_locale":"en_US","og_type":"article","og_title":"What Does It Mean for a Model to 'Think'? Reasoning, Recursion, and the Operator Design Space","og_description":"Looped Transformers, thoughts on how LLMs Reason, and an overview of a recent paper we wrote.","og_url":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/","og_site_name":"\/research","article_published_time":"2026-05-21T14:47:27+00:00","article_modified_time":"2026-05-21T14:56:49+00:00","og_image":[{"width":1024,"height":683,"url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16-1024x683.png","type":"image\/png"}],"author":"Wenlong Wang","twitter_card":"summary_large_image","twitter_creator":"@intercom","twitter_site":"@intercom","twitter_misc":{"Written by":"Wenlong Wang","Est. reading time":"14 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#article","isPartOf":{"@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/"},"author":{"name":"Wenlong Wang","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/6ddc8ae35e0073319240fb1b6b5eeaa4"},"headline":"What Does It Mean for a Model to &#8216;Think&#8217;? 
Reasoning, Recursion, and the Operator Design Space","datePublished":"2026-05-21T14:47:27+00:00","dateModified":"2026-05-21T14:56:49+00:00","mainEntityOfPage":{"@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/"},"wordCount":2986,"commentCount":0,"publisher":{"@id":"https:\/\/fin.ai\/research\/#organization"},"image":{"@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#primaryimage"},"thumbnailUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16.png","inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/","url":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/","name":"What Does It Mean for a Model to 'Think'? 
Reasoning, Recursion, and the Operator Design Space - \/research","isPartOf":{"@id":"https:\/\/fin.ai\/research\/#website"},"primaryImageOfPage":{"@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#primaryimage"},"image":{"@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#primaryimage"},"thumbnailUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16.png","datePublished":"2026-05-21T14:47:27+00:00","dateModified":"2026-05-21T14:56:49+00:00","breadcrumb":{"@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#primaryimage","url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16.png","contentUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/image-16.png","width":1920,"height":1280},{"@type":"BreadcrumbList","@id":"https:\/\/fin.ai\/research\/what-does-it-mean-for-a-model-to-think-reasoning-recursion-and-the-operator-design-space\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/fin.ai\/research\/"},{"@type":"ListItem","position":2,"name":"What Does It Mean for a Model to &#8216;Think&#8217;? 
Reasoning, Recursion, and the Operator Design Space"}]},{"@type":"WebSite","@id":"https:\/\/fin.ai\/research\/#website","url":"https:\/\/fin.ai\/research\/","name":"Intercom.ai","description":"Insights and blogs from the AI Group building Fin","publisher":{"@id":"https:\/\/fin.ai\/research\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/fin.ai\/research\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/fin.ai\/research\/#organization","name":"Intercom.ai","url":"https:\/\/fin.ai\/research\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/","url":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png","contentUrl":"https:\/\/fin.ai\/research\/wp-content\/uploads\/2025\/03\/favicon.png","width":1024,"height":1024,"caption":"Intercom.ai"},"image":{"@id":"https:\/\/fin.ai\/research\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/intercom","https:\/\/www.linkedin.com\/company\/intercom"]},{"@type":"Person","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/6ddc8ae35e0073319240fb1b6b5eeaa4","name":"Wenlong Wang","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/fin.ai\/research\/#\/schema\/person\/image\/a5223ba1fe3710938f6ea3c260c324bb","url":"https:\/\/secure.gravatar.com\/avatar\/58e608c7a84d8790975bc7b5266f199d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/58e608c7a84d8790975bc7b5266f199d?s=96&d=mm&r=g","caption":"Wenlong Wang"},"description":"is a machine learning researcher whose current research focuses on latent reasoning and looped 
transformers.","url":"https:\/\/fin.ai\/research\/author\/wenlong-wang\/"}]}},"_links":{"self":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/653","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/users\/55"}],"replies":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/comments?post=653"}],"version-history":[{"count":0,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/posts\/653\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media\/106"}],"wp:attachment":[{"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/media?parent=653"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/categories?post=653"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/tags?post=653"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/fin.ai\/research\/wp-json\/wp\/v2\/coauthors?post=653"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}