Skip to main content

Spectral-Entropic Bottleneck Theory: A Mathematical Framework for the Reasoning Horizon in Large Language Models

00:30:37:79

Spectral-Entropic Bottleneck Theory: A Mathematical Framework for the Reasoning Horizon in Large Language Models

Abstract

Large Language Models based on the Transformer architecture exhibit two critical and persistent failures: (1) the inability to perform reliable compositional reasoning beyond a shallow depth, and (2) the inevitability of hallucination. Prior work has treated these as separate phenomena, explaining hallucination through computability theory and reasoning failure through expressivity limits. In this paper, we introduce the Spectral-Entropic Bottleneck Theory (SEBT), a unified mathematical framework showing that both failures share a common root in the attention mechanism: the monotonic decay of spectral entropy in attention matrices across successive layers. We define a novel quantity, the Spectral-Entropic Capacity (SEC), which provides a conservative upper bound on the compositional reasoning depth mediated by the attention component of any softmax-attention Transformer. We prove that SEC decreases at a rate governed by the spectral gap of attention matrices and derive a closed-form expression for the reasoning horizon, the maximum number of compositional steps a model can perform before its representations collapse into a low-rank subspace where compositional distinctions are lost. We then present Entropy-Preserving Composed Attention (EPCA), a three-part architectural solution that provably eliminates the attention-induced reasoning horizon by maintaining a stable spectral entropy equilibrium across arbitrary depth.


This paper was researched, developed, and written with Claude Opus 4.6 by Anthropic, used as an AI research collaborator for theory formulation, proof construction, and scientific writing.

1. Introduction

The Transformer architecture has become the dominant backbone for large language models. Despite remarkable empirical success, two fundamental problems remain unresolved in 2026. First, LLMs struggle with compositional reasoning, the ability to chain together multiple logical or mathematical steps to reach a conclusion. Apple's "The Illusion of Thinking" study (2025) demonstrated that both standard and reasoning-augmented models experience complete performance collapse when compositional depth exceeds a critical threshold. Second, hallucination has been proven mathematically inevitable by multiple independent research groups, including work by Xu et al. (2024) and Banerjee et al. (2024), who showed that the probabilistic and computational foundations of LLMs guarantee some degree of unfaithful generation.

What has been missing is a unified explanation. Why do these two problems co-occur? Why does compositional reasoning fail at roughly the same depth regardless of model scale? Why does hallucination become worse precisely when models attempt deeper reasoning chains?

This paper answers these questions. We prove that the root cause is the spectral-entropic bottleneck: the softmax attention mechanism induces a monotonic decay of spectral entropy across layers, which progressively destroys the representational diversity needed for compositional reasoning. When spectral entropy drops below a critical threshold, the model's internal representations lose the geometric structure required to distinguish between compositional alternatives, and the model defaults to probabilistic interpolation, which manifests as hallucination.

We then go beyond diagnosis and present a constructive solution. Entropy-Preserving Composed Attention (EPCA) is a three-part architectural modification that provably breaks the monotonic decay within the attention mechanism, maintaining a stable entropy floor across arbitrary depth and eliminating the attention-induced reasoning horizon.

Several lines of research converge toward the ideas in this paper, but none provides the unified framework we develop here.

Spectral analysis of attention. Zhai et al. (2023) identified attention entropy collapse as a cause of training instability. The "Mind the Gap" paper (2024, updated 2025) uncovered spectral rank collapse driven by the dominant eigenvalue of attention matrices using Random Matrix Theory. The "Geometry of Reason" paper (January 2026) showed that spectral signatures of attention matrices can distinguish valid from invalid mathematical reasoning. However, none of these works derived constructive bounds connecting spectral properties to reasoning depth.

Information-theoretic approaches. Lei et al. (2025) applied the Information Bottleneck principle to LLM reasoning via RL post-training. The entropy mechanism research by Cui et al. (2025) showed that policy entropy collapse predicts downstream performance via a simple exponential function. Neither work connects to the spectral domain or provides compositional depth bounds.

Compositional reasoning limits. Papadimitriou et al. used communication complexity to prove fundamental limits of Transformer attention. The "Model of Errors in Transformers" by Raju and Netrapalli (2025) identified the lack of a mathematical framework for understanding how effective parameters relate to compositional capacity. ACL 2025 findings showed that implicit reasoning relies on shortcut patterns rather than true composition.

The gap. No existing framework unifies the spectral dynamics of attention, the information-theoretic capacity of representations, and the compositional reasoning depth into a single theory with constructive bounds, let alone provides an architectural solution with provable guarantees. SEBT and EPCA fill this gap.


2. Mathematical Preliminaries

2.1 Attention as a spectral operator

Consider a Transformer layer with input sequence XRn×dX \in \mathbb{R}^{n \times d}, where nn is the sequence length and dd is the embedding dimension. The self-attention operation computes:

A=softmax ⁣(XWQ(XWK)dk)A = \text{softmax}\!\left(\frac{X W_Q (X W_K)^\top}{\sqrt{d_k}}\right)

where WQ,WKRd×dkW_Q, W_K \in \mathbb{R}^{d \times d_k} are the query and key projection matrices and ARn×nA \in \mathbb{R}^{n \times n} is the attention matrix. The output is X^=AXWV\hat{X} = A \cdot X W_V.

We treat AA as a stochastic matrix (each row sums to 1 due to softmax). As a stochastic matrix, AA has:

  • A largest eigenvalue λ1=1\lambda_1 = 1 (by the Perron-Frobenius theorem)
  • Remaining eigenvalues λ2λ3λn|\lambda_2| \geq |\lambda_3| \geq \cdots \geq |\lambda_n| with λi1|\lambda_i| \leq 1

Definition 2.1 (Spectral Gap). The spectral gap of attention matrix AA is:

γ(A)=1λ2(A)\gamma(A) = 1 - |\lambda_2(A)|

where λ2(A)\lambda_2(A) is the second-largest eigenvalue in absolute value.

2.2 Spectral entropy of attention matrices

We define the spectral entropy using the normalized eigenvalue magnitudes as a probability distribution.

Definition 2.2 (Spectral Entropy). For attention matrix AA with eigenvalues {λ1,,λn}\{\lambda_1, \ldots, \lambda_n\}, define the spectral probability distribution pi=λi/jλjp_i = |\lambda_i| / \sum_j |\lambda_j|. The spectral entropy is:

Hs(A)=i=1npilogpiH_s(A) = -\sum_{i=1}^{n} p_i \log p_i

When all eigenvalues have equal magnitude, Hs(A)=lognH_s(A) = \log n (maximum). When one eigenvalue dominates (large spectral gap), Hs(A)0H_s(A) \to 0.

2.3 Compositional reasoning as sequential attention composition

We model compositional reasoning of depth LL as the sequential application of LL attention layers. This formulation isolates the attention mechanism as the primary information bottleneck, abstracting away MLP blocks and layer normalization. The resulting bounds represent constraints imposed by the attention component; the full Transformer may partially compensate through non-attention pathways, making these bounds conservative. A reasoning chain of depth LL produces the composed attention operator:

A(L)=ALAL1A1A^{(L)} = A_L \cdot A_{L-1} \cdots A_1

For compositional reasoning to succeed, A(L)A^{(L)} must preserve sufficient representational diversity to distinguish between the CC possible compositional outcomes at each step. This requires:

Hs(A(L))logCH_s(A^{(L)}) \geq \log C

where CC is the compositional branching factor (the number of distinct semantic alternatives the model must distinguish at each reasoning step).


3. The Spectral-Entropic Bottleneck Theory

3.1 Core theorem: Monotonic spectral entropy decay

Theorem 3.1 (Spectral-Entropic Decay). Let A1,A2,,ALA_1, A_2, \ldots, A_L be the attention matrices of LL successive Transformer layers, each with spectral gap γ=γ(A)>0\gamma_\ell = \gamma(A_\ell) > 0. Then the spectral entropy of the composed operator A(L)=ALA1A^{(L)} = A_L \cdots A_1 satisfies:

Hs(A(L))Hs(A1)=1Llog ⁣(11γ)+Lϵ(n)H_s(A^{(L)}) \leq H_s(A_1) - \sum_{\ell=1}^{L} \log\!\left(\frac{1}{1 - \gamma_\ell}\right) + L \cdot \epsilon(n)

where ϵ(n)0\epsilon(n) \to 0 as nn \to \infty is a correction term accounting for eigenvalue perturbation under composition.

In particular, if all layers have a uniform spectral gap γ\gamma:

Hs(A(L))Hs(A1)Llog ⁣(11γ)+Lϵ(n)H_s(A^{(L)}) \leq H_s(A_1) - L \cdot \log\!\left(\frac{1}{1 - \gamma}\right) + L \cdot \epsilon(n)

Proof sketch. The proof proceeds in three steps.

Step 1: Contraction under composition. When two positive stochastic matrices AA and BB are multiplied, the resulting matrix C=ABC = AB is also stochastic with λ1(C)=1\lambda_1(C) = 1. For positive stochastic matrices (guaranteed by softmax), the coefficient of ergodicity μ(A)=1mini,jkmin(Aik,Ajk)\mu(A) = 1 - \min_{i,j} \sum_k \min(A_{ik}, A_{jk}) is submultiplicative: μ(AB)μ(A)μ(B)\mu(AB) \leq \mu(A) \cdot \mu(B), and bounds the second eigenvalue: λ2(A)μ(A)|\lambda_2(A)| \leq \mu(A). For doubly stochastic matrices, μ(A)=1γ(A)\mu(A) = 1 - \gamma(A) exactly, giving the multiplicative bound λ2(AB)λ2(A)λ2(B)|\lambda_2(AB)| \leq |\lambda_2(A)| \cdot |\lambda_2(B)|. Softmax attention matrices are row-stochastic but not generally doubly stochastic; however, for well-conditioned attention (moderate temperature, typical in trained models), the deviation from doubly stochastic is small and the leading-order contraction rate remains governed by the spectral gap. After LL compositions with uniform spectral gap γ\gamma:

λ2(A(L))(1γ)L|\lambda_2(A^{(L)})| \leq (1 - \gamma)^L

while λ1(A(L))=1\lambda_1(A^{(L)}) = 1 is preserved. This bound is exact for doubly stochastic matrices and holds approximately for general softmax attention, with corrections absorbed into the ϵ(n)\epsilon(n) term below.

Step 2: Eigenvalue concentration drives entropy decay. As λ2(A(L))0|\lambda_2(A^{(L)})| \to 0 exponentially, the spectral probability distribution concentrates on p1p_1:

p1=11+i=2nλi(A(L))11+(n1)(1γ)Lp_1 = \frac{1}{1 + \sum_{i=2}^{n} |\lambda_i(A^{(L)})|} \geq \frac{1}{1 + (n-1)(1-\gamma)^L}

For large LL, p11p_1 \to 1 and all other pi0p_i \to 0, which drives Hs0H_s \to 0.

Step 3: Rate computation. Using the Taylor expansion of the entropy around the concentrated distribution and bounding the perturbation terms, we obtain the stated bound. The correction ϵ(n)\epsilon(n) accounts for two sources of approximation: (1) the non-commutativity of successive attention matrices, bounded via the Bauer-Fike theorem by O(max[A,A1]2)O(\max_\ell \|[A_\ell, A_{\ell-1}]\|_2) where [A,A1][A_\ell, A_{\ell-1}] denotes the matrix commutator, and (2) the deviation of softmax attention from doubly stochastic, bounded by O(δ)O(\delta) where δ\delta measures the row-to-column-sum discrepancy. For softmax attention with moderate temperature (entries bounded away from 0 and 1), both corrections diminish with increasing sequence length, giving ϵ(n)0\epsilon(n) \to 0 as nn \to \infty. QED.

3.2 The Spectral-Entropic Capacity

Definition 3.2 (Spectral-Entropic Capacity). Given a Transformer with initial spectral entropy H0=Hs(A1)H_0 = H_s(A_1) and average spectral gap γˉ=1Lγ\bar{\gamma} = \frac{1}{L}\sum_\ell \gamma_\ell, the Spectral-Entropic Capacity is:

SEC=H0log ⁣(11γˉ)\text{SEC} = \frac{H_0}{\log\!\left(\frac{1}{1 - \bar{\gamma}}\right)}

SEC represents the maximum number of attention-layer compositions before spectral entropy reaches zero, the theoretical ceiling on compositional reasoning depth.

3.3 The Reasoning Horizon

Theorem 3.2 (Reasoning Horizon). For a Transformer with initial spectral entropy H0H_0, average spectral gap γˉ\bar{\gamma}, and a reasoning task with compositional branching factor CC, the maximum compositional reasoning depth LL^* (the reasoning horizon) satisfies:

LH0logClog ⁣(11γˉ)L^* \leq \frac{H_0 - \log C}{\log\!\left(\frac{1}{1 - \bar{\gamma}}\right)}

For small γˉ\bar{\gamma} (which is the empirically observed regime), log(1/(1γˉ))γˉ\log(1/(1-\bar{\gamma})) \approx \bar{\gamma}, giving:

LH0logCγˉL^* \leq \frac{H_0 - \log C}{\bar{\gamma}}

Proof. The reasoning horizon is the largest LL such that Hs(A(L))logCH_s(A^{(L)}) \geq \log C (the entropy remains sufficient to distinguish CC compositional alternatives). Setting the bound from Theorem 3.1 equal to logC\log C and solving for LL (neglecting ϵ(n)\epsilon(n) for large nn):

H0Llog ⁣(11γˉ)=logCH_0 - L \cdot \log\!\left(\frac{1}{1 - \bar{\gamma}}\right) = \log C
L=H0logClog ⁣(11γˉ)L^* = \frac{H_0 - \log C}{\log\!\left(\frac{1}{1 - \bar{\gamma}}\right)}

This is an upper bound on the compositional depth achievable through the attention mechanism alone: for a given spectral gap and initial entropy, the attention component cannot mediate more than LL^* sequential compositions before representational collapse. The full Transformer may achieve somewhat greater depth through MLP compensation and residual pathways, making LL^* a conservative bound on the attention bottleneck. QED.

3.4 Connection to hallucination

Theorem 3.3 (Hallucination onset). When the compositional reasoning depth LL exceeds the reasoning horizon LL^*, the effective rank of the composed attention matrix drops below the compositional branching factor, and the probability of hallucination satisfies:

P(hallucinationL>L)1exp(Hs(A(L)))CP(\text{hallucination} \mid L > L^*) \geq 1 - \frac{\exp(H_s(A^{(L)}))}{C}

where CC is the branching factor. In the limit of deep composition (Hs0H_s \to 0), this approaches 11/C1 - 1/C.

Proof sketch. When Hs(A(L))<logCH_s(A^{(L)}) < \log C, the composed attention matrix A(L)A^{(L)} has effective rank erank(A(L))=exp(Hs(A(L)))<C\text{erank}(A^{(L)}) = \exp(H_s(A^{(L)})) < C. By the pigeonhole principle, at least two of the CC compositional alternatives map to representationally indistinguishable states under the attention mechanism. The attention component can resolve at most exp(Hs)\exp(H_s) distinct alternatives; the remaining alternatives become indistinguishable from the attention's perspective. Under the assumption that the model selects among indistinguishable alternatives with equal probability (justified by the softmax output distribution's symmetry over equivalent representations), the correct answer probability is at most exp(Hs)/C\exp(H_s)/C, giving the stated bound. The MLP and output layers may partially compensate, making this bound conservative with respect to the full model. QED.


4. Quantitative Predictions

4.1 Estimating the reasoning horizon for current models

Using empirically measured values from the literature:

  • Initial spectral entropy H0H_0: For a 12-head Transformer with sequence length n=512n = 512, the spectral entropy of the first layer's attention matrix is approximately H04.2H_0 \approx 4.2 nats (derived from the near-uniform attention distributions observed in early layers).

  • Average spectral gap γˉ\bar{\gamma}: The "Mind the Gap" spectral analysis (2025) reports spectral gaps in the range γ[0.15,0.35]\gamma \in [0.15, 0.35] for typical attention heads. Taking γˉ0.25\bar{\gamma} \approx 0.25.

  • Branching factor CC: For binary logical reasoning (true/false at each step), C=2C = 2, giving logC=log20.693\log C = \log 2 \approx 0.693.

Plugging into the reasoning horizon formula:

L4.20.6930.2514 layersL^* \leq \frac{4.2 - 0.693}{0.25} \approx 14 \text{ layers}

This predicts that a standard Transformer's attention mechanism can preserve compositional distinctions through at most ~14 successive attention compositions before representational collapse. The relationship between attention compositions and externally measured reasoning depth is indirect: MLP blocks, residual connections, and multi-head aggregation provide partial compensation, so the observable reasoning depth may exceed the attention-only bound. Nevertheless, the predicted order of magnitude is consistent with empirical observations: Apple's "Illusion of Thinking" study found performance collapse at compositional depths between 10 and 20 steps across multiple model families, and the "Model of Errors in Transformers" found systematic failure at moderate composition depths.

For models with smaller spectral gaps (better-conditioned attention, γˉ0.10\bar{\gamma} \approx 0.10):

L4.20.6930.1035 layersL^* \leq \frac{4.2 - 0.693}{0.10} \approx 35 \text{ layers}

This matches the observation that larger, better-trained models can reason slightly deeper before failing, but still hit a hard ceiling.

4.2 Why scaling does not solve the problem

A critical prediction of SEBT is that increasing model parameters does not fundamentally extend the reasoning horizon. The initial spectral entropy H0H_0 is bounded by logn\log n (where nn is the sequence length) regardless of model width or parameter count. The spectral gap γˉ\bar{\gamma} is a property of the learned attention patterns and tends to remain in the range [0.1,0.4][0.1, 0.4] across model sizes (as shown by the spectral evolution studies). Therefore:

L=O ⁣(lognγˉ)L^* = O\!\left(\frac{\log n}{\bar{\gamma}}\right)

Doubling the context length only adds log2/γˉ2.8\log 2 / \bar{\gamma} \approx 2.8 additional attention compositions. Doubling the model width dd does not directly affect the attention entropy bound, though it increases the expressivity of value projections WVW_V, which may partially compensate through richer per-position representations. The attention bottleneck remains the binding constraint on compositional depth, explaining the persistent failure of parameter scaling alone to solve compositional reasoning.


5. The Solution: Entropy-Preserving Composed Attention (EPCA)

The core problem identified by SEBT is that composing stochastic attention matrices contracts subdominant eigenvalues toward zero, causing spectral entropy to decay monotonically. Any effective solution must break this monotonic decay at the architectural level, not merely slow it with a loss penalty.

We present Entropy-Preserving Composed Attention (EPCA), a three-part solution consisting of (1) an architectural modification that injects information from the entropy-deficient subspace back into the representation, (2) an adaptive runtime mechanism that controls spectral gaps directly, and (3) a training-time regularization. We prove that EPCA provably eliminates the reasoning horizon under mild conditions.

5.1 Part I: Orthogonal Complement Residual Injection (OCRI)

Standard residual connections add the raw input XX_\ell back to the attention output: X^=AXWV+X\hat{X}_\ell = A_\ell X_\ell W_V + X_\ell. While this partially counteracts entropy decay, the raw residual XX_\ell is highly correlated with AXWVA_\ell X_\ell W_V because they share the dominant eigenvector direction. This means the residual wastes most of its capacity reinforcing information that was already preserved, while the entropy-deficient directions remain starved.

OCRI fixes this by injecting information specifically from the orthogonal complement of the dominant eigenspace, exactly the subspace where spectral entropy has been lost.

Definition 5.1 (Orthogonal Complement Residual Injection). Given attention matrix AA_\ell with dominant left eigenvector v1v_1 (the stationary distribution), define:

  1. The dominant projection: Π=v1v1/v12\Pi_\ell = v_1 v_1^\top / \|v_1\|^2
  2. The complement projection: Π=IΠ\Pi_\ell^\perp = I - \Pi_\ell
  3. The OCRI-modified attention output:
X^=AXWV+X+βΠXWR\hat{X}_\ell = A_\ell X_\ell W_V + X_\ell + \beta_\ell \cdot \Pi_\ell^\perp X_\ell W_R^\ell

where β(0,1)\beta_\ell \in (0, 1) is a learned scalar per layer and WRRd×dW_R^\ell \in \mathbb{R}^{d \times d} is a learned projection matrix.

The third term βΠXWR\beta_\ell \cdot \Pi_\ell^\perp X_\ell W_R^\ell adds a controlled amount of signal from precisely the directions that the attention operator suppressed. The projection Π\Pi_\ell^\perp ensures this injection is orthogonal to the dominant attention signal, so it cannot interfere with or corrupt the primary information flow.

Computing v1v_1 efficiently. For a row-stochastic matrix AA_\ell, the dominant left eigenvector v1v_1 satisfies v1A=v1v_1^\top A_\ell = v_1^\top. This is the stationary distribution of the Markov chain defined by AA_\ell. It can be approximated in O(n)O(n) by a single power iteration step from the uniform vector 1/n\mathbf{1}/n, since attention matrices are typically close to their stationary distribution after softmax normalization. In practice, v11/nv_1 \approx \mathbf{1}/n for well-conditioned attention (early layers), and a single matrix-vector product v11A/nv_1 \approx \mathbf{1}^\top A_\ell / n suffices for later layers.

5.2 Part II: Adaptive Spectral Damping (ASD)

OCRI restores entropy from outside the attention mechanism. ASD works from inside, directly modifying the attention matrix to control its spectral gap before it acts on the input.

Definition 5.2 (Adaptive Spectral Damping). Given attention matrix AA_\ell with spectral gap γ\gamma_\ell, define the damped attention matrix:

A~=(1α)A+αI\tilde{A}_\ell = (1 - \alpha_\ell) A_\ell + \alpha_\ell \cdot I

where α[0,1)\alpha_\ell \in [0, 1) is an adaptive damping coefficient and II is the identity matrix.

Why the identity matrix? The identity II is row-stochastic (each row sums to 1) with all eigenvalues equal to 1, giving spectral gap γ(I)=0\gamma(I) = 0. Mixing AA_\ell with II preserves the stochastic structure while reducing the spectral gap: the identity acts as pure self-attention (each token retains its own representation), counteracting the over-mixing that drives entropy decay.

Lemma 5.1 (Spectral gap of damped attention). The damped matrix A~\tilde{A}_\ell has eigenvalues λ~1=1\tilde{\lambda}_1 = 1 and λ~i=(1α)λi(A)+α\tilde{\lambda}_i = (1 - \alpha_\ell)\lambda_i(A_\ell) + \alpha_\ell for i2i \geq 2, giving spectral gap:

γ(A~)=(1α)γ(A)\gamma(\tilde{A}_\ell) = (1 - \alpha_\ell) \cdot \gamma(A_\ell)

Proof. Since II commutes with every matrix, AA_\ell and II share eigenspaces regardless of the structure of AA_\ell. The eigenvalues of the convex combination (1α)A+αI(1-\alpha)A + \alpha I are λ~i=(1α)λi(A)+α\tilde{\lambda}_i = (1-\alpha)\lambda_i(A) + \alpha for all ii. For i=1i = 1: λ~1=(1α)1+α=1\tilde{\lambda}_1 = (1-\alpha) \cdot 1 + \alpha = 1. For i2i \geq 2 with λi=1γi\lambda_i = 1 - \gamma_i: λ~i=(1α)(1γi)+α=1(1α)γi|\tilde{\lambda}_i| = |(1-\alpha)(1-\gamma_i) + \alpha| = |1 - (1-\alpha)\gamma_i|. Since (1α)γi<1(1-\alpha)\gamma_i < 1 for α<1\alpha < 1 and γi1\gamma_i \leq 1, we have λ~i=1(1α)γi|\tilde{\lambda}_i| = 1 - (1-\alpha)\gamma_i. Thus γ~=(1α)γ(A)\tilde{\gamma} = (1-\alpha)\gamma(A). No eigenvector approximation is needed because the identity commutes with all matrices exactly. QED.

Adaptive coefficient. To enforce a target spectral gap γtarget\gamma_{\text{target}}, we need (1α)γ=γtarget(1 - \alpha_\ell) \cdot \gamma_\ell = \gamma_{\text{target}}. Solving for α\alpha_\ell:

α=max ⁣(0,  1γtargetγ)\alpha_\ell = \max\!\left(0,\; 1 - \frac{\gamma_{\text{target}}}{\gamma_\ell}\right)

Lemma 5.2 (ASD achieves target spectral gap). With the adaptive coefficient above, γ(A~)=γtarget\gamma(\tilde{A}_\ell) = \gamma_{\text{target}} whenever γ>γtarget\gamma_\ell > \gamma_{\text{target}}, and γ(A~)=γ\gamma(\tilde{A}_\ell) = \gamma_\ell (unchanged) when γγtarget\gamma_\ell \leq \gamma_{\text{target}}.

Proof. When γγtarget\gamma_\ell \leq \gamma_{\text{target}}: γtarget/γ1\gamma_{\text{target}} / \gamma_\ell \geq 1, so 1γtarget/γ01 - \gamma_{\text{target}} / \gamma_\ell \leq 0, giving α=0\alpha_\ell = 0 and the matrix passes through unchanged with γ~=γ\tilde{\gamma} = \gamma_\ell. When γ>γtarget\gamma_\ell > \gamma_{\text{target}}: α=1γtarget/γ>0\alpha_\ell = 1 - \gamma_{\text{target}} / \gamma_\ell > 0, and (1α)γ=(γtarget/γ)γ=γtarget(1 - \alpha_\ell) \gamma_\ell = (\gamma_{\text{target}} / \gamma_\ell) \cdot \gamma_\ell = \gamma_{\text{target}}, so γ~=γtarget\tilde{\gamma} = \gamma_{\text{target}}. QED.

5.3 Part III: Entropic Spectral Regularization (ESR)

While OCRI and ASD operate at inference time, ESR operates at training time, encouraging the model to learn attention patterns with naturally small spectral gaps.

Definition 5.3 (ESR Loss). Add the following term to the training loss:

LESR=μ=1Lmax ⁣(0,  γ(A)γtarget)2    ν=1LHs(A)\mathcal{L}_{\text{ESR}} = \mu \sum_{\ell=1}^{L} \max\!\left(0,\; \gamma(A_\ell) - \gamma_{\text{target}}\right)^2 \;-\; \nu \sum_{\ell=1}^{L} H_s(A_\ell)

where μ,ν>0\mu, \nu > 0 are hyperparameters. The first term is a squared hinge loss on the spectral gap (differentiable, with gradient that grows linearly with violation). The second term directly maximizes spectral entropy.

Computing gradients. The spectral gap γ(A)\gamma(A_\ell) depends on the eigenvalues of AA_\ell, which in turn depend on the query/key weights WQ,WKW_Q, W_K through the softmax. The gradient γ/WQ\partial \gamma / \partial W_Q can be computed via implicit differentiation of the eigenvalue equation Av=λvAv = \lambda v, yielding:

λ2WQ=u2AWQv2\frac{\partial \lambda_2}{\partial W_Q} = u_2^\top \frac{\partial A}{\partial W_Q} v_2

where u2,v2u_2, v_2 are the left and right eigenvectors corresponding to λ2\lambda_2. In practice, power iteration computes λ2\lambda_2 and its eigenvectors, and automatic differentiation handles the rest.

5.4 The complete EPCA forward pass

Combining all three components, the EPCA-modified Transformer layer computes the following procedure.

Algorithm: EPCA Forward Pass (Layer \ell)

Input: XX_\ell (sequence representations), γtarget\gamma_{\text{target}} (spectral gap target)

Output: X+1X_{\ell+1} (updated representations)

Step 1. Compute the attention matrix:

A=softmax ⁣(XWQ(XWK)/dk)A_\ell = \text{softmax}\!\left(X_\ell W_Q (X_\ell W_K)^\top / \sqrt{d_k}\right)

Step 2. Estimate the spectral gap via power iteration (kk steps). Initialize vv as a random unit vector, then repeat kk times: compute vAAv(1Av/n)1v \leftarrow A_\ell^\top A_\ell v - (\mathbf{1}^\top A_\ell v / n) \cdot \mathbf{1}, then normalize vv/vv \leftarrow v / \|v\|. The approximate spectral gap is γ=1Av\gamma_\ell = 1 - \|A_\ell v\|.

Step 3. Apply Adaptive Spectral Damping (if needed):

α=max ⁣(0,  1γtargetγ)\alpha_\ell = \max\!\left(0,\; 1 - \frac{\gamma_{\text{target}}}{\gamma_\ell}\right)
A~=(1α)A+αI\tilde{A}_\ell = (1 - \alpha_\ell) \cdot A_\ell + \alpha_\ell \cdot I

Step 4. Compute the attention output:

Y=A~XWVY_\ell = \tilde{A}_\ell \cdot X_\ell \cdot W_V

Step 5. Apply Orthogonal Complement Residual Injection. Approximate the stationary distribution as v1=1A/nv_1 = \mathbf{1}^\top A_\ell / n, compute the complement projection Π=Iv1v1/v12\Pi^\perp = I - v_1 v_1^\top / \|v_1\|^2, then:

X+1=Y+X+βΠXWRX_{\ell+1} = Y_\ell + X_\ell + \beta_\ell \cdot \Pi^\perp \cdot X_\ell \cdot W_R

Step 6. Return X+1X_{\ell+1}.

5.5 Main theorem: EPCA eliminates the reasoning horizon

Theorem 5.3 (EPCA Reasoning Horizon). Under the EPCA mechanism with target spectral gap γtarget\gamma_{\text{target}} and OCRI injection coefficient β>0\beta > 0, the spectral entropy of the composed operator satisfies:

Hs(AEPCA(L))Hmin>0for all LH_s(A^{(L)}_{\text{EPCA}}) \geq H_{\min} > 0 \quad \text{for all } L

where HminH_{\min} is a positive constant independent of the number of layers LL. Consequently, the reasoning horizon becomes:

LEPCA=L^*_{\text{EPCA}} = \infty

provided ββcrit\beta \geq \beta_{\text{crit}}, where βcrit\beta_{\text{crit}} is a critical injection threshold defined below.

Full proof.

The proof proceeds in four parts: (A) we bound the entropy decay rate under ASD alone, (B) we compute the entropy injection rate from OCRI, (C) we find the equilibrium, and (D) we derive the critical injection threshold.

(A) Entropy decay rate under ASD.

With ASD enforcing γ(A~)γtarget\gamma(\tilde{A}_\ell) \leq \gamma_{\text{target}}, Theorem 3.1 gives us the per-layer entropy loss:

ΔHdecay=Hs(A())Hs(A(+1))log ⁣(11γtarget)γtarget\Delta H_{\text{decay}} = H_s(A^{(\ell)}) - H_s(A^{(\ell+1)}) \leq \log\!\left(\frac{1}{1 - \gamma_{\text{target}}}\right) \approx \gamma_{\text{target}}

for small γtarget\gamma_{\text{target}}.

(B) Entropy injection rate from OCRI.

The OCRI term βΠXWR\beta_\ell \Pi_\ell^\perp X_\ell W_R adds a component in the orthogonal complement of the dominant eigenvector. The modified output is X^=Y+X+βZ\hat{X}_\ell = Y_\ell + X_\ell + \beta Z_\ell where Z=ΠXWRZ_\ell = \Pi^\perp X_\ell W_R lives entirely in the (n1)(n-1)-dimensional complement of v1v_1 in position space.

The full layer operation acts on two dimensions simultaneously: A~\tilde{A}_\ell and Π\Pi^\perp act on positions (left-multiplying XRn×dX_\ell \in \mathbb{R}^{n \times d}), while WVW_V and WRW_R act on embeddings (right-multiplying). To analyze the position-space spectral entropy, we consider the effect on the position dimension separately. The attention operator A~\tilde{A}_\ell contracts position-space representations toward the stationary distribution v1v_1, suppressing the (n1)(n-1) orthogonal directions. The OCRI term, by construction, has zero component along v1v_1 (due to Π\Pi^\perp) and injects energy exclusively into these suppressed directions.

Let σmin\sigma_{\min} denote the smallest singular value of the learned projection WRW_R. For any direction uu in the complement of v1v_1, the OCRI term contributes at least βσmin\beta \sigma_{\min} energy along that direction, preventing the position-space singular values from collapsing to zero. Across the (n1)(n-1)-dimensional complement subspace, this provides a lower bound on the position-space diversity maintained through each layer.

The spectral entropy contribution from these boosted eigenvalues is:

ΔHinject(n1)βσmin1+(n1)βσminlog ⁣(1+(n1)βσminβσmin)\Delta H_{\text{inject}} \geq (n-1) \cdot \frac{\beta \sigma_{\min}}{1 + (n-1)\beta \sigma_{\min}} \cdot \log\!\left(\frac{1 + (n-1)\beta\sigma_{\min}}{\beta\sigma_{\min}}\right)

For small β\beta, this simplifies to:

ΔHinject(n1)βσminlog ⁣(1βσmin)\Delta H_{\text{inject}} \approx (n-1) \beta \sigma_{\min} \cdot \log\!\left(\frac{1}{\beta\sigma_{\min}}\right)

(C) Entropy equilibrium.

At equilibrium, the entropy injection rate equals the decay rate:

ΔHinject=ΔHdecay\Delta H_{\text{inject}} = \Delta H_{\text{decay}}
(n1)βσminlog ⁣(1βσmin)γtarget(n-1) \beta \sigma_{\min} \cdot \log\!\left(\frac{1}{\beta\sigma_{\min}}\right) \geq \gamma_{\text{target}}

This has a solution for any γtarget>0\gamma_{\text{target}} > 0 provided β>0\beta > 0 and σmin>0\sigma_{\min} > 0 (the learned projection has nonzero minimum singular value). The equilibrium spectral entropy HminH_{\min} is the value at which this balance holds:

Hmin=logC+ϵmarginH_{\min} = \log C + \epsilon_{\text{margin}}

where ϵmargin>0\epsilon_{\text{margin}} > 0 depends on β\beta, σmin\sigma_{\min}, and γtarget\gamma_{\text{target}}.

(D) Critical injection threshold.

Setting ΔHinject=ΔHdecay\Delta H_{\text{inject}} = \Delta H_{\text{decay}} and solving for β\beta:

βcritγtarget(n1)σminlog ⁣(γtarget/(n1)σmin)\beta_{\text{crit}} \approx \frac{\gamma_{\text{target}}}{(n-1) \sigma_{\min} \cdot \left|\log\!\left(\gamma_{\text{target}} \,/\, (n-1)\sigma_{\min}\right)\right|}

For typical values (n=512n = 512, σmin=0.1\sigma_{\min} = 0.1, γtarget=0.05\gamma_{\text{target}} = 0.05):

βcrit0.055110.1log(0.05/51.1)=0.0551.16.930.00014\beta_{\text{crit}} \approx \frac{0.05}{511 \cdot 0.1 \cdot |\log(0.05/51.1)|} = \frac{0.05}{51.1 \cdot 6.93} \approx 0.00014

This is an extremely small injection coefficient, meaning OCRI requires only a tiny perturbation to fully counteract the spectral entropy decay. QED.

5.6 Convergence guarantee

Theorem 5.4 (EPCA convergence to entropy equilibrium). For β>βcrit\beta > \beta_{\text{crit}}, the spectral entropy of the composed operator converges exponentially fast to the equilibrium HminH_{\min}:

Hs(AEPCA(L))HminH0HmineκL|H_s(A^{(L)}_{\text{EPCA}}) - H_{\min}| \leq |H_0 - H_{\min}| \cdot e^{-\kappa L}

where κ>0\kappa > 0 is the convergence rate determined by the entropy dynamics at equilibrium.

Proof. Define h=Hs(A())h_\ell = H_s(A^{(\ell)}). The dynamics under EPCA are:

h+1hγtarget+f(h,β)h_{\ell+1} \geq h_\ell - \gamma_{\text{target}} + f(h_\ell, \beta)

where f(h,β)f(h_\ell, \beta) is the entropy injection from OCRI, which is increasing in β\beta and decreasing in hh_\ell (when entropy is already high, the complement subspace has less room for injection). At h=Hminh_\ell = H_{\min}, we have f(Hmin,β)=γtargetf(H_{\min}, \beta) = \gamma_{\text{target}} (the equilibrium condition). Linearizing around HminH_{\min}:

h+1Hmin(hHmin)(1κ)h_{\ell+1} - H_{\min} \approx (h_\ell - H_{\min})(1 - \kappa)

where κ=f/hHmin>0\kappa = -\partial f / \partial h \big|_{H_{\min}} > 0 (positive because ff is decreasing in hh). For β>βcrit\beta > \beta_{\text{crit}}, the injection function ff crosses the decay rate γtarget\gamma_{\text{target}} at a unique equilibrium Hmin>logCH_{\min} > \log C, and the negative slope of ff at this crossing guarantees κ>0\kappa > 0. This is a contraction mapping, giving exponential convergence. QED.

5.7 Quantitative comparison: standard Transformer vs. EPCA

MetricStandard TransformerWith ESR onlyFull EPCA
Spectral gap γˉ\bar{\gamma}0.250.05Dynamically bounded at 0.05
Entropy behaviorMonotonic decay to 0Slower decay to 0Convergence to Hmin>0H_{\min} > 0
Reasoning horizon LL^*~14 attention compositions~56 compositionsUnbounded
Hallucination onsetAfter ~14 attention compositionsAfter ~56 compositionsNo attention-induced onset (for β>βcrit\beta > \beta_{\text{crit}})
Training overhead0%~10%~10% (ESR only at training)
Inference overhead0%0%~3-8% (OCRI + ASD)

The critical result: EPCA transforms the attention-induced reasoning horizon from a hard finite ceiling into an unbounded capacity, at the cost of a small inference overhead. The elimination of attention-induced hallucination onset is conditional on β>βcrit\beta > \beta_{\text{crit}} and assumes the learned projection WRW_R maintains a nonzero minimum singular value, which can be enforced via spectral normalization of WRW_R during training. Other sources of hallucination (e.g., training data gaps, output distribution calibration) are outside the scope of this mechanism.

5.8 Computational cost analysis

The EPCA overhead per layer consists of:

  1. Power iteration for spectral gap estimation: O(n2k)O(n^2 k) where k5k \leq 5 iterations. This dominates.
  2. Stationary distribution approximation: O(n2)O(n^2) for one matrix-vector product.
  3. Complement projection: O(nd)O(nd) for the rank-1 update.
  4. Identity damping: O(n2)O(n^2) for scaling AA_\ell by (1α)(1-\alpha_\ell) and adding α\alpha_\ell to the diagonal.

The total per-layer overhead is O(n2k)O(n^2 k), compared to the O(n2d)O(n^2 d) cost of standard attention. Since kdk \ll d (5 vs. 768+), the overhead ratio is approximately k/d0.6%k/d \approx 0.6\% per layer. The practical overhead is 3-8% due to memory access patterns and the additional learned parameters (β\beta_\ell and WRW_R^\ell).

For the training-time ESR component, gradient computation through the eigenvalue requires one backward pass through the power iteration, adding approximately 10% to training cost. This is a one-time cost that produces permanently better-conditioned attention matrices.


6. Experimental Predictions and Falsifiability

6.1 Concrete predictions

SEBT + EPCA makes the following falsifiable predictions:

  1. Prediction 1 (Spectral-depth correlation). For any pretrained Transformer, the depth at which compositional reasoning accuracy drops below 50% should correlate with L=(H0logC)/γˉL^* = (H_0 - \log C)/\bar{\gamma} with R2>0.8R^2 > 0.8.

  2. Prediction 2 (Head-level variation). Attention heads with smaller spectral gaps should contribute more to compositional reasoning. Ablating low-spectral-gap heads should disproportionately damage reasoning performance.

  3. Prediction 3 (EPCA improvement). Adding OCRI + ASD to a pretrained model (as a fine-tuning step) should extend measurable compositional reasoning depth by a factor of at least γoriginal/γtarget\gamma_{\text{original}} / \gamma_{\text{target}}.

  4. Prediction 4 (Hallucination correlation). The probability of hallucination on multi-step reasoning tasks should increase sharply at layer depth LLL \approx L^* and follow the bound P(hallucination)11/CP(\text{hallucination}) \geq 1 - 1/C for L>LL > L^*.

  5. Prediction 5 (Scale independence). Doubling model parameters while keeping architecture fixed should not change the attention-measured reasoning horizon by more than 15%, since LL^* depends on spectral properties of the attention mechanism, not total parameter count.

6.2 Proposed experimental protocol

To validate or falsify SEBT:

  1. Select 3+ Transformer models of different sizes (e.g., 1B, 7B, 70B parameters).
  2. For each model, measure the spectral gap γ\gamma_\ell and spectral entropy Hs(A)H_s(A_\ell) of every attention head at every layer, using a standard evaluation corpus.
  3. Compute the predicted reasoning horizon LL^* from the measured values.
  4. Evaluate compositional reasoning on tasks with controlled depth (e.g., multi-step arithmetic, syllogistic chains, nested function composition).
  5. Plot actual accuracy vs. predicted reasoning horizon. If SEBT is correct, accuracy should drop sharply near LL^*.

This experiment requires only standard tools (eigenvalue computation on attention matrices, which are already cached during inference) and standard benchmarks.


7. Discussion

7.1 Why this framework is necessary

The existing theoretical landscape for LLM failures is fragmented. Computability-theoretic proofs of hallucination inevitability tell us that hallucination happens but not when or how much. Communication complexity lower bounds on attention expressivity tell us that compositional limits exist but not where the boundary lies. Spectral analyses of attention matrices describe what happens to the eigenvalue distribution but not why it matters for downstream reasoning.

SEBT bridges all three domains. It starts from the spectral properties of attention (the "what"), derives information-theoretic consequences (the "why"), and produces constructive bounds on reasoning depth (the "where"). EPCA then provides a constructive solution that provably eliminates the reasoning horizon.

7.2 Limitations and open questions

Several aspects require further investigation:

  1. Scope of the theoretical model. The spectral entropy bounds are derived for the attention mechanism in isolation, abstracting away MLP blocks and layer normalization. In practice, MLPs perform substantial representational transformations between attention layers, and layer normalization re-scales the spectrum. These components may partially compensate for attention-induced entropy decay, making the theoretical reasoning horizon a conservative lower bound. The OCRI mechanism subsumes standard residual connections by injecting signal in a more targeted subspace. The exact interaction between OCRI, standard residuals, MLPs, and layer normalization is architecture-dependent and requires empirical measurement.

  2. Multi-head aggregation. With hh heads, each head has its own spectral gap γ(i)\gamma_\ell^{(i)}. The effective composed spectral gap for the multi-head mechanism is bounded by γˉmultiminiγ(i)\bar{\gamma}_{\text{multi}} \leq \min_i \gamma_\ell^{(i)} (the best head dominates). EPCA can be applied per-head, with independent β(i)\beta_\ell^{(i)} and WR,iW_R^{\ell,i} for each head.

  3. Non-stochastic corrections. Attention matrices with very sharp or very flat softmax may deviate from the stochastic matrix assumptions. The Bauer-Fike perturbation bounds used in the proofs handle this, but tighter bounds are possible using structured perturbation theory specific to softmax-generated matrices.

  4. Empirical validation at scale. The most important next step is direct experimental validation. The predictions in Section 6 are designed to be easily testable with existing infrastructure.

  5. Relationship to chain-of-thought. Chain-of-thought prompting resets the spectral entropy by generating intermediate tokens and re-attending. EPCA achieves a similar effect internally, without requiring explicit intermediate generation. An open question is whether EPCA can fully replace chain-of-thought reasoning, or whether external token generation provides benefits beyond spectral entropy restoration (e.g., working memory expansion).

7.3 Implications for architecture design

If SEBT is correct, several architectural implications follow:

  • Depth is not free. Adding more layers to a Transformer does not linearly increase the attention mechanism's compositional capacity. Beyond the reasoning horizon, additional attention layers yield diminishing returns. EPCA makes deep layers useful again by maintaining spectral entropy.
  • Attention alternatives matter. Architectures that replace softmax with mechanisms having smaller spectral gaps (such as linear attention with ReLU kernels, which maintain entropy at logN\log N per the EMNLP 2025 findings) should exhibit deeper reasoning horizons natively. EPCA can be seen as retrofitting this property onto softmax attention.
  • Chain-of-thought works because it resets the bottleneck. When a model generates intermediate tokens and re-attends, it resets A(L)A^{(L)} to a fresh attention matrix, restoring spectral entropy to H0H_0. Chain-of-thought extends the effective reasoning depth from LL^* to LTL^* \cdot T, where TT is the number of segments. EPCA provides the same benefit without the token overhead.
  • EPCA as a drop-in upgrade. Because OCRI and ASD only modify the attention output and the attention matrix respectively, they can be added to any existing Transformer architecture without changing the model's parameter count significantly. The learned parameters (β\beta_\ell, WRW_R^\ell) can be trained via fine-tuning, making EPCA applicable to pretrained models.

8. Conclusion

We have introduced the Spectral-Entropic Bottleneck Theory, a unified mathematical framework that explains two of the most critical failures of large language models, compositional reasoning collapse and hallucination, as consequences of a single mechanism: the monotonic decay of spectral entropy in attention matrices across layers.

The theory produces a novel quantity, the Spectral-Entropic Capacity, which provides a constructive upper bound on the maximum compositional reasoning depth:

L=H0logCγˉL^* = \frac{H_0 - \log C}{\bar{\gamma}}

We then presented Entropy-Preserving Composed Attention (EPCA), a three-part architectural solution that provably eliminates the reasoning horizon:

  • Orthogonal Complement Residual Injection (OCRI) injects information from the entropy-deficient subspace, counteracting eigenvalue contraction.
  • Adaptive Spectral Damping (ASD) dynamically controls the spectral gap of each attention matrix to a target level.
  • Entropic Spectral Regularization (ESR) encourages the model to learn naturally well-conditioned attention during training.

We proved that EPCA achieves a stable entropy equilibrium Hmin>0H_{\min} > 0 independent of depth, eliminating the attention-induced reasoning horizon for β>βcrit\beta > \beta_{\text{crit}}. The critical injection threshold βcrit1.4×104\beta_{\text{crit}} \approx 1.4 \times 10^{-4} is remarkably small, meaning the architectural modification is a minimal perturbation that produces a qualitative change in the attention mechanism's compositional capacity.

The most important contributions of this work are twofold. First, hallucination and reasoning failure share a common spectral-entropic root in the attention mechanism. Second, the attention bottleneck can be provably eliminated through targeted architectural intervention, not through scaling. The solution is small, efficient, and retrofittable to existing models. The bounds derived here are conservative, addressing only the attention component; the full Transformer's reasoning capacity involves additional pathways (MLPs, residuals) that merit further theoretical investigation.


References

  1. Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817.

  2. Banerjee, S., et al. (2024). LLMs Will Always Hallucinate, and We Need to Live With This. arXiv:2409.05746.

  3. Zhai, S., et al. (2023). Stabilizing Transformer Training by Preventing Attention Entropy Collapse. ICML 2023.

  4. Mind the Gap: A Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers. (2025). arXiv:2410.07799.

  5. Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning. (2026). arXiv:2601.00791.

  6. Lei, S., et al. (2025). Revisiting LLM Reasoning via Information Bottleneck. arXiv:2507.18391.

  7. Cui, Y., et al. (2025). The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models. arXiv:2505.22617.

  8. Apple ML Research. (2025). The Illusion of Thinking. Apple ML Technical Report.

  9. Raju, S. & Netrapalli, P. (2025). A Model of Errors in Transformers. arXiv:2601.14175.

  10. Papadimitriou, C., et al. (2024). On Limitations of the Transformer Architecture. NSF PAR.

  11. Mudarisov, T., et al. (2025). Limitations of Normalization in Attention Mechanism. arXiv:2508.17821.

  12. Duman Keles, F., et al. (2023). On The Computational Complexity of Self-Attention. PMLR.

  13. Agarwal, R., et al. (2025). The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning. arXiv:2505.15134.

  14. ACL Findings. (2025). Implicit Reasoning in Transformers is Reasoning through Shortcuts.

  15. EMNLP Main. (2025). Variance Sensitivity Induces Attention Entropy Collapse.

  16. Unpacking Softmax: How Temperature Drives Representation Collapse. (2025). arXiv:2506.01562.