Spectral-Entropic Bottleneck Theory: A Mathematical Framework for the Reasoning Horizon in Large Language Models
Abstract
Large Language Models based on the Transformer architecture exhibit two critical and persistent failures: (1) the inability to perform reliable compositional reasoning beyond a shallow depth, and (2) the inevitability of hallucination. Prior work has treated these as separate phenomena, explaining hallucination through computability theory and reasoning failure through expressivity limits. In this paper, we introduce the Spectral-Entropic Bottleneck Theory (SEBT), a unified mathematical framework showing that both failures share a common root in the attention mechanism: the monotonic decay of spectral entropy in attention matrices across successive layers. We define a novel quantity, the Spectral-Entropic Capacity (SEC), which provides a conservative upper bound on the compositional reasoning depth mediated by the attention component of any softmax-attention Transformer. We prove that SEC decreases at a rate governed by the spectral gap of attention matrices and derive a closed-form expression for the reasoning horizon, the maximum number of compositional steps a model can perform before its representations collapse into a low-rank subspace where compositional distinctions are lost. We then present Entropy-Preserving Composed Attention (EPCA), a three-part architectural solution that provably eliminates the attention-induced reasoning horizon by maintaining a stable spectral entropy equilibrium across arbitrary depth.
This paper was researched, developed, and written with Claude Opus 4.6 by Anthropic, used as an AI research collaborator for theory formulation, proof construction, and scientific writing.
1. Introduction
The Transformer architecture has become the dominant backbone for large language models. Despite remarkable empirical success, two fundamental problems remain unresolved in 2026. First, LLMs struggle with compositional reasoning, the ability to chain together multiple logical or mathematical steps to reach a conclusion. Apple's "The Illusion of Thinking" study (2025) demonstrated that both standard and reasoning-augmented models experience complete performance collapse when compositional depth exceeds a critical threshold. Second, hallucination has been proven mathematically inevitable by multiple independent research groups, including work by Xu et al. (2024) and Banerjee et al. (2024), who showed that the probabilistic and computational foundations of LLMs guarantee some degree of unfaithful generation.
What has been missing is a unified explanation. Why do these two problems co-occur? Why does compositional reasoning fail at roughly the same depth regardless of model scale? Why does hallucination become worse precisely when models attempt deeper reasoning chains?
This paper answers these questions. We prove that the root cause is the spectral-entropic bottleneck: the softmax attention mechanism induces a monotonic decay of spectral entropy across layers, which progressively destroys the representational diversity needed for compositional reasoning. When spectral entropy drops below a critical threshold, the model's internal representations lose the geometric structure required to distinguish between compositional alternatives, and the model defaults to probabilistic interpolation, which manifests as hallucination.
We then go beyond diagnosis and present a constructive solution. Entropy-Preserving Composed Attention (EPCA) is a three-part architectural modification that provably breaks the monotonic decay within the attention mechanism, maintaining a stable entropy floor across arbitrary depth and eliminating the attention-induced reasoning horizon.
1.1 Related work and the gap we address
Several lines of research converge toward the ideas in this paper, but none provides the unified framework we develop here.
Spectral analysis of attention. Zhai et al. (2023) identified attention entropy collapse as a cause of training instability. The "Mind the Gap" paper (2024, updated 2025) uncovered spectral rank collapse driven by the dominant eigenvalue of attention matrices using Random Matrix Theory. The "Geometry of Reason" paper (January 2026) showed that spectral signatures of attention matrices can distinguish valid from invalid mathematical reasoning. However, none of these works derived constructive bounds connecting spectral properties to reasoning depth.
Information-theoretic approaches. Lei et al. (2025) applied the Information Bottleneck principle to LLM reasoning via RL post-training. The entropy mechanism research by Cui et al. (2025) showed that policy entropy collapse predicts downstream performance via a simple exponential function. Neither work connects to the spectral domain or provides compositional depth bounds.
Compositional reasoning limits. Papadimitriou et al. used communication complexity to prove fundamental limits of Transformer attention. The "Model of Errors in Transformers" by Raju and Netrapalli (2025) identified the lack of a mathematical framework for understanding how effective parameters relate to compositional capacity. ACL 2025 findings showed that implicit reasoning relies on shortcut patterns rather than true composition.
The gap. No existing framework unifies the spectral dynamics of attention, the information-theoretic capacity of representations, and the compositional reasoning depth into a single theory with constructive bounds, let alone provides an architectural solution with provable guarantees. SEBT and EPCA fill this gap.
2. Mathematical Preliminaries
2.1 Attention as a spectral operator
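Consider a Transformer layer with input sequence $X \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the embedding dimension. The self-attention operation computes:
$$A = \mathrm{softmax}\!\left(\frac{(X W_Q)(X W_K)^\top}{\sqrt{d_k}}\right)$$
where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ are the query and key projection matrices and $A \in \mathbb{R}^{n \times n}$ is the attention matrix. The output is $A X W_V$.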
We treat $A$ as a stochastic matrix (each row sums to 1 due to softmax). As a stochastic matrix, $A$ has:
- A largest eigenvalue $\lambda_1 = 1$ (by the Perron-Frobenius theorem)
- Remaining eigenvalues $\lambda_2, \dots, \lambda_n$ with $|\lambda_i| \le 1$
Definition 2.1 (Spectral Gap). The spectral gap of attention matrix $A$ is:
$$\gamma(A) = 1 - |\lambda_2(A)|$$
where $\lambda_2(A)$ is the second-largest eigenvalue in absolute value.
2.2 Spectral entropy of attention matrices
We define the spectral entropy using the normalized eigenvalue magnitudes as a probability distribution.
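Definition 2.2 (Spectral Entropy). For attention matrix $A$ with eigenvalues $\lambda_1, \dots, \lambda_n$, define the spectral probability distribution $p_i = |\lambda_i| / \sum_{j=1}^{n} |\lambda_j|$. The spectral entropy is:
$$H_s(A) = -\sum_{i=1}^{n} p_i \log p_i$$
When all eigenvalues have equal magnitude, $H_s(A) = \log n$ (maximum). When one eigenvalue dominates (large spectral gap), $H_s(A) \to 0$.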
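The two definitions above can be checked numerically. The following is a minimal sketch, assuming the spectral gap is $1 - |\lambda_2|$ and the spectral entropy is the Shannon entropy (in nats) of the normalized eigenvalue magnitudes. The helper names (`softmax_rows`, `spectral_gap`, `spectral_entropy`), the random weights, and the sizes `n = 8`, `d = 4` are illustrative assumptions, not values from this paper.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def spectral_gap(A):
    """1 - |lambda_2|, with lambda_2 the second-largest eigenvalue in magnitude."""
    mags = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return float(1.0 - mags[1])

def spectral_entropy(A, tol=1e-12):
    """Shannon entropy (nats) of the normalized eigenvalue magnitudes."""
    mags = np.abs(np.linalg.eigvals(A))
    p = mags / mags.sum()
    p = p[p > tol]  # treat 0 * log 0 as 0
    return float(-(p * np.log(p)).sum())

# Hypothetical toy attention matrix built from random projections.
rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
A = softmax_rows((X @ W_Q) @ (X @ W_K).T / np.sqrt(d))

gap = spectral_gap(A)
H = spectral_entropy(A)
print(f"spectral gap = {gap:.3f}, spectral entropy = {H:.3f} nats (max {np.log(n):.3f})")
```

The two limiting cases in Definition 2.2 correspond to `spectral_entropy(np.eye(n))`, which equals $\log n$, and to a rank-one matrix such as `np.full((n, n), 1 / n)`, whose entropy is numerically zero.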
2.3 Compositional reasoning as sequential attention composition
We model compositional reasoning of depth $L$ as the sequential application of $L$ attention layers. This formulation isolates the attention mechanism as the primary information bottleneck, abstracting away MLP blocks and layer normalization. The resulting bounds represent constraints imposed by the attention component; the full Transformer may partially compensate through non-attention pathways, making these bounds conservative. A reasoning chain of depth $L$ produces the composed attention operator:
$$A^{(L)} = A_L A_{L-1} \cdots A_1$$
For compositional reasoning to succeed, $A^{(L)}$ must preserve sufficient representational diversity to distinguish between the possible compositional outcomes at each step. This requires:
$$H_s(A^{(L)}) \ge \log k$$
where $k$ is the compositional branching factor (the number of distinct semantic alternatives the model must distinguish at each reasoning step).
3. The Spectral-Entropic Bottleneck Theory
3.1 Core theorem: Monotonic spectral entropy decay
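Theorem 3.1 (Spectral-Entropic Decay). Let $A_1, \dots, A_L$ be the attention matrices of $L$ successive Transformer layers, each with spectral gap $\gamma_\ell$. Then the spectral entropy of the composed operator $A^{(L)}$ satisfies:
$$H_s(A^{(L)}) \le H_0 + \sum_{\ell=1}^{L} \log(1 - \gamma_\ell) + \epsilon$$
where $H_0$ is the initial spectral entropy and $\epsilon \to 0$ as $n \to \infty$ is a correction term accounting for eigenvalue perturbation under composition.
In particular, if all layers have a uniform spectral gap $\gamma$:
$$H_s(A^{(L)}) \le H_0 - L\,|\log(1 - \gamma)| + \epsilon$$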
Proof sketch. The proof proceeds in three steps.
Step 1: Contraction under composition. When two positive stochastic matrices $A$ and $B$ are multiplied, the resulting matrix $AB$ is also stochastic with $\lambda_1(AB) = 1$. For positive stochastic matrices (guaranteed by softmax), the coefficient of ergodicity $\tau(\cdot)$ is submultiplicative: $\tau(AB) \le \tau(A)\,\tau(B)$, and bounds the second eigenvalue: $|\lambda_2(A)| \le \tau(A)$. For doubly stochastic matrices, $\tau(A) = |\lambda_2(A)|$ exactly, giving the multiplicative bound $|\lambda_2(AB)| \le |\lambda_2(A)|\,|\lambda_2(B)|$. Softmax attention matrices are row-stochastic but not generally doubly stochastic; however, for well-conditioned attention (moderate temperature, typical in trained models), the deviation from doubly stochastic is small and the leading-order contraction rate remains governed by the spectral gap. After $L$ compositions with uniform spectral gap $\gamma$:
$$|\lambda_2(A^{(L)})| \le (1 - \gamma)^L$$
while $\lambda_1 = 1$ is preserved. This bound is exact for doubly stochastic matrices and holds approximately for general softmax attention, with corrections absorbed into the $\epsilon$ term below.
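Step 2: Eigenvalue concentration drives entropy decay. As $|\lambda_i(A^{(L)})| \to 0$ exponentially for $i \ge 2$, the spectral probability distribution concentrates on $\lambda_1$:
$$p_1 = \frac{1}{1 + \sum_{i \ge 2} |\lambda_i(A^{(L)})|} \ge \frac{1}{1 + (n-1)(1-\gamma)^L}$$
For large $L$, $p_1 \to 1$ and all other $p_i \to 0$, which drives $H_s(A^{(L)}) \to 0$.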
Step 3: Rate computation. Using the Taylor expansion of the entropy around the concentrated distribution and bounding the perturbation terms, we obtain the stated bound. The correction $\epsilon$ accounts for two sources of approximation: (1) the non-commutativity of successive attention matrices, bounded via the Bauer-Fike theorem in terms of the matrix commutators $[A_i, A_j] = A_i A_j - A_j A_i$, and (2) the deviation of softmax attention from doubly stochastic, bounded in terms of the row-to-column-sum discrepancy. For softmax attention with moderate temperature (entries bounded away from 0 and 1), both corrections diminish with increasing sequence length, giving $\epsilon \to 0$ as $n \to \infty$. QED.
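The contraction in Steps 1 and 2 can be observed on a toy example. The sketch below is an illustration under stated assumptions, not a measurement on trained attention: it repeatedly composes one fixed symmetric, doubly stochastic matrix (a lazy random walk on a 6-cycle, chosen here because its eigenvalues are known in closed form), so it covers only the uniform-gap, doubly stochastic case. The names `lazy_cycle_walk` and `spectral_entropy` are hypothetical helpers.

```python
import numpy as np

def spectral_entropy(A, tol=1e-12):
    """Shannon entropy (nats) of the normalized eigenvalue magnitudes."""
    mags = np.abs(np.linalg.eigvals(A))
    p = mags / mags.sum()
    p = p[p > tol]  # treat 0 * log 0 as 0
    return float(-(p * np.log(p)).sum())

def lazy_cycle_walk(n):
    """Symmetric doubly stochastic matrix: stay with prob. 1/2, else step left or right."""
    P = np.roll(np.eye(n), 1, axis=1)  # cyclic shift permutation
    return 0.5 * np.eye(n) + 0.25 * (P + P.T)

n = 6
A = lazy_cycle_walk(n)  # eigenvalues 1, 0.75, 0.75, 0.25, 0.25, 0 -> spectral gap 0.25

entropies, second_eigs = [], []
composed = np.eye(n)
for L in range(1, 13):
    composed = composed @ A  # A^(L) for identical layers
    mags = np.sort(np.abs(np.linalg.eigvals(composed)))[::-1]
    second_eigs.append(mags[1])
    entropies.append(spectral_entropy(composed))

for L, (lam2, H) in enumerate(zip(second_eigs, entropies), start=1):
    print(f"L={L:2d}  |lambda_2|={lam2:.4f}  H_s={H:.4f}")
```

In this special case $|\lambda_2(A^{(L)})|$ equals $(1-\gamma)^L$ exactly and the spectral entropy is non-increasing in $L$; products of distinct, non-symmetric attention matrices need not behave this cleanly.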
3.2 The Spectral-Entropic Capacity
Definition 3.2 (Spectral-Entropic Capacity). Given a Transformer with initial spectral entropy $H_0$ and average spectral gap $\bar\gamma$, the Spectral-Entropic Capacity is:
$$\mathrm{SEC} = \frac{H_0}{-\log(1 - \bar\gamma)}$$
SEC represents the maximum number of attention-layer compositions before spectral entropy reaches zero, the theoretical ceiling on compositional reasoning depth.
3.3 The Reasoning Horizon
Theorem 3.2 (Reasoning Horizon). For a Transformer with initial spectral entropy $H_0$, average spectral gap $\bar\gamma$, and a reasoning task with compositional branching factor $k$, the maximum compositional reasoning depth $L^*$ (the reasoning horizon) satisfies:
$$L^* \le \frac{H_0 - \log k}{-\log(1 - \bar\gamma)}$$
For small $\bar\gamma$ (which is the empirically observed regime), $-\log(1 - \bar\gamma) \approx \bar\gamma$, giving:
$$L^* \lesssim \frac{H_0 - \log k}{\bar\gamma}$$
Proof. The reasoning horizon is the largest $L$ such that $H_s(A^{(L)}) \ge \log k$ (the entropy remains sufficient to distinguish $k$ compositional alternatives). Setting the bound from Theorem 3.1 equal to $\log k$ and solving for $L$ (neglecting $\epsilon$ for large $n$):
$$H_0 - L\,|\log(1 - \bar\gamma)| = \log k \quad\Longrightarrow\quad L^* = \frac{H_0 - \log k}{-\log(1 - \bar\gamma)}$$
This is an upper bound on the compositional depth achievable through the attention mechanism alone: for a given spectral gap and initial entropy, the attention component cannot mediate more than $L^*$ sequential compositions before representational collapse. The full Transformer may achieve somewhat greater depth through MLP compensation and residual pathways, making $L^*$ a conservative bound on the attention bottleneck. QED.
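As a small illustration of how the horizon depends on its inputs, the sketch below assumes the linear-decay reading of Theorem 3.1, so that the horizon is $(H_0 - \log k) / (-\log(1 - \bar\gamma))$, with the small-gap approximation replacing the denominator by $\bar\gamma$. The function name `reasoning_horizon` and the input `H0 = 4.0` are hypothetical and chosen only for the example; they are not measured values.

```python
import math

def reasoning_horizon(H0, gamma, k, small_gap_approx=False):
    """Largest integer L with H0 - L * rate >= log(k), under the assumed linear decay.

    rate is -log(1 - gamma), or gamma itself when small_gap_approx is True.
    """
    if not (0.0 < gamma < 1.0):
        raise ValueError("gamma must lie in (0, 1)")
    if k < 1:
        raise ValueError("k must be at least 1")
    rate = gamma if small_gap_approx else -math.log1p(-gamma)
    budget = H0 - math.log(k)  # entropy available above the log(k) floor
    if budget < 0:
        return 0
    return math.floor(budget / rate)

# Hypothetical inputs: H0 = 4.0 nats, binary branching (k = 2).
exact = reasoning_horizon(4.0, 0.25, 2)
approx = reasoning_horizon(4.0, 0.25, 2, small_gap_approx=True)
smaller_gap = reasoning_horizon(4.0, 0.05, 2)
print(exact, approx, smaller_gap)
```

Because $-\log(1-\bar\gamma) > \bar\gamma$, the small-gap approximation always yields a horizon at least as large as the exact expression, and the difference is noticeable at $\bar\gamma = 0.25$.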
3.4 Connection to hallucination
Theorem 3.3 (Hallucination onset). When the compositional reasoning depth $L$ exceeds the reasoning horizon $L^*$, the effective rank $r_{\mathrm{eff}}$ of the composed attention matrix drops below the compositional branching factor, and the probability of hallucination satisfies:
$$P_{\mathrm{halluc}} \ge 1 - \frac{r_{\mathrm{eff}}(A^{(L)})}{k}$$
where $k$ is the branching factor. In the limit of deep composition ($L \to \infty$), this approaches $1 - 1/k$.
Proof sketch. When $L > L^*$, the composed attention matrix has effective rank $r_{\mathrm{eff}} < k$. By the pigeonhole principle, at least two of the $k$ compositional alternatives map to representationally indistinguishable states under the attention mechanism. The attention component can resolve at most $r_{\mathrm{eff}}$ distinct alternatives; the remaining $k - r_{\mathrm{eff}}$ alternatives become indistinguishable from the attention's perspective. Under the assumption that the model selects among indistinguishable alternatives with equal probability (justified by the softmax output distribution's symmetry over equivalent representations), the correct answer probability is at most $r_{\mathrm{eff}}/k$, giving the stated bound. The MLP and output layers may partially compensate, making this bound conservative with respect to the full model. QED.
4. Quantitative Predictions
4.1 Estimating the reasoning horizon for current models
Using empirically measured values from the literature:
Initial spectral entropy : For a 12-head Transformer with sequence length , the spectral entropy of the first layer's attention matrix is approximately nats (derived from the near-uniform attention distributions observed in early layers).
Average spectral gap : The "Mind the Gap" spectral analysis (2025) reports spectral gaps in the range for typical attention heads. Taking .
Branching factor $k$: For binary logical reasoning (true/false at each step), $k = 2$, giving $\log k = \ln 2 \approx 0.693$ nats.
Plugging into the reasoning horizon formula:
This predicts that a standard Transformer's attention mechanism can preserve compositional distinctions through at most ~14 successive attention compositions before representational collapse. The relationship between attention compositions and externally measured reasoning depth is indirect: MLP blocks, residual connections, and multi-head aggregation provide partial compensation, so the observable reasoning depth may exceed the attention-only bound. Nevertheless, the predicted order of magnitude is consistent with empirical observations: Apple's "Illusion of Thinking" study found performance collapse at compositional depths between 10 and 20 steps across multiple model families, and the "Model of Errors in Transformers" found systematic failure at moderate composition depths.
For models with smaller spectral gaps (better-conditioned attention, ):
This matches the observation that larger, better-trained models can reason slightly deeper before failing, but still hit a hard ceiling.
4.2 Why scaling does not solve the problem
A critical prediction of SEBT is that increasing model parameters does not fundamentally extend the reasoning horizon. The initial spectral entropy $H_0$ is bounded by $\log n$ (where $n$ is the sequence length) regardless of model width or parameter count. The spectral gap $\bar\gamma$ is a property of the learned attention patterns and tends to remain in the same range across model sizes (as shown by the spectral evolution studies). Therefore:
$$L^* \lesssim \frac{\log n - \log k}{\bar\gamma}$$
Doubling the context length only adds $\log 2 / \bar\gamma$ additional attention compositions. Doubling the model width does not directly affect the attention entropy bound, though it increases the expressivity of value projections $W_V$, which may partially compensate through richer per-position representations. The attention bottleneck remains the binding constraint on compositional depth, explaining the persistent failure of parameter scaling alone to solve compositional reasoning.
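The effect of context length can be made explicit with a short worked step. This is a sketch that assumes the small-gap form of the horizon bound, $L^*_{\max}(n) = (\log n - \log k)/\bar\gamma$, with the entropy ceiling $H_0 \le \log n$ taken as attained; $L^*_{\max}$ is notation introduced here for the value of the bound.

```latex
\begin{aligned}
L^*_{\max}(2n) - L^*_{\max}(n)
  &= \frac{\log(2n) - \log k}{\bar\gamma} - \frac{\log n - \log k}{\bar\gamma} \\
  &= \frac{\log 2}{\bar\gamma} \approx \frac{0.693}{\bar\gamma},
\qquad\text{e.g. } \bar\gamma = 0.25 \;\Rightarrow\; \frac{0.693}{0.25} \approx 2.8 .
\end{aligned}
```

Under these assumptions the gain from doubling the context is independent of $n$ and $k$, so each further doubling buys the same small, fixed number of compositions.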
5. The Solution: Entropy-Preserving Composed Attention (EPCA)
The core problem identified by SEBT is that composing stochastic attention matrices contracts subdominant eigenvalues toward zero, causing spectral entropy to decay monotonically. Any effective solution must break this monotonic decay at the architectural level, not merely slow it with a loss penalty.
We present Entropy-Preserving Composed Attention (EPCA), a three-part solution consisting of (1) an architectural modification that injects information from the entropy-deficient subspace back into the representation, (2) an adaptive runtime mechanism that controls spectral gaps directly, and (3) a training-time regularization. We prove that EPCA eliminates the attention-induced reasoning horizon under mild conditions.
5.1 Part I: Orthogonal Complement Residual Injection (OCRI)
Standard residual connections add the raw input back to the attention output: $X' = X + A X W_V$. While this partially counteracts entropy decay, the raw residual $X$ is highly correlated with $A X$ because they share the dominant eigenvector direction. This means the residual wastes most of its capacity reinforcing information that was already preserved, while the entropy-deficient directions remain starved.
OCRI fixes this by injecting information specifically from the orthogonal complement of the dominant eigenspace, exactly the subspace where spectral entropy has been lost.
Definition 5.1 (Orthogonal Complement Residual Injection). Given attention matrix $A$ with dominant left eigenvector $\pi$ (the stationary distribution), define:
- The dominant projection: $P_{\mathrm{dom}} = \mathbf{1}\pi^\top$
- The complement projection: $P_\perp = I - \mathbf{1}\pi^\top$
- The OCRI-modified attention output: $X' = X + A X W_V + \beta\, P_\perp X W_C$
where $\beta$ is a learned scalar per layer and $W_C$ is a learned projection matrix.
The third term adds a controlled amount of signal from precisely the directions that the attention operator suppressed. The projection $P_\perp$ ensures this injection is orthogonal to the dominant attention signal, so it cannot interfere with or corrupt the primary information flow.
Computing $\pi$ efficiently. For a row-stochastic matrix $A$, the dominant left eigenvector satisfies $\pi^\top A = \pi^\top$. This is the stationary distribution of the Markov chain defined by $A$. It can be approximated in $O(n^2)$ by a single power iteration step from the uniform vector $u = \mathbf{1}/n$, since attention matrices are typically close to their stationary distribution after softmax normalization. In practice, $\pi \approx u$ for well-conditioned attention (early layers), and a single matrix-vector product $\pi \approx A^\top u$ suffices for later layers.
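A minimal sketch of this computation follows, assuming the stationary distribution is the left eigenvector with $\pi^\top A = \pi^\top$ and that the complement projection has the form $I - \mathbf{1}\pi^\top$. The function names and the random $6 \times 6$ matrix are hypothetical; the 200-step run is included only as a reference against which the one-step approximation can be compared.

```python
import numpy as np

def stationary_distribution(A, steps=1):
    """Approximate the left eigenvector pi (pi^T A = pi^T) by power iteration from uniform."""
    n = A.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(steps):
        pi = A.T @ pi
        pi = pi / pi.sum()  # keep it a probability vector
    return pi

def complement_projection(pi):
    """I - 1 pi^T: removes the component along the dominant eigenspace of A."""
    n = pi.shape[0]
    return np.eye(n) - np.outer(np.ones(n), pi)

# Hypothetical row-stochastic matrix standing in for a softmax attention matrix.
rng = np.random.default_rng(1)
scores = rng.normal(size=(6, 6))
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

pi_one_step = stationary_distribution(A, steps=1)     # the cheap approximation
pi_converged = stationary_distribution(A, steps=200)  # reference value
P_perp = complement_projection(pi_converged)

print("one-step error:", np.abs(pi_one_step - pi_converged).max())
```

With a converged $\pi$, `P_perp` annihilates the all-ones vector from the right and $\pi$ from the left, and it is idempotent, which is what makes it a projection onto the suppressed directions. How close the one-step estimate gets depends on the spectral gap of the particular matrix.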
5.2 Part II: Adaptive Spectral Damping (ASD)
OCRI restores entropy from outside the attention mechanism. ASD works from inside, directly modifying the attention matrix to control its spectral gap before it acts on the input.
Definition 5.2 (Adaptive Spectral Damping). Given attention matrix $A$ with spectral gap $\gamma$, define the damped attention matrix:
$$\tilde A = (1 - \alpha) A + \alpha I$$
where $\alpha \in [0, 1)$ is an adaptive damping coefficient and $I$ is the identity matrix.
Why the identity matrix? The identity is row-stochastic (each row sums to 1) with all eigenvalues equal to 1, giving spectral gap $\gamma(I) = 0$. Mixing $A$ with $I$ preserves the stochastic structure while reducing the spectral gap: the identity acts as pure self-attention (each token retains its own representation), counteracting the over-mixing that drives entropy decay.
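Lemma 5.1 (Spectral gap of damped attention). The damped matrix $\tilde A$ has eigenvalues $\tilde\lambda_1 = 1$ and $\tilde\lambda_i = (1 - \alpha)\lambda_i + \alpha$ for $i \ge 2$, giving spectral gap:
$$\tilde\gamma = (1 - \alpha)\,\gamma$$
Proof. Since $I$ commutes with every matrix, $A$ and $I$ share eigenspaces regardless of the structure of $A$. The eigenvalues of the convex combination are $(1 - \alpha)\lambda_i + \alpha$ for all $i$. For $i = 1$: $(1 - \alpha)\cdot 1 + \alpha = 1$. For $i \ge 2$ with $|\lambda_i| \le |\lambda_2|$: $|(1 - \alpha)\lambda_i + \alpha| \le (1 - \alpha)|\lambda_2| + \alpha$. Since $|\lambda_2| < 1$ for $\gamma > 0$ and $\alpha \in [0, 1)$, we have $(1 - \alpha)|\lambda_2| + \alpha < 1$. The triangle-inequality step is attained when $\lambda_2$ is real and non-negative; in that case $\tilde\gamma = 1 - (1 - \alpha)|\lambda_2| - \alpha = (1 - \alpha)(1 - |\lambda_2|) = (1 - \alpha)\gamma$, and otherwise $(1 - \alpha)\gamma$ is a lower bound on $\tilde\gamma$. No eigenvector approximation is needed because the identity commutes with all matrices exactly. QED.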
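Adaptive coefficient. To enforce a target spectral gap $\gamma^*$, we need $(1 - \alpha)\gamma \le \gamma^*$. Solving for $\alpha$:
$$\alpha = \max\!\left(0,\; 1 - \frac{\gamma^*}{\gamma}\right)$$
Lemma 5.2 (ASD achieves target spectral gap). With the adaptive coefficient above, $\tilde\gamma = \gamma^*$ whenever $\gamma > \gamma^*$, and $\tilde\gamma = \gamma$ (unchanged) when $\gamma \le \gamma^*$.
Proof. When $\gamma \le \gamma^*$: $\gamma^*/\gamma \ge 1$, so $1 - \gamma^*/\gamma \le 0$, giving $\alpha = 0$ and the matrix passes through unchanged with $\tilde\gamma = \gamma$. When $\gamma > \gamma^*$: $\alpha = 1 - \gamma^*/\gamma$, and $1 - \alpha = \gamma^*/\gamma$, so $\tilde\gamma = (\gamma^*/\gamma)\cdot\gamma = \gamma^*$. QED.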
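The damping rule can be exercised on a matrix whose spectrum is known. The sketch below assumes the damped matrix is $(1-\alpha)A + \alpha I$ with $\alpha = \max(0, 1 - \gamma^*/\gamma)$. The test matrix (a lazy random walk on a 6-cycle, with gap 0.25) and the function names are illustrative assumptions; it was chosen because its second eigenvalue is real and non-negative, the case in which the damped gap equals $(1-\alpha)\gamma$ exactly.

```python
import numpy as np

def spectral_gap(A):
    """1 - |lambda_2|, with lambda_2 the second-largest eigenvalue in magnitude."""
    mags = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return float(1.0 - mags[1])

def adaptive_damping(A, gamma_target):
    """Return (A_damped, alpha) with alpha = max(0, 1 - gamma_target / gamma)."""
    gamma = spectral_gap(A)
    alpha = max(0.0, 1.0 - gamma_target / gamma) if gamma > 0 else 0.0
    n = A.shape[0]
    return (1.0 - alpha) * A + alpha * np.eye(n), alpha

# Toy matrix: eigenvalues 1, 0.75, 0.75, 0.25, 0.25, 0, so the spectral gap is 0.25.
n = 6
P = np.roll(np.eye(n), 1, axis=1)
A = 0.5 * np.eye(n) + 0.25 * (P + P.T)

A_damped, alpha = adaptive_damping(A, gamma_target=0.05)    # gap above target: damp
A_same, alpha_same = adaptive_damping(A, gamma_target=0.5)  # gap below target: pass through

print(f"alpha = {alpha:.2f}, damped gap = {spectral_gap(A_damped):.3f}")
```

For attention matrices with a complex or negative second eigenvalue, the same code still runs, but the damped gap can exceed the target, so a practical implementation might re-measure the gap after damping.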
5.3 Part III: Entropic Spectral Regularization (ESR)
While OCRI and ASD operate at inference time, ESR operates at training time, encouraging the model to learn attention patterns with naturally small spectral gaps.
Definition 5.3 (ESR Loss). Add the following term to the training loss:
$$\mathcal{L}_{\mathrm{ESR}} = \eta_1 \sum_{\ell} \max(0,\; \gamma_\ell - \gamma^*)^2 \;-\; \eta_2 \sum_{\ell} H_s(A_\ell)$$
where $\eta_1, \eta_2 > 0$ are hyperparameters. The first term is a squared hinge loss on the spectral gap (differentiable, with gradient that grows linearly with violation). The second term directly maximizes spectral entropy.
Computing gradients. The spectral gap depends on the eigenvalues of $A$, which in turn depend on the query/key weights through the softmax. The gradient can be computed via implicit differentiation of the eigenvalue equation $A v_2 = \lambda_2 v_2$, yielding:
$$\frac{\partial \lambda_2}{\partial A} = \frac{u_2 v_2^\top}{u_2^\top v_2}$$
where $u_2, v_2$ are the left and right eigenvectors corresponding to $\lambda_2$. In practice, power iteration computes $\lambda_2$ and its eigenvectors, and automatic differentiation handles the rest.
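The eigenvalue derivative used here is the standard first-order perturbation result for a simple eigenvalue, $\partial\lambda/\partial A_{ij} = u_i v_j / (u^\top v)$ with $u$ and $v$ the left and right eigenvectors. The sketch below checks it against central finite differences on a small symmetric stochastic matrix with distinct real eigenvalues. The matrix and helper names are hypothetical, and the check covers only the real, simple-eigenvalue case.

```python
import numpy as np

def lambda2(A):
    """Second-largest eigenvalue in magnitude (assumed real here)."""
    vals = np.linalg.eigvals(A)
    return float(vals[np.argsort(-np.abs(vals))[1]].real)

def dlambda2_dA(A):
    """Analytic gradient u v^T / (u^T v) for a real, simple lambda_2."""
    vals, right = np.linalg.eig(A)
    idx = np.argsort(-np.abs(vals))[1]
    lam = vals[idx]
    v = right[:, idx].real                       # right eigenvector: A v = lam v
    vals_left, left = np.linalg.eig(A.T)
    u = left[:, np.argmin(np.abs(vals_left - lam))].real  # left eigenvector: u^T A = lam u^T
    return np.outer(u, v) / (u @ v)

# Lazy random walk on a 4-node path: eigenvalues 1, 0.854, 0.5, 0.146 (all distinct).
A = np.array([[0.75, 0.25, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.00, 0.00, 0.25, 0.75]])

analytic = dlambda2_dA(A)

h = 1e-6
numeric = np.zeros_like(A)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(A)
        E[i, j] = h
        numeric[i, j] = (lambda2(A + E) - lambda2(A - E)) / (2 * h)

print("max abs difference:", np.abs(analytic - numeric).max())
```

For the ESR hinge term, the gradient of the gap $\gamma = 1 - |\lambda_2|$ follows by the chain rule; when $\lambda_2$ is real and positive it is simply the negative of the matrix computed above.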
5.4 The complete EPCA forward pass
Combining all three components, the EPCA-modified Transformer layer computes the following procedure.
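Algorithm: EPCA Forward Pass (Layer $\ell$)
Input: $X^{(\ell)}$ (sequence representations), $\gamma^*$ (spectral gap target)
Output: $X^{(\ell+1)}$ (updated representations)
Step 1. Compute the attention matrix:
$$A = \mathrm{softmax}\!\left(\frac{(X^{(\ell)} W_Q)(X^{(\ell)} W_K)^\top}{\sqrt{d_k}}\right)$$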
Step 2. Estimate the spectral gap via power iteration ( steps). Initialize as a random unit vector, then repeat times: compute , then normalize . The approximate spectral gap is .
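Step 3. Apply Adaptive Spectral Damping (if needed), with $\hat\gamma$ the spectral gap estimated in Step 2:
$$\alpha = \max\!\left(0,\; 1 - \frac{\gamma^*}{\hat\gamma}\right), \qquad \tilde A = (1 - \alpha) A + \alpha I$$
Step 4. Compute the attention output:
$$Y = \tilde A\, X^{(\ell)} W_V$$
Step 5. Apply Orthogonal Complement Residual Injection. Approximate the stationary distribution as $\pi \approx A^\top u$ with $u = \mathbf{1}/n$, compute the complement projection $P_\perp = I - \mathbf{1}\pi^\top$, then:
$$X^{(\ell+1)} = X^{(\ell)} + Y + \beta\, P_\perp X^{(\ell)} W_C$$
Step 6. Return $X^{(\ell+1)}$.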
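The six steps can be sketched for a single head as follows. This is an illustration under several assumptions rather than a reference implementation: the gap estimate uses power iteration on the deflated matrix $A - \mathbf{1}\pi^\top$ (a deflation choice made for this sketch), the damping and injection follow the forms $(1-\alpha)A + \alpha I$ and $\beta (I - \mathbf{1}\pi^\top) X W_C$, and all names (`epca_layer`, `estimate_gap`, `W_C`, `beta`), sizes, and random weights are hypothetical.

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def estimate_gap(A, pi, K=5, seed=0):
    """Rough estimate of 1 - |lambda_2| via K power-iteration steps on A - 1 pi^T.

    Subtracting 1 pi^T (pi summing to 1) maps the eigenvalue 1 of a row-stochastic
    A to 0 and leaves the other eigenvalues unchanged (assumed deflation choice).
    """
    n = A.shape[0]
    B = A - np.outer(np.ones(n), pi)
    v = np.random.default_rng(seed).normal(size=n)
    v /= np.linalg.norm(v)
    for _ in range(K):
        w = B @ v
        norm = np.linalg.norm(w)
        if norm < 1e-12:
            return 1.0  # everything but the dominant mode has vanished
        v = w / norm
    return 1.0 - min(1.0, float(np.linalg.norm(B @ v)))

def epca_layer(X, W_Q, W_K, W_V, W_C, beta, gamma_target, K=5):
    n = X.shape[0]
    # Step 1: attention matrix.
    A = softmax_rows((X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1]))
    # One power-iteration step from uniform approximates the stationary distribution.
    pi = A.T @ np.full(n, 1.0 / n)
    # Step 2: spectral gap estimate.
    gamma_hat = estimate_gap(A, pi, K)
    # Step 3: adaptive spectral damping.
    alpha = max(0.0, 1.0 - gamma_target / gamma_hat) if gamma_hat > 0 else 0.0
    A_damped = (1.0 - alpha) * A + alpha * np.eye(n)
    # Step 4: attention output.
    Y = A_damped @ X @ W_V
    # Step 5: orthogonal complement residual injection.
    P_perp = np.eye(n) - np.outer(np.ones(n), pi)
    # Step 6: updated representations.
    return X + Y + beta * (P_perp @ X @ W_C)

rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_C = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
out = epca_layer(X, W_Q, W_K, W_V, W_C, beta=0.1, gamma_target=0.05)
print(out.shape)
```

Setting `beta=0.0` and `gamma_target=1.0` disables both modifications and recovers the standard residual attention update, which is a convenient sanity check when retrofitting the layer.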
5.5 Main theorem: EPCA eliminates the reasoning horizon
Theorem 5.3 (EPCA Reasoning Horizon). Under the EPCA mechanism with target spectral gap $\gamma^*$ and OCRI injection coefficient $\beta$, the spectral entropy of the composed operator satisfies:
$$H_s(A^{(L)}) \ge H_{\mathrm{eq}} > 0 \quad \text{for all } L$$
where $H_{\mathrm{eq}}$ is a positive constant independent of the number of layers $L$. Consequently, the reasoning horizon becomes:
$$L^* = \infty$$
provided $\beta > \beta_{\mathrm{crit}}$, where $\beta_{\mathrm{crit}}$ is a critical injection threshold defined below.
Full proof.
The proof proceeds in four parts: (A) we bound the entropy decay rate under ASD alone, (B) we compute the entropy injection rate from OCRI, (C) we find the equilibrium, and (D) we derive the critical injection threshold.
(A) Entropy decay rate under ASD.
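With ASD enforcing $\tilde\gamma_\ell \le \gamma^*$, Theorem 3.1 gives us the per-layer entropy loss:
$$\Delta H_{\mathrm{decay}} = -\log(1 - \gamma^*) \approx \gamma^*$$
for small $\gamma^*$.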
(B) Entropy injection rate from OCRI.
The OCRI term $\beta P_\perp X W_C$ adds a component in the orthogonal complement of the dominant eigenvector. The modified output is $X' = X + A X W_V + Z$ where $Z = \beta P_\perp X W_C$ lives entirely in the $(n-1)$-dimensional complement of $\pi$ in position space.
The full layer operation acts on two dimensions simultaneously: $A$ and $P_\perp$ act on positions (left-multiplying $X$), while $W_V$ and $W_C$ act on embeddings (right-multiplying). To analyze the position-space spectral entropy, we consider the effect on the position dimension separately. The attention operator contracts position-space representations toward the stationary distribution $\pi$, suppressing the orthogonal directions. The OCRI term, by construction, has zero component along $\pi$ (due to $P_\perp$) and injects energy exclusively into these suppressed directions.
Let $\sigma_{\min}$ denote the smallest singular value of the learned projection $W_C$. For any direction in the complement of $\pi$, the OCRI term contributes at least $\beta\,\sigma_{\min}$ energy along that direction, preventing the position-space singular values from collapsing to zero. Across the $(n-1)$-dimensional complement subspace, this provides a lower bound on the position-space diversity maintained through each layer.
The spectral entropy contribution from these boosted eigenvalues is:
For small , this simplifies to:
(C) Entropy equilibrium.
At equilibrium, the entropy injection rate equals the decay rate:
This has a solution for any provided and (the learned projection has nonzero minimum singular value). The equilibrium spectral entropy is the value at which this balance holds:
where depends on , , and .
(D) Critical injection threshold.
Setting and solving for :
For typical values (, , ):
This is an extremely small injection coefficient, meaning OCRI requires only a tiny perturbation to fully counteract the spectral entropy decay. QED.
5.6 Convergence guarantee
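Theorem 5.4 (EPCA convergence to entropy equilibrium). For $\beta > \beta_{\mathrm{crit}}$, the spectral entropy of the composed operator converges exponentially fast to the equilibrium $H_{\mathrm{eq}}$:
$$|H_L - H_{\mathrm{eq}}| \le \rho^L\, |H_0 - H_{\mathrm{eq}}|$$
where $\rho \in (0, 1)$ is the convergence rate determined by the entropy dynamics at equilibrium.
Proof. Define $H_L = H_s(A^{(L)})$. The dynamics under EPCA are:
$$H_{L+1} = H_L - \Delta H_{\mathrm{decay}} + g(\beta, H_L)$$
where $g(\beta, H)$ is the entropy injection from OCRI, which is increasing in $\beta$ and decreasing in $H$ (when entropy is already high, the complement subspace has less room for injection). At $H = H_{\mathrm{eq}}$, we have $g(\beta, H_{\mathrm{eq}}) = \Delta H_{\mathrm{decay}}$ (the equilibrium condition). Linearizing around $H_{\mathrm{eq}}$:
$$H_{L+1} - H_{\mathrm{eq}} \approx (1 - c)\,(H_L - H_{\mathrm{eq}})$$
where $c = -\partial g / \partial H \,|_{H_{\mathrm{eq}}} > 0$ (positive because $g$ is decreasing in $H$). For $\beta > \beta_{\mathrm{crit}}$, the injection function crosses the decay rate at a unique equilibrium $H_{\mathrm{eq}}$, and the negative slope of $g$ at this crossing guarantees $\rho = |1 - c| < 1$ provided the slope is not too steep ($c < 2$). This is a contraction mapping, giving exponential convergence. QED.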
5.7 Quantitative comparison: standard Transformer vs. EPCA
| Metric | Standard Transformer | With ESR only | Full EPCA |
|---|---|---|---|
| Spectral gap | 0.25 | 0.05 | Dynamically bounded at 0.05 |
| Entropy behavior | Monotonic decay to 0 | Slower decay to 0 | Convergence to $H_{\mathrm{eq}} > 0$ |
| Reasoning horizon | ~14 attention compositions | ~56 compositions | Unbounded |
| Hallucination onset | After ~14 attention compositions | After ~56 compositions | No attention-induced onset (for $\beta > \beta_{\mathrm{crit}}$) |
| Training overhead | 0% | ~10% | ~10% (ESR only at training) |
| Inference overhead | 0% | 0% | ~3-8% (OCRI + ASD) |
The critical result: EPCA transforms the attention-induced reasoning horizon from a hard finite ceiling into an unbounded capacity, at the cost of a small inference overhead. The elimination of attention-induced hallucination onset is conditional on $\beta > \beta_{\mathrm{crit}}$ and assumes the learned projection $W_C$ maintains a nonzero minimum singular value, which can be enforced via spectral normalization of $W_C$ during training. Other sources of hallucination (e.g., training data gaps, output distribution calibration) are outside the scope of this mechanism.
5.8 Computational cost analysis
The EPCA overhead per layer consists of:
- Power iteration for spectral gap estimation: $O(K n^2)$ where $K \approx 5$ iterations. This dominates.
- Stationary distribution approximation: $O(n^2)$ for one matrix-vector product.
- Complement projection: $O(nd)$ for the rank-1 update.
- Identity damping: $O(n^2)$ for scaling $A$ by $(1 - \alpha)$ and adding $\alpha$ to the diagonal.
The total per-layer overhead is $O(K n^2)$, compared to the $O(n^2 d)$ cost of standard attention. Since $K \ll d$ (5 vs. 768+), the overhead ratio is approximately $K/d$ per layer, which is under 1%. The practical overhead is 3-8% due to memory access patterns and the additional learned parameters ($\beta$ and $W_C$).
For the training-time ESR component, gradient computation through the eigenvalue requires one backward pass through the power iteration, adding approximately 10% to training cost. This is a one-time cost that produces permanently better-conditioned attention matrices.
6. Experimental Predictions and Falsifiability
6.1 Concrete predictions
SEBT + EPCA makes the following falsifiable predictions:
Prediction 1 (Spectral-depth correlation). For any pretrained Transformer, the depth at which compositional reasoning accuracy drops below 50% should correlate with with .
Prediction 2 (Head-level variation). Attention heads with smaller spectral gaps should contribute more to compositional reasoning. Ablating low-spectral-gap heads should disproportionately damage reasoning performance.
Prediction 3 (EPCA improvement). Adding OCRI + ASD to a pretrained model (as a fine-tuning step) should extend measurable compositional reasoning depth by a factor of at least .
Prediction 4 (Hallucination correlation). The probability of hallucination on multi-step reasoning tasks should increase sharply at layer depth $L^*$ and follow the bound of Theorem 3.3 for $L > L^*$.
Prediction 5 (Scale independence). Doubling model parameters while keeping architecture fixed should not change the attention-measured reasoning horizon by more than 15%, since $L^*$ depends on spectral properties of the attention mechanism, not total parameter count.
6.2 Proposed experimental protocol
To validate or falsify SEBT:
- Select 3+ Transformer models of different sizes (e.g., 1B, 7B, 70B parameters).
- For each model, measure the spectral gap and spectral entropy of every attention head at every layer, using a standard evaluation corpus.
- Compute the predicted reasoning horizon from the measured values.
- Evaluate compositional reasoning on tasks with controlled depth (e.g., multi-step arithmetic, syllogistic chains, nested function composition).
- Plot actual accuracy vs. predicted reasoning horizon. If SEBT is correct, accuracy should drop sharply near $L^*$.
This experiment requires only standard tools (eigenvalue computation on attention matrices, which are already cached during inference) and standard benchmarks.
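The measurement step of this protocol can be sketched as below. The sketch assumes attention weights have already been collected into an array of shape `(layers, heads, n, n)`; how they are extracted from a given model is left out, and the random tensor used here is only a stand-in. The function name `head_spectral_stats`, the choice to take $H_0$ as the mean first-layer entropy, and the horizon formula $(H_0 - \log k)/(-\log(1 - \bar\gamma))$ are assumptions of this example.

```python
import numpy as np

def head_spectral_stats(attn, tol=1e-12):
    """Per-head spectral gap and spectral entropy.

    attn: array of shape (layers, heads, n, n) holding row-stochastic attention matrices.
    Returns (gaps, entropies), each of shape (layers, heads).
    """
    mags = np.abs(np.linalg.eigvals(attn))   # eigenvalue magnitudes, (layers, heads, n)
    mags = -np.sort(-mags, axis=-1)          # sort descending along the last axis
    gaps = 1.0 - mags[..., 1]
    p = mags / mags.sum(axis=-1, keepdims=True)
    safe = np.where(p > tol, p, 1.0)         # log(1) = 0 for numerically zero eigenvalues
    entropies = -(p * np.log(safe)).sum(axis=-1)
    return gaps, entropies

# Stand-in for cached attention weights (hypothetical sizes).
rng = np.random.default_rng(0)
layers, heads, n = 2, 3, 8
scores = rng.normal(size=(layers, heads, n, n))
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

gaps, entropies = head_spectral_stats(attn)
H0 = entropies[0].mean()   # assumed: initial entropy taken from the first layer
gamma_bar = gaps.mean()    # average spectral gap over all layers and heads
k = 2                      # binary branching factor
predicted_horizon = (H0 - np.log(k)) / (-np.log1p(-gamma_bar))

print(f"H0 = {H0:.3f}, mean gap = {gamma_bar:.3f}, predicted horizon = {predicted_horizon:.2f}")
```

Averaging over heads is only one possible aggregation; a per-head or minimum-gap summary may be more appropriate depending on how the multi-head bound is interpreted.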
7. Discussion
7.1 Why this framework is necessary
The existing theoretical landscape for LLM failures is fragmented. Computability-theoretic proofs of hallucination inevitability tell us that hallucination happens but not when or how much. Communication complexity lower bounds on attention expressivity tell us that compositional limits exist but not where the boundary lies. Spectral analyses of attention matrices describe what happens to the eigenvalue distribution but not why it matters for downstream reasoning.
SEBT bridges all three domains. It starts from the spectral properties of attention (the "what"), derives information-theoretic consequences (the "why"), and produces constructive bounds on reasoning depth (the "where"). EPCA then provides a constructive solution that provably eliminates the reasoning horizon.
7.2 Limitations and open questions
Several aspects require further investigation:
Scope of the theoretical model. The spectral entropy bounds are derived for the attention mechanism in isolation, abstracting away MLP blocks and layer normalization. In practice, MLPs perform substantial representational transformations between attention layers, and layer normalization re-scales the spectrum. These components may partially compensate for attention-induced entropy decay, so the theoretical reasoning horizon is a conservative estimate: it bounds the attention pathway alone, and the full model may exceed it. The OCRI mechanism subsumes standard residual connections by injecting signal in a more targeted subspace. The exact interaction between OCRI, standard residuals, MLPs, and layer normalization is architecture-dependent and requires empirical measurement.
Multi-head aggregation. With $h$ heads, each head $j$ has its own spectral gap $\gamma_j$. The effective composed spectral gap for the multi-head mechanism is bounded by $\min_j \gamma_j$ (the best head dominates). EPCA can be applied per-head, with independent learned parameters for each head.
Non-stochastic corrections. Attention matrices with very sharp or very flat softmax may deviate from the stochastic matrix assumptions. The Bauer-Fike perturbation bounds used in the proofs handle this, but tighter bounds are possible using structured perturbation theory specific to softmax-generated matrices.
Empirical validation at scale. The most important next step is direct experimental validation. The predictions in Section 6 are designed to be easily testable with existing infrastructure.
Relationship to chain-of-thought. Chain-of-thought prompting resets the spectral entropy by generating intermediate tokens and re-attending. EPCA achieves a similar effect internally, without requiring explicit intermediate generation. An open question is whether EPCA can fully replace chain-of-thought reasoning, or whether external token generation provides benefits beyond spectral entropy restoration (e.g., working memory expansion).
7.3 Implications for architecture design
If SEBT is correct, several architectural implications follow:
- Depth is not free. Adding more layers to a Transformer does not linearly increase the attention mechanism's compositional capacity. Beyond the reasoning horizon, additional attention layers yield diminishing returns. EPCA makes deep layers useful again by maintaining spectral entropy.
- Attention alternatives matter. Architectures that replace softmax with mechanisms having smaller spectral gaps (such as linear attention with ReLU kernels, which maintain entropy at per the EMNLP 2025 findings) should exhibit deeper reasoning horizons natively. EPCA can be seen as retrofitting this property onto softmax attention.
- Chain-of-thought works because it resets the bottleneck. When a model generates intermediate tokens and re-attends, it resets to a fresh attention matrix, restoring spectral entropy to $H_0$. Chain-of-thought extends the effective reasoning depth from $L^*$ to $m L^*$, where $m$ is the number of segments. EPCA provides the same benefit without the token overhead.
- EPCA as a drop-in upgrade. Because OCRI and ASD only modify the attention output and the attention matrix respectively, they can be added to any existing Transformer architecture without changing the model's parameter count significantly. The learned parameters ($\beta$, $W_C$) can be trained via fine-tuning, making EPCA applicable to pretrained models.
8. Conclusion
We have introduced the Spectral-Entropic Bottleneck Theory, a unified mathematical framework that explains two of the most critical failures of large language models, compositional reasoning collapse and hallucination, as consequences of a single mechanism: the monotonic decay of spectral entropy in attention matrices across layers.
The theory produces a novel quantity, the Spectral-Entropic Capacity, which provides a constructive upper bound on the maximum compositional reasoning depth:
$$L^* \le \mathrm{SEC} = \frac{H_0}{-\log(1 - \bar\gamma)} \approx \frac{H_0}{\bar\gamma}$$
We then presented Entropy-Preserving Composed Attention (EPCA), a three-part architectural solution that provably eliminates the reasoning horizon:
- Orthogonal Complement Residual Injection (OCRI) injects information from the entropy-deficient subspace, counteracting eigenvalue contraction.
- Adaptive Spectral Damping (ASD) dynamically controls the spectral gap of each attention matrix to a target level.
- Entropic Spectral Regularization (ESR) encourages the model to learn naturally well-conditioned attention during training.
We proved that EPCA achieves a stable entropy equilibrium independent of depth, eliminating the attention-induced reasoning horizon for $\beta > \beta_{\mathrm{crit}}$. The critical injection threshold $\beta_{\mathrm{crit}}$ is remarkably small, meaning the architectural modification is a minimal perturbation that produces a qualitative change in the attention mechanism's compositional capacity.
The most important contributions of this work are twofold. First, hallucination and reasoning failure share a common spectral-entropic root in the attention mechanism. Second, the attention bottleneck can be provably eliminated through targeted architectural intervention, not through scaling. The solution is small, efficient, and retrofittable to existing models. The bounds derived here are conservative, addressing only the attention component; the full Transformer's reasoning capacity involves additional pathways (MLPs, residuals) that merit further theoretical investigation.
References
Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817.
Banerjee, S., et al. (2024). LLMs Will Always Hallucinate, and We Need to Live With This. arXiv:2409.05746.
Zhai, S., et al. (2023). Stabilizing Transformer Training by Preventing Attention Entropy Collapse. ICML 2023.
Mind the Gap: A Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers. (2025). arXiv:2410.07799.
Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning. (2026). arXiv:2601.00791.
Lei, S., et al. (2025). Revisiting LLM Reasoning via Information Bottleneck. arXiv:2507.18391.
Cui, Y., et al. (2025). The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models. arXiv:2505.22617.
Apple ML Research. (2025). The Illusion of Thinking. Apple ML Technical Report.
Raju, S. & Netrapalli, P. (2025). A Model of Errors in Transformers. arXiv:2601.14175.
Papadimitriou, C., et al. (2024). On Limitations of the Transformer Architecture. NSF PAR.
Mudarisov, T., et al. (2025). Limitations of Normalization in Attention Mechanism. arXiv:2508.17821.
Duman Keles, F., et al. (2023). On The Computational Complexity of Self-Attention. PMLR.
Agarwal, R., et al. (2025). The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning. arXiv:2505.15134.
ACL Findings. (2025). Implicit Reasoning in Transformers is Reasoning through Shortcuts.
EMNLP Main. (2025). Variance Sensitivity Induces Attention Entropy Collapse.
Unpacking Softmax: How Temperature Drives Representation Collapse. (2025). arXiv:2506.01562.
