By proposing a new metric called distribution locality, which measures the number of tokens beyond local statistics required to correlate with the target distribution, this section formalizes the local reasoning barrier in Transformers. The authors demonstrate that tasks with high locality are intrinsically challenging for Transformers trained using stochastic gradient descent using the cycle task.By proposing a new metric called distribution locality, which measures the number of tokens beyond local statistics required to correlate with the target distribution, this section formalizes the local reasoning barrier in Transformers. The authors demonstrate that tasks with high locality are intrinsically challenging for Transformers trained using stochastic gradient descent using the cycle task.

Understanding the Local Reasoning Barrier in Transformer Models

2025/11/03 19:10
4 min read
For feedback or concerns regarding this content, please contact us at crypto.news@mexc.com

\

Abstract and 1. Introduction

1.1 Syllogisms composition

1.2 Hardness of long compositions

1.3 Hardness of global reasoning

1.4 Our contributions

  1. Results on the local reasoning barrier

    2.1 Defining locality and auto-regressive locality

    2.2 Transformers require low locality: formal results

    2.3 Agnostic scratchpads cannot break the locality

  2. Scratchpads to break the locality

    3.1 Educated scratchpad

    3.2 Inductive Scratchpads

  3. Conclusion, Acknowledgments, and References

A. Further related literature

B. Additional experiments

C. Experiment and implementation details

D. Proof of Theorem 1

E. Comment on Lemma 1

F. Discussion on circuit complexity connections

G. More experiments with ChatGPT

2 Results on the local reasoning barrier

Prior literature. Much work in the learning literature has been devoted to obtaining complexity measures for the sample/time complexity of learning. The largest portion is devoted to target classes in PAC settings, e.g., with the VC dimension measures [18], and some to statistical query (SQ) settings with the statistical dimension measures [14, 19]. Here, we are however interested in measures that are relevant to (1) regular Transformers (or related models) trained by (S)GD, and (2) data distribution fixed by a task. Some recent literature has studied complexity measures for (S)GD-trained neural networks. Various settings and measures have been used, such as the noise sensitivity [20, 6, 21], the cross-predictability [12, 15], the NTK alignment [22, 23], the INAL [24], the G-alignment [13], the information and generative exponents [25, 26, 27] and the leap [28]; we refer to Appendix A.2 for discussions on these.

\ However, despite this significant body of work, finding a simple measure giving a tight proxy for Transformer weak learning (i.e., the first non-trivial learning requirement) on a given data distribution, remains unsettled. We next propose such a measure.

2.1 Defining locality and auto-regressive locality

We define now the notion of distribution locality, which in turn will quantify the notion of locality (or local reasoning) barrier.

\

We now define the globality in the autoregressive setting.

\

\ In the auto-regressive setting, the locality is mostly relevant when weak learning gives strong learning, in order to let the scratchpad learn each step.

As we will see in the next section, the locality is put forward as a tight proxy to understand efficient weak learning of regular Transformers for arbitrary data distributions. We first present the operational advantages of the definition, going back to the running example of the cycle task.

Attributes of glob and some examples. The locality has the attributes that of being (i) a fairly explicit measure, (ii) applicable to any data distribution on tokens without having to infer a distribution class from the model invariances to estimate the distribution complexity, (iii) not limited to i.i.d. inputs but any input distribution, (iv) relevant to current models of interest such as Transformers.

\

\ As discussed in the next section, this explains why the cycle task is hard to learn. In contrast, the example at the beginning of Section 1.2 has a much lower locality, as being connected correlates to query nodes having large enough degrees, and thus it can be expected for the model to learn with non-trivial accuracy (e.g., using degree shortcuts).

\

2.2 Transformers require low locality: formal results

\

We now state the general conjecture putting forward the globality barrier for learning.

\

\ Remark 3. (1) An important property of the learning model for the above conjecture is that the probability distribution of the function computed by the model is invariant under permutations of the inputs, and if it is trained in a reasonable way on samples drawn from a probability distribution drawn from a class that is symmetric under permutations of the inputs, its probability distribution will retain its symmetry under permutations of the inputs. For MLPs, we expect most of the results in this paper to apply, with the modification of

\

\

We prove the negative side of Conjecture 1 for a variant of the cycle task

\

The proof of Theorem 1 is presented in Appendix F.

2.3 Agnostic scratchpads cannot break the globality

Next, we put forward a conjecture that agnostic scratchpads (scratchpads without direct supervision on the scratchpad tokens) cannot break the globality barrier.

\

\ A natural counterpart of Theorem 1 holds for the previous conjecture (see Theorem 2). In order to define the Transformer’s loss on any given input it takes the expectation over every possible value of the scratchpad it might generate, and its proof is essentially identical to that of Theorem 1.

\

\

\ \

:::info Authors:

(1) Emmanuel Abbe, Apple and EPFL;

(2) Samy Bengio, Apple;

(3) Aryo Lotf, EPFL;

(4) Colin Sandon, EPFL;

(5) Omid Saremi, Apple.

:::


:::info This paper is available on arxiv under CC BY 4.0 license.

:::

\

Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact crypto.news@mexc.com for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.

You May Also Like

Xenea Wallet Daily Quiz March 16, 2026: Claim Your Free Crypto Coins Now

Xenea Wallet Daily Quiz March 16, 2026: Claim Your Free Crypto Coins Now

Xenea Wallet Daily Quiz Encourages Learning Through Rewards Educational features within blockchain platforms are becoming increasingly common as developers att
Share
Hokanews2026/03/16 04:33
Polygon Tops RWA Rankings With $1.1B in Tokenized Assets

Polygon Tops RWA Rankings With $1.1B in Tokenized Assets

The post Polygon Tops RWA Rankings With $1.1B in Tokenized Assets appeared on BitcoinEthereumNews.com. Key Notes A new report from Dune and RWA.xyz highlights Polygon’s role in the growing RWA sector. Polygon PoS currently holds $1.13 billion in RWA Total Value Locked (TVL) across 269 assets. The network holds a 62% market share of tokenized global bonds, driven by European money market funds. The Polygon POL $0.25 24h volatility: 1.4% Market cap: $2.64 B Vol. 24h: $106.17 M network is securing a significant position in the rapidly growing tokenization space, now holding over $1.13 billion in total value locked (TVL) from Real World Assets (RWAs). This development comes as the network continues to evolve, recently deploying its major “Rio” upgrade on the Amoy testnet to enhance future scaling capabilities. This information comes from a new joint report on the state of the RWA market published on Sept. 17 by blockchain analytics firm Dune and data platform RWA.xyz. The focus on RWAs is intensifying across the industry, coinciding with events like the ongoing Real-World Asset Summit in New York. Sandeep Nailwal, CEO of the Polygon Foundation, highlighted the findings via a post on X, noting that the TVL is spread across 269 assets and 2,900 holders on the Polygon PoS chain. The Dune and https://t.co/W6WSFlHoQF report on RWA is out and it shows that RWA is happening on Polygon. Here are a few highlights: – Leading in Global Bonds: Polygon holds 62% share of tokenized global bonds (driven by Spiko’s euro MMF and Cashlink euro issues) – Spiko U.S.… — Sandeep | CEO, Polygon Foundation (※,※) (@sandeepnailwal) September 17, 2025 Key Trends From the 2025 RWA Report The joint publication, titled “RWA REPORT 2025,” offers a comprehensive look into the tokenized asset landscape, which it states has grown 224% since the start of 2024. The report identifies several key trends driving this expansion. According to…
Share
BitcoinEthereumNews2025/09/18 00:40
Watch Out: Entering a Very Critical Week – Numerous Economic Developments and Altcoin Events Ahead This Week – Here’s the Day-by-Day, Hour-by-Hour Schedule

Watch Out: Entering a Very Critical Week – Numerous Economic Developments and Altcoin Events Ahead This Week – Here’s the Day-by-Day, Hour-by-Hour Schedule

The cryptocurrency market will witness numerous significant economic developments and altcoin events in the coming week. Here's the list. Continue Reading: Watch
Share
Bitcoinsistemi2026/03/16 04:21