
Sinusoidal Positional Encoding in Transformers: A Deep Dive

2026/01/05 20:51


When the Transformer architecture was introduced in the landmark 2017 paper “Attention Is All You Need,” it revolutionized natural language processing. But buried within its elegant self-attention mechanism was a critical detail that made everything work: positional encoding. And not just any positional encoding, but a carefully designed sinusoidal one. Let’s understand why this matters and what makes sinusoidal encoding special (it is also a very common interview question).

The Problem: Transformers Have No Sense of Order

Unlike recurrent neural networks (RNNs) that process sequences one token at a time, Transformers process entire sequences in parallel. This parallelization is their superpower, making them incredibly fast to train. But it comes with a catch: the model has no inherent way to understand the order of tokens.

Consider these two sentences:

  • “The cat chased the mouse”
  • “The mouse chased the cat”

Without positional information, a Transformer would treat these identically because it just sees the same bag of words. The meaning is completely different, but the model wouldn’t know which word came first. This is catastrophic for language understanding.

We need to inject positional information into the model somehow. But how?

Why Not Simple Linear Encoding?

Your first instinct might be to use simple integer positions: assign position 1 to the first word, position 2 to the second, and so on. This seems intuitive, but it creates several problems.

Problem 1: Unbounded Values

With linear encoding, position values grow without limit. The 1000th token gets a value of 1000, which is vastly different in scale from the first few tokens. Neural networks struggle with such varying scales because the model parameters need to handle both tiny and huge numbers simultaneously. This makes training unstable.

Problem 2: No Generalization to Longer Sequences

If your model trains on sequences of maximum length 512, it never sees position 513 or beyond. With linear encoding, these unseen positions are completely out of distribution. The model has no way to extrapolate what position 600 means because it’s never encountered anything like it during training.

Problem 3: No Meaningful Relationships

Linear encoding doesn’t capture any useful relationships between positions. Is position 50 somehow related to position 51? With raw integers, the model must learn these relationships from scratch with no inductive bias to help.

Why Sinusoidal Encoding?

Sinusoidal positional encoding solves all these problems elegantly. The key insight is to use sine and cosine functions with different frequencies to create unique, bounded encodings for each position.

Here’s the mathematical formulation for a position pos and dimension pair index i (pairing even dimension 2i with odd dimension 2i+1):

For the even dimensions (2i = 0, 2, 4, …):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

For the odd dimensions (2i+1 = 1, 3, 5, …):

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where d_model is the dimension of the embedding space (typically 512 or 768).
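
To make the formula concrete, here is a minimal NumPy sketch that builds the full encoding matrix; the function name sinusoidal_encoding and its parameters max_len and d_model are illustrative choices, not taken from any reference implementation:

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]            # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # even indices 2i = 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)   # one frequency per dimension pair

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_encoding(max_len=512, d_model=8)
print(pe.shape)              # (512, 8)
print(pe.min(), pe.max())    # every value stays within [-1, 1]
```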

Let’s break down why this works so well.

The Mathematics Behind the Magic

Bounded Values

Sine and cosine functions always output values between -1 and 1, regardless of input. This means position 1 and position 1000 both have encodings in the same range. There are no scaling issues: the neural network can handle these values comfortably across all positions, which also helps keep gradients from vanishing or exploding.

Different Frequencies for Different Dimensions

The term 10000^(2i/d_model) creates a different frequency for each dimension pair. Lower dimensions oscillate rapidly (high frequency), while higher dimensions oscillate slowly (low frequency).

Different dimensions oscillate at different frequencies to create unique position fingerprints

Think of this like a binary counter, but with smooth sinusoidal waves instead of discrete bits. Lower dimensions change with every position, while higher dimensions change only gradually. This creates a unique “fingerprint” for each position.

For dimension 0, the wavelength is 2π (it changes rapidly with position). For the highest dimensions, the wavelength approaches 2π × 10000 (it changes very slowly).

This multi-scale representation means nearby positions have similar encodings, while far positions are distinguishable.
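
As a rough illustration of this multi-scale behaviour, the sketch below (assuming the same NumPy setup as earlier and a small illustrative d_model of 8) prints the wavelength 2π × 10000^(2i/d_model) of each dimension pair:

```python
import numpy as np

d_model = 8
for two_i in range(0, d_model, 2):
    wavelength = 2 * np.pi * 10000 ** (two_i / d_model)
    print(f"dims {two_i}/{two_i + 1}: wavelength ≈ {wavelength:.1f} positions")

# dims 0/1: wavelength ≈ 6.3 positions      (fast oscillation)
# dims 2/3: wavelength ≈ 62.8 positions
# dims 4/5: wavelength ≈ 628.3 positions
# dims 6/7: wavelength ≈ 6283.2 positions   (slow oscillation)
```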

Linear Relationships Through Trigonometry

Here’s the mathematical beauty: sinusoidal functions have a special property that allows the model to learn relative positions easily.

For any fixed offset k, the encoding at position (pos + k) can be represented as a linear transformation of the encoding at position pos:

PE(pos + k) = T × PE(pos)

where T is a transformation matrix that depends only on k, not on pos.

This comes from the angle addition formulas:

sin(α + β) = sin(α)cos(β) + cos(α)sin(β)

cos(α + β) = cos(α)cos(β) - sin(α)sin(β)

What this means in practice: if the model learns that “words 3 positions apart tend to be related,” it can apply this learning uniformly across the entire sequence. The relationship between positions 5 and 8 is encoded the same way as between positions 50 and 53.
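
Here is a quick numerical check of this property for a single frequency; the names omega, pos, and k are illustrative, and the 2×2 matrix T is simply the rotation implied by the angle-addition formulas above:

```python
import numpy as np

d_model, two_i = 8, 2                       # look at the dimension pair (2, 3)
omega = 1.0 / 10000 ** (two_i / d_model)    # angular frequency of this pair
pos, k = 17, 3                              # any position and any fixed offset

pe_pos   = np.array([np.sin(omega * pos), np.cos(omega * pos)])
pe_shift = np.array([np.sin(omega * (pos + k)), np.cos(omega * (pos + k))])

# T depends only on the offset k, never on pos
T = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

print(np.allclose(T @ pe_pos, pe_shift))    # True, for any pos
```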

Extrapolation to Unseen Lengths

Because the encoding is a continuous function, the model can theoretically handle any position, even those longer than it saw during training. The sinusoidal function doesn’t suddenly break at position 513 just because training stopped at 512. The pattern continues smoothly.

In practice, there are still challenges with very long sequences, but sinusoidal encoding at least gives the model a fighting chance, whereas linear encoding would fail completely.

Example

Let’s visualize this with a simple example. Suppose we have a 4-dimensional embedding space (in reality, it’s much larger):

For position 0:

  • Dimension 0: sin(0 / 10000^(0/4)) = sin(0) = 0
  • Dimension 1: cos(0 / 10000^(0/4)) = cos(0) = 1
  • Dimension 2: sin(0 / 10000^(2/4)) = sin(0) = 0
  • Dimension 3: cos(0 / 10000^(2/4)) = cos(0) = 1

For position 1:

  • Dimension 0: sin(1 / 1) ≈ 0.841
  • Dimension 1: cos(1 / 1) ≈ 0.540
  • Dimension 2: sin(1 / 100) ≈ 0.010
  • Dimension 3: cos(1 / 100) ≈ 0.9999

Notice how the lower dimensions (0, 1) change significantly between positions, while higher dimensions (2, 3) change only slightly. This multi-resolution encoding captures both fine-grained and coarse positional information.
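
If you want to reproduce these numbers yourself, here is a small sketch using the same formula with d_model = 4 (the rounding to five decimals is just for readability):

```python
import numpy as np

d_model = 4
for pos in (0, 1):
    values = []
    for two_i in range(0, d_model, 2):
        angle = pos / 10000 ** (two_i / d_model)
        values += [np.sin(angle), np.cos(angle)]
    print(pos, [round(float(v), 5) for v in values])

# 0 [0.0, 1.0, 0.0, 1.0]
# 1 [0.84147, 0.5403, 0.01, 0.99995]
```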

Why 10000 as the Base?

The choice of 10000 as the base in the formula isn’t arbitrary. It’s chosen to create a geometric progression of wavelengths across dimensions that works well for typical sequence lengths in NLP tasks.

With this base, the wavelengths range from 2π (minimum) to approximately 20,000π (maximum) for a 512-dimensional model. This range is suitable for sequences of a few thousand tokens, which covers most practical use cases.

Conclusion

Sinusoidal positional encoding takes the simple requirement of “telling the model what order tokens appear in” and solves it with a mathematically principled approach that provides bounded values, smooth interpolation, learnable relative-position relationships, and reasonable extrapolation to unseen sequence lengths.

The next time you use ChatGPT or any other Transformer-based model, remember that buried in those billions of parameters is a surprisingly simple sine wave helping the model understand that “cat chased mouse” is very different from “mouse chased cat.”

Thank you for reading! 🤗 I hope you found this article both informative and enjoyable.

For more AI content, follow me on LinkedIn and give me a clap 👏.


