When the Transformer architecture was introduced in the landmark 2017 paper “Attention Is All You Need,” it revolutionized natural language processing. But buried within its elegant self-attention mechanism was a critical detail that made everything work: positional encoding. And not just any positional encoding, but a carefully designed sinusoidal one. Let’s understand why this matters and what makes sinusoidal encoding special. (It’s also a very common interview question!)
Unlike recurrent neural networks (RNNs) that process sequences one token at a time, Transformers process entire sequences in parallel. This parallelization is their superpower, making them incredibly fast to train. But it comes with a catch: the model has no inherent way to understand the order of tokens.
Consider these two sentences:
“The cat chased the mouse.”
“The mouse chased the cat.”
Without positional information, a Transformer would treat these identically because it just sees the same bag of words. The meaning is completely different, but the model wouldn’t know which word came first. This is catastrophic for language understanding.
We need to inject positional information into the model somehow. But how?
Your first instinct might be to use simple integer positions: assign position 1 to the first word, position 2 to the second, and so on. This seems intuitive, but it creates several problems.
Problem 1: Unbounded Values
With linear encoding, position values grow without limit. The 1000th token gets a value of 1000, which is vastly different in scale from the first few tokens. Neural networks struggle with such varying scales because the model parameters need to handle both tiny and huge numbers simultaneously. This makes training unstable.
Problem 2: No Generalization to Longer Sequences
If your model trains on sequences of maximum length 512, it never sees position 513 or beyond. With linear encoding, these unseen positions are completely out of distribution. The model has no way to extrapolate what position 600 means because it’s never encountered anything like it during training.
Problem 3: No Meaningful Relationships
Linear encoding doesn’t capture any useful relationships between positions. Is position 50 somehow related to position 51? With raw integers, the model must learn these relationships from scratch with no inductive bias to help.
Sinusoidal positional encoding solves all these problems elegantly. The key insight is to use sine and cosine functions with different frequencies to create unique, bounded encodings for each position.
Here’s the mathematical formulation for a position pos and dimension i:
For even dimensions (i = 0, 2, 4, …):
PE(pos, i) = sin(pos / 10000^(i/d_model))
For the paired odd dimensions (i + 1 = 1, 3, 5, …):
PE(pos, i + 1) = cos(pos / 10000^(i/d_model))
where d_model is the dimension of the embedding space (typically 512 or 768).
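To make the formula concrete, here is a minimal NumPy sketch that builds the full positional encoding matrix. The function and argument names (sinusoidal_positional_encoding, max_len, d_model) are my own choices for illustration, not from the paper, and the sketch assumes an even d_model.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]        # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)                 # i = 0, 2, 4, ...
    # One frequency per sine/cosine pair: 1 / 10000^(i / d_model)
    angle_rates = 1.0 / np.power(10000.0, even_dims / d_model)
    angles = positions * angle_rates                     # shape (max_len, d_model / 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)              # (50, 512)
print(pe.min(), pe.max())    # all values stay within [-1, 1]
```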
Let’s break down why this works so well.
Sine and cosine functions always output values between -1 and 1, regardless of input. This means position 1 and position 1000 both have encodings in the same range. No more scaling issues: the neural network can handle these values comfortably across all positions, which also helps avoid the exploding or vanishing gradient problems that unbounded values can cause. Isn’t that fascinating?
The term 10000^(i/d_model) creates different frequencies for different dimensions. Lower dimensions oscillate rapidly (high frequency), while higher dimensions oscillate slowly (low frequency).
Think of this like a binary counter, but with smooth sinusoidal waves instead of discrete bits: different dimensions oscillate at different frequencies, so lower dimensions change with every position while higher dimensions change only gradually. This creates a unique “fingerprint” for each position.
For dimension 0, the wavelength is 2π (it changes rapidly with position). For the highest dimensions (i approaching d_model), the wavelength approaches 2π × 10000 (it changes very slowly).
This multi-scale representation means nearby positions have similar encodings, while far positions are distinguishable.
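A quick way to see this, reusing the sinusoidal_positional_encoding sketch above plus a small hypothetical cosine_similarity helper, is to compare encodings directly:

```python
import numpy as np

# Cosine similarity between two encoding vectors.
def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pe = sinusoidal_positional_encoding(max_len=1000, d_model=512)
print(cosine_similarity(pe[100], pe[101]))  # close to 1: neighbours look alike
print(cosine_similarity(pe[100], pe[500]))  # noticeably smaller: far positions are distinguishable
```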
Here’s the mathematical beauty: sinusoidal functions have a special property that allows the model to learn relative positions easily.
For any fixed offset k, the encoding at position (pos + k) can be represented as a linear transformation of the encoding at position pos:
PE(pos + k) = T × PE(pos)
where T is a transformation matrix that depends only on k, not on pos.
This comes from the angle addition formulas:
sin(α + β) = sin(α)cos(β) + cos(α)sin(β)
cos(α + β) = cos(α)cos(β) - sin(α)sin(β)
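For a single sine/cosine pair sharing the frequency ω = 1/10000^(i/d_model), these identities say exactly that a 2×2 rotation matrix built only from the offset k maps the pair at pos to the pair at pos + k; stacking one such block per frequency gives the full matrix T. Here is a small numerical check (my own illustrative sketch, not code from the paper):

```python
import numpy as np

d_model, i = 512, 10                       # an arbitrary even dimension index
omega = 1.0 / 10000 ** (i / d_model)       # frequency shared by dimensions i and i+1
pos, k = 37, 5

def pair(p):
    # The (sin, cos) pair for this frequency at position p.
    return np.array([np.sin(omega * p), np.cos(omega * p)])

# 2x2 rotation built from the angle-addition formulas; it depends only on k.
T = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

print(np.allclose(T @ pair(pos), pair(pos + k)))  # True
```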
What this means in practice: if the model learns that “words 3 positions apart tend to be related,” it can apply this learning uniformly across the entire sequence. The relationship between positions 5 and 8 is encoded the same way as between positions 50 and 53.
Because the encoding is a continuous function, the model can theoretically handle any position, even those longer than it saw during training. The sinusoidal function doesn’t suddenly break at position 513 just because training stopped at 512. The pattern continues smoothly.
In practice, there are still challenges with very long sequences, but sinusoidal encoding at least gives the model a fighting chance, whereas linear encoding would fail completely.
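One way to sanity-check this, again reusing the hypothetical helpers from the earlier snippets: because the similarity between two encodings depends only on their offset, positions far beyond a training cutoff of 512 behave exactly like positions inside it.

```python
pe = sinusoidal_positional_encoding(max_len=2000, d_model=512)
# Offset of 1, once inside and once far beyond a hypothetical training length of 512.
print(cosine_similarity(pe[100], pe[101]))    # some value close to 1
print(cosine_similarity(pe[1500], pe[1501]))  # the same value: the pattern simply continues
```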
Let’s visualize this with a simple example. Suppose we have a 4-dimensional embedding space (in reality, it’s much larger):
For position 0: [sin(0/1), cos(0/1), sin(0/100), cos(0/100)] = [0, 1, 0, 1]
For position 1: [sin(1/1), cos(1/1), sin(1/100), cos(1/100)] ≈ [0.841, 0.540, 0.010, 1.000]
Notice how the lower dimensions (0, 1) change significantly between positions, while higher dimensions (2, 3) change only slightly. This multi-resolution encoding captures both fine-grained and coarse positional information.
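Continuing from the earlier snippets, running the same illustrative sinusoidal_positional_encoding helper with d_model = 4 reproduces these numbers:

```python
pe = sinusoidal_positional_encoding(max_len=2, d_model=4)
print(np.round(pe[0], 3))  # [0.    1.    0.    1.   ]
print(np.round(pe[1], 3))  # [0.841 0.54  0.01  1.   ]
```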
The choice of 10000 as the base in the formula isn’t arbitrary. It’s chosen to create a geometric progression of wavelengths across dimensions that works well for typical sequence lengths in NLP tasks.
With this base, the wavelengths range from 2π (minimum) to approximately 20,000π (maximum) for a 512-dimensional model. This range is suitable for sequences of a few thousand tokens, which covers most practical use cases.
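If you want to see that progression, the wavelength of each sine/cosine pair is 2π × 10000^(i/d_model), which is easy to list (a short illustrative snippet under the same assumptions as above):

```python
import numpy as np

d_model = 512
i = np.arange(0, d_model, 2)                          # one wavelength per sin/cos pair
wavelengths = 2 * np.pi * np.power(10000.0, i / d_model)
print(wavelengths[0])     # ~6.28: 2*pi, the fastest-oscillating pair
print(wavelengths[-1])    # ~60600: approaching 10000 * 2*pi ≈ 62832, the slowest pair
```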
Sinusoidal positional encoding takes the simple requirement of “telling the model what order tokens appear in” and solves it with a mathematically principled approach that provides bounded values, smooth interpolation, learnable relative position relationships, and reasonable extrapolation to unseen sequence lengths.
The next time you use ChatGPT or any other Transformer-based model, remember that buried in those billions of parameters is a surprisingly simple sine wave helping the model understand that “cat chased mouse” is very different from “mouse chased cat.”
Thank you for reading!🤗 I hope that you found this article both informative and enjoyable to read.
For more AI content, follow me on LinkedIn and give me a clap 👏.


