After spending months studying transformer architectures and building LLM applications, I realized something: most explanations are either overwhelming or missing important details. This article is my attempt to bridge that gap — explaining transformers the way I wish someone had explained them to me.
For an introduction to what a large language model (LLM) is, refer to the article I published previously.
By the end of this lesson, you will be able to look at any LLM architecture diagram and understand what is happening.
This is not just academic knowledge — understanding the Transformer architecture will help you make better decisions about model selection, optimize your prompts, and debug issues when your LLM applications behave unexpectedly.
Don't worry if some of these terms sound unfamiliar—we'll explain each concept step by step, starting with the basics. By the end of this lesson, these technical terms will make perfect sense, even if you're new to machine learning architecture.
Let's start with a simple analogy. Imagine you're reading a book and trying to understand a sentence:
To understand this, your brain does several things:
A Transformer does something remarkably similar, but using math. Let me give you a simple explanation of how it works:
What goes in: Text broken into pieces (called tokens)
What's a token? Think of tokens as the basic building blocks that language models understand:
What happens inside: The model processes this text through several stages (we'll explore each in detail):
What comes out: Depends on what you need:
Think of a Transformer like an assembly line where each station refines the product. Raw materials (words) enter, each station adds something (position info, relationships, meaning), and the final product emerges more polished at each step.
Here's how text flows through a Transformer:
The diagram shows how a simple sentence like "The cat sat on the mat" gets processed through the transformer architecture - from tokenization to final output. The key steps include embedding the tokens into vectors, adding positional information, applying self-attention to understand relationships between words, and repeating the attention and processing steps multiple times to refine understanding.
Modern LLMs repeat the attention and processing steps many times:
Now let's walk through each step in detail, starting from the very beginning.
Before the model can process text, it needs to solve two problems: breaking text into pieces (tokenization) and converting those pieces into numbers (embeddings).
The Problem: How do you break text into manageable chunks? You might think "just split by spaces into words," but that's too simple.
Why not just use words?
Consider these challenges:
The solution: Subword Tokenization
Modern models break text into subwords - pieces smaller than words but larger than individual characters. Think of it like Lego blocks: instead of needing a unique piece for every possible structure, you reuse common blocks.
Simple example:
Text: "I am playing happily" Split by spaces (naive approach): ["I", "am", "playing", "happily"] Problem: Need separate entries for "play", "playing", "played", "player", "plays"... Subword tokenization (smart approach): ["I", "am", "play", "##ing", "happy", "##ly"] Better: Reuse "play" and "##ing" for "playing", "running", "jumping" Reuse "happy" and "##ly" for "happily", "sadly", "quickly"
Why this matters - concrete examples:
Real example of tokenization impact:
Input: "The animal didn't cross the street because it was tired" Tokens (what the model actually sees): ["The", "animal", "didn", "'", "t", "cross", "the", "street", "because", "it", "was", "tired"] Notice: - "didn't" → ["didn", "'", "t"] (split to handle contractions) - Each token gets converted to numbers (embeddings) next
The Problem: Computers don't understand tokens. They only work with numbers. So how do we convert "cat" into something a computer can process?
Before we dive in, let's understand what "dimensions" mean with a familiar example:
Describing a person in 3 dimensions:
These 3 numbers (dimensions) give us a mathematical way to represent a person. Now, what if we want to represent a word mathematically?
Describing a word needs way more dimensions:
To capture everything about the word "cat", we need hundreds of numbers:
Modern models use anywhere from 768 to over 8,000 dimensions because words are complex! But here's the key: you don't need to understand what each dimension represents. The model figures this out during training.
Let's walk through a concrete example:
```python
# This is a simplified embedding table (real ones have thousands of words)
# Each word maps to a list of numbers (a "vector")
embedding_table = {
    "cat":  [0.2, -0.5, 0.8, ..., 0.1],  # 768 numbers total
    "dog":  [0.3, -0.4, 0.7, ..., 0.2],  # Notice: similar to "cat"!
    "bank": [0.9, 0.1, -0.3, ..., 0.5],  # Very different from "cat"
}

# When we input a sentence:
sentence = "The cat sat"

# Step 1: Break into tokens
tokens = ["The", "cat", "sat"]

# Step 2: Look up each token's vector
embedded = [
    embedding_table["The"],  # Gets: [0.1, 0.3, ..., 0.2] (768 numbers)
    embedding_table["cat"],  # Gets: [0.2, -0.5, ..., 0.1] (768 numbers)
    embedding_table["sat"],  # Gets: [0.4, 0.2, ..., 0.3] (768 numbers)
]

# Result: We now have 3 vectors, each with 768 dimensions
# The model can now do math with these!
```
Where do these numbers come from? The embedding table isn't written by hand. Here's how it's created:
These embeddings capture word relationships mathematically:
When we say GPT-3 has 175 billion parameters, where are they? A significant chunk lives in the embedding table.
What happens in the embedding layer:
Example: If "cat" = token #847, the model looks up row #847 in its embedding table and retrieves a vector like [0.2, -0.5, 0.7, …] with hundreds or thousands of numbers. Each of these numbers is a parameter that was optimized during training.
This is why embeddings contain so much "knowledge" - they encode the meaning and relationships between words that the model learned from massive amounts of text.
The Problem: After converting words to numbers, we have another issue. Look at these two sentences:
They have the same words, just in different order. But right now, the model sees them as identical because it just has three vectors with no order information!
Real-world example:
Transformers process all words at the same time (unlike reading left-to-right), so we need to explicitly tell the model: "This is word #1, this is word #2, this is word #3."
Think of it like adding page numbers to a book. Each word gets a "position tag" added to its embedding.
Simple Example:
```python
# We have our word embeddings from Step 1:
word_embeddings = [
    [0.1, 0.3, 0.2, ...],   # "The" (768 numbers)
    [0.2, -0.5, 0.1, ...],  # "cat" (768 numbers)
    [0.4, 0.2, 0.3, ...],   # "sat" (768 numbers)
]

# Now add position information:
position_tags = [
    [0.0, 0.5, 0.8, ...],   # Position 1 tag (768 numbers)
    [0.2, 0.7, 0.4, ...],   # Position 2 tag (768 numbers)
    [0.4, 0.9, 0.1, ...],   # Position 3 tag (768 numbers)
]

# Combine them (add the numbers together):
final_embeddings = [
    [0.1+0.0, 0.3+0.5, 0.2+0.8, ...],   # "The" at position 1
    [0.2+0.2, -0.5+0.7, 0.1+0.4, ...],  # "cat" at position 2
    [0.4+0.4, 0.2+0.9, 0.3+0.1, ...],   # "sat" at position 3
]

# Now each word carries both:
# - What the word means (from embeddings)
# - Where the word is located (from position tags)
```
The original Transformer paper used a mathematical pattern based on sine and cosine waves. You don't need to understand the math; the key point is that every position gets its own distinctive pattern of numbers, which is simply added to the word's embedding.
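For the curious, here's a minimal NumPy sketch of that sine/cosine pattern, following the formulas from the original paper:

```python
import numpy as np

def position_tags(num_positions, d_model=768):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    positions = np.arange(num_positions)[:, None]            # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # one angle per (position, pair)
    tags = np.zeros((num_positions, d_model))
    tags[:, 0::2] = np.sin(angles)   # even dimensions get sine
    tags[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return tags

print(position_tags(3).shape)  # (3, 768): one unique "position tag" per token
```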
Newer models like Llama and Mistral use an improved approach called RoPE (Rotary Position Embeddings).
Simple analogy: Think of a clock face with moving hands:
```
Word at position 1: Clock hand at 12 o'clock (0°)
Word at position 2: Clock hand at 1 o'clock (30°)
Word at position 3: Clock hand at 2 o'clock (60°)
Word at position 4: Clock hand at 3 o'clock (90°)
...
```
How this connects to RoPE: Just like the clock hands rotate to show different times, RoPE literally rotates each word's embedding vector based on its position. Word 1 gets rotated 0°, word 2 gets rotated 30°, word 3 gets rotated 60°, and so on. This rotation encodes position information directly into the word vectors themselves.
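Here's a toy sketch of that rotation for a single pair of dimensions; real RoPE rotates many pairs per vector, each at a different speed, and the 30° step is just the clock-analogy number, not a real model's value:

```python
import numpy as np

def rotate_pair(pair, position, step_degrees=30):
    # Rotate a 2-number slice of a word's embedding by an angle that
    # grows with its position -- the "clock hand" from the analogy.
    theta = np.radians(position * step_degrees)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return rotation @ pair

pair = np.array([1.0, 0.0])           # two dimensions of some word's vector
print(rotate_pair(pair, position=0))  # word #1: rotated 0°  → [1.0, 0.0]
print(rotate_pair(pair, position=1))  # word #2: rotated 30° → [0.87, 0.5]
```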
Why this works:
Why this matters in practice:
Key takeaway: Position encoding ensures the model knows "The cat sat" is different from "sat cat The". Without this, word order would be lost!
This is the magic that makes Transformers work! Let's understand it with a story.
Imagine you're at a dinner party with 10 people. Someone mentions "Paris" and you want to understand what they mean:
Attention does exactly this for words in a sentence!
Let's process this sentence:
When the model processes the word "it", it needs to figure out: What does "it" refer to?
Step 1: The word "it" asks questions
Step 2: All other words offer information
Step 3: "it" calculates relevance scores
Step 4: "it" gathers information The model now knows: "it" = mostly "animal" + a bit of "tired" + tiny bit of others
The model creates three versions of each word:
The matching process:
```python
# Simplified example (real numbers would be 768-dimensional)

# Word "it" creates its Query:
query_it = [0.8, 0.3, 0.9]    # Looking for: subject, noun, living thing

# Word "animal" has this Key:
key_animal = [0.9, 0.4, 0.8]  # Offers: subject, noun, living thing

# How well do they match? Multiply and sum:
relevance = (0.8 × 0.9) + (0.3 × 0.4) + (0.9 × 0.8)
          = 0.72 + 0.12 + 0.72
          = 1.56  # High match!

# Compare with "street":
key_street = [0.1, 0.4, 0.2]  # Offers: not-subject, noun, non-living thing
relevance = (0.8 × 0.1) + (0.3 × 0.4) + (0.9 × 0.2)
          = 0.08 + 0.12 + 0.18
          = 0.38  # Lower match

# Convert to percentages (this is what "softmax" does):
# "animal" gets 45%, "street" gets 8%, etc.
```
You might see this formula in papers:
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
What it means in plain English:
Where it comes from: Researchers from Google Brain discovered in 2017 that this mathematical formula effectively models how words should pay attention to each other. It's inspired by information retrieval (like how search engines find relevant documents).
You don't need to memorize this! Just remember: attention = figuring out which words are related and gathering information from them.
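If you do want to see the formula as code, here's a minimal NumPy sketch of scaled dot-product attention with toy sizes and no learned weights:

```python
import numpy as np

def softmax(scores):
    # Turn raw relevance scores into percentages that sum to 1 (per row)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how well every Query matches every Key
    weights = softmax(scores)        # relevance percentages
    return weights @ V               # weighted mix of Value vectors

# Toy example: 3 tokens with 4-dimensional vectors (real models use 768+)
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 3, 4))
print(attention(Q, K, V).shape)      # (3, 4): one updated vector per token
```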
Let's see attention in action with actual numbers:
Sentence: "The animal didn't cross the street because it was tired"
When processing "it", the attention mechanism calculates:
```
Word          Relevance Score   What This Means
──────────────────────────────────────────────────────────────
"The"      →   2%               Article, not important
"animal"   →  45%               Main subject! Likely referent
"didn't"   →   3%               Verb helper, not the focus
"cross"    →   5%               Action, minor relevance
"the"      →   2%               Article again
"street"   →   8%               Object/location, somewhat relevant
"because"  →   2%               Connector word
"it"       →  10%               Self-reference (checking own meaning)
"was"      →   8%               Linking verb, somewhat relevant
"tired"    →  15%               State description, quite relevant
──────────────────────────────────────────────────────────────
Total      → 100%               (Scores sum to 100%)
```
Result: The model now knows "it" primarily refers to "animal" (45%), with some connection to being "tired" (15%). This understanding gets encoded into the updated representation of "it".
How does this actually update "it"? The model takes a weighted average of all words' Value vectors using these percentages:
```python
# Each word has a Value vector (what information it contains)
value_animal = [0.9, 0.2, 0.8]  # Contains: mammal, four-legged, animate
value_tired  = [0.1, 0.3, 0.9]  # Contains: state, adjective, fatigue
value_street = [0.2, 0.8, 0.1]  # Contains: place, concrete, inanimate
# ... (other words)

# Updated representation of "it" = weighted combination
new_it = (45% × value_animal) + (15% × value_tired) + (8% × value_street) + ...
       = (0.45 × [0.9, 0.2, 0.8]) + (0.15 × [0.1, 0.3, 0.9]) + ...
       = [0.52, 0.19, 0.61]

# Now "it" carries meaning from "animal" + "tired"
```
The word "it" now has a richer representation that includes information from "animal" (heavily weighted) and "tired" (moderately weighted), helping the model understand the sentence better.
Simple analogy: When you read a sentence, you notice multiple things simultaneously:
Multi-head attention lets the model do the same thing! Instead of one attention mechanism, models use 8 to 128 different attention "heads" running in parallel.
Example with the sentence "The fluffy dog chased the cat":
Important: These specializations aren't programmed! During training, different heads naturally learn to focus on different relationships. Researchers discovered this by analyzing trained models—it emerges automatically.
How they combine:
```python
# Each head produces its own understanding:
head_1_output = attention_head_1(text)  # Finds subject-verb
head_2_output = attention_head_2(text)  # Finds adjective-noun
head_8_output = attention_head_8(text)  # Finds other patterns

# Combine all heads into a rich understanding:
final_output = combine([head_1_output, head_2_output, ..., head_8_output])

# Now each word has information from all types of relationships!
```
Why this matters: Having multiple attention heads is like having multiple experts analyze the same text from different angles. The final result is much richer than any single perspective.
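Here's a simplified NumPy sketch of the "split into slices, attend per head, recombine" idea; real models also apply learned Q/K/V and output projections for each head, which are omitted here:

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads=8):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Each head looks at its own slice of every word's vector
        slice_h = x[:, h * d_head:(h + 1) * d_head]
        scores = slice_h @ slice_h.T / np.sqrt(d_head)
        head_outputs.append(softmax(scores) @ slice_h)
    # Combine all heads back into one rich representation per word
    return np.concatenate(head_outputs, axis=-1)

x = np.random.randn(6, 768)           # 6 tokens, 768 numbers each
print(multi_head_attention(x).shape)  # (6, 768)
```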
After attention gathers information, each word needs to process what it learned. This is where the Feed-Forward Network (FFN) comes in.
Simple analogy:
What happens:
After "it" gathered information that it refers to "animal" and relates to "tired", the FFN processes this:
```python
import numpy as np

# Simplified version: W_expand and W_compress stand in for the FFN's learned weights
def process_word(word_vector, W_expand, W_compress):
    # Step 1: Expand to more dimensions (gives more room to think)
    bigger = word_vector @ W_expand    # 768 numbers → 3072 numbers

    # Step 2: Apply a non-linear transformation (the "thinking")
    processed = np.maximum(0, bigger)  # ReLU here; real models use GELU or SwiGLU

    # Step 3: Compress back to original size
    result = processed @ W_compress    # 3072 numbers → 768 numbers

    return result
```
What's it doing? Let's trace through a concrete example using our sentence:
Example: Processing "it" in "The animal didn't cross the street because it was tired"
After attention, "it" has gathered information showing it refers to "animal" (45%) and relates to "tired" (15%). Now the FFN enriches this understanding:
Step 1 - What comes in:
Vector for "it" after attention: [0.52, 0.19, 0.61, ...] This already knows: "it" refers to "animal" and connects to "tired"
Step 2 - FFN adds learned knowledge:
Think of the FFN as having millions of pattern detectors (neurons) that learned from billions of text examples. When "it" enters with its current meaning, specific patterns activate:
```
Input pattern: word "it" + animal reference + tired state

FFN recognizes patterns:
- Pattern A activates: "Pronoun referring to living creature"
  → Strengthens living thing understanding
- Pattern B activates: "Subject experiencing fatigue"
  → Adds physical/emotional state concept
- Pattern C activates: "Reason for inaction"
  → Links tiredness to not crossing
- Pattern D stays quiet: "Object being acted upon"
  → Not relevant here
```
What the FFN is really doing: It's checking "it" against thousands of patterns it learned during training, like:
Step 3 - What comes out:
```
Enriched vector: [0.61, 0.23, 0.71, ...]
Now contains: pronoun role + animal reference + tired state + causal link (tired → didn't cross)
```
The result: The model now has a richer understanding: "it" isn't just referring to "animal"—it understands the animal is tired, and this tiredness is causally linked to why it didn't cross the street.
Here's another example showing how the FFN resolves ambiguity in word meanings:
Example - "bank":
Think of FFN as the model's "knowledge base" where millions of facts and patterns are stored in billions of network weights (the connections between neurons). Unlike attention (which gathers context from other words), FFN applies learned knowledge to that context.
It's the difference between:
Key insight:
Modern improvement: Newer models use something called "SwiGLU" instead of older activation functions. It provides better performance, but the core idea remains: process the gathered information to extract deeper meaning.
These might sound technical, but they solve simple problems. Let me explain with everyday analogies.
The Problem: Imagine you're editing a document. You make 96 rounds of edits. By round 96, you've completely forgotten what the original said! Sometimes the original information was important.
The Solution: Keep a copy of the original and mix it back in after each edit.
In the Transformer:
```python
# Start with a word's representation
original = [0.2, 0.5, 0.8, ...]   # "cat" representation

# After attention + processing, we get changes
changes = [0.1, -0.2, 0.3, ...]   # What we learned

# Residual connection: Keep the original + add changes
final = original + changes
      = [0.2+0.1, 0.5-0.2, 0.8+0.3, ...]
      = [0.3, 0.3, 1.1, ...]      # Original info preserved!
```
Better analogy: Think of editing a photo:
Why this matters: Deep networks (96-120 layers) need this. Otherwise, information from early layers disappears by the time you reach the end.
The Problem: Imagine you're calculating daily expenses:
The huge number breaks everything.
The Solution: After each step, check if numbers are getting too big or too small, and adjust them to a reasonable range.
What normalization does:
Before normalization:
```
Word vectors might be:
"the": [0.1, 0.2, 0.3, ...]
"cat": [5.2, 8.9, 12.3, ...]       ← Too big!
"sat": [0.001, 0.002, 0.001, ...]  ← Too small!
```
After normalization:
"the": [0.1, 0.2, 0.3, ...] "cat": [0.4, 0.6, 0.8, ...] ← Scaled down to reasonable range "sat": [0.2, 0.4, 0.1, ...] ← Scaled up to reasonable range
How it works (simplified):
```python
import numpy as np

# For each word's vector:
def normalize(vector):
    # 1. Calculate the average and spread of its numbers
    average = vector.mean()  # e.g. 5.0
    spread = vector.std()    # e.g. 3.0

    # 2. Adjust so average = 0, spread = 1
    return (vector - average) / spread

# Now all numbers are in a similar range!
```
Why this matters:
Key takeaway: These two tricks (residual connections + normalization) are like safety features in a car—they keep everything running smoothly even when the model gets very deep (many layers).
Transformers come in three varieties, like three different tools in a toolbox. Each is designed for specific jobs.
Think of it like: A reading comprehension expert who thoroughly understands text but can't write new text.
How it works: Sees the entire text at once, looks at relationships in all directions (words can look both forward and backward).
Training example:
Show it: "The [MASK] sat on the mat" It learns: "The cat sat on the mat" By filling in blanks, it learns deep understanding!
Real-world uses:
Popular models: BERT, RoBERTa (used by many search engines)
Key limitation: Can understand and classify text, but cannot generate new text. It's like a reading expert who can't write.
Think of it like: A creative writer who generates text one word at a time, always building on what came before.
How it works: Processes text from left to right. Each word can only "see" previous words, not future ones (because future words don't exist yet during generation!).
Training example:
Show it: "The cat sat on the" It learns: Next word should be "mat" (or "floor", "chair", etc.) By predicting next words billions of times, it learns to write!
Why only look backward? Because when generating text, future words don't exist yet—you can only use what you've written so far. It's like writing a story one word at a time: after "The cat sat on the", you can only look back at those 5 words to decide what comes next.
When predicting "sat": Can see: "The", "cat" ← Use these to predict Cannot see: "on", "the", "mat" ← Don't exist yet during generation
Real-world uses:
- Code completion: write def calculate_ → it suggests the rest

Popular models: GPT-4, Claude, Llama, Mistral (basically all modern chatbots)
Why this is dominant: These models can both understand AND generate, making them incredibly versatile. This is what you use when you chat with AI.
Think of it like: A two-person team: one person reads and understands (encoder), another person writes the output (decoder).
How it works:
Training example:
```
Input (to encoder):    "translate English to French: Hello world"
Output (from decoder): "Bonjour le monde"

Encoder understands English, Decoder writes French!
```
Real-world uses:
Popular models: T5, BART (less common nowadays)
Why less popular now: Decoder-only models (like GPT) turned out to be more versatile—they can do translation AND chatting AND coding, all in one architecture. Encoder-decoder models are more specialized.
Need to understand/classify text? → Encoder (BERT)
Need to generate text? → Decoder (GPT)
Need translation/summarization only? → Encoder-Decoder (T5)
Not sure? → Use Decoder-only (GPT-style)
Bottom line: If you're building something today, you'll most likely use a decoder-only model (like GPT, Claude, Llama) because they're the most flexible and powerful.
Now that you understand the components, let us see how they scale:
As models grow from small to large, here's what changes:
| Component | Small (125M params) | Medium (7B params) | Large (70B params) |
|----|----|----|----|
| Layers (depth) | 12 | 32 | 80 |
| Hidden size (vector width) | 768 | 4,096 | 8,192 |
| Attention heads | 12 | 32 | 64 |
Key insights:
1. Layers (depth) - This is how many times you repeat Steps 3 & 4
Example: Processing "it" in our sentence:
2. Hidden size (vector width) - How many numbers represent each word
3. Attention heads - How many different perspectives each layer examines
Where do the parameters live?
Surprising fact: The Feed-Forward Network (FFN) actually takes up most of the model's parameters, not the attention mechanism!
Why? In each layer:
In large models, FFN parameters outnumber attention parameters by 3-4x. That's where the "knowledge" is stored!
Simple explanation: Every word needs to look at every other word. If you have N words, that's N × N comparisons.
Concrete example:
3 words: "The cat sat" - "The" looks at: The, cat, sat (3 comparisons) - "cat" looks at: The, cat, sat (3 comparisons) - "sat" looks at: The, cat, sat (3 comparisons) Total: 3 × 3 = 9 comparisons 6 words: "The cat sat on the mat" - Each of 6 words looks at all 6 words Total: 6 × 6 = 36 comparisons (4x more for 2x words!) 12 words: Total: 12 × 12 = 144 comparisons (16x more for 4x words!)
The scaling problem:
| Sentence Length | Attention Calculations | Growth Factor |
|----|----|----|
| 512 tokens | 262,144 | 1x |
| 2,048 tokens | 4,194,304 | 16x more |
| 8,192 tokens | 67,108,864 | 256x more |
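You can reproduce these numbers in a couple of lines of Python:

```python
# Attention compares every token with every token: N × N comparisons
for n in (512, 2_048, 8_192):
    comparisons = n * n
    print(f"{n:>5} tokens → {comparisons:>12,} comparisons ({comparisons // (512 * 512)}x)")
```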
Why this matters: Doubling the length doesn't double the work—it quadruples it! This is why:
Solutions being developed:
These tricks help models handle longer texts without paying the full quadratic cost!
Important: This diagram represents the universal Transformer architecture. All Transformer models (BERT, GPT, T5) follow this basic structure, with variations in how they use certain components.
Let's walk through the complete flow step by step:
Let's trace "The cat sat" through this architecture:
Step 1: Input Tokens
Your text: "The cat sat" Tokens: ["The", "cat", "sat"]
Step 2: Embeddings + Position
"The" → [0.1, 0.3, ...] + position_1_tag → [0.1, 0.8, ...] "cat" → [0.2, -0.5, ...] + position_2_tag → [0.4, -0.2, ...] "sat" → [0.4, 0.2, ...] + position_3_tag → [0.8, 0.5, ...] Now each word is a 768-number vector with position info!
Step 3: Through N Transformer Layers (repeated 12-120 times)
Each layer does this:
Step 4a: Multi-Head Attention
- Each word looks at all other words
- "cat" realizes it's the subject
- "sat" realizes it's the action "cat" does
- Words gather information from related words
Step 4b: Add & Normalize
- Add original vector back (residual connection)
- Normalize numbers to reasonable range
- Keeps information stable
Step 4c: Feed-Forward Network
- Process the gathered information
- Apply learned knowledge
- Each word's vector gets richer
Step 4d: Add & Normalize (again)
- Add vector from before FFN (another residual)
- Normalize again
- Ready for next layer!
After going through all N layers, each word's representation is incredibly rich with understanding.
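To tie Steps 4a-4d together, here's a minimal NumPy sketch of one complete layer: attention (without learned projections), add & normalize, a ReLU feed-forward network, and add & normalize again. Real layers add learned weight matrices everywhere:

```python
import numpy as np

def normalize(x):
    # Keep each word's numbers in a reasonable range (mean 0, spread 1)
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-5)

def transformer_layer(x, W_expand, W_compress):
    # Step 4a + 4b: self-attention, then add (residual) & normalize
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    x = normalize(x + weights @ x)
    # Step 4c + 4d: feed-forward network, then add (residual) & normalize
    ffn = np.maximum(0, x @ W_expand) @ W_compress
    return normalize(x + ffn)

# Toy run: 3 tokens, 768-dim vectors, 3072-dim FFN expansion
x = np.random.randn(3, 768)
out = transformer_layer(x,
                        np.random.randn(768, 3072) * 0.02,
                        np.random.randn(3072, 768) * 0.02)
print(out.shape)  # (3, 768): same shape, richer content, ready for the next layer
```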
Step 5: Linear + Softmax
```
Take the final word's vector: [0.8, 0.3, 0.9, ...]

Convert to predictions for EVERY word in vocabulary (50,000 words):
"the"   →  5%
"a"     →  3%
"on"    → 15%  ← High probability!
"mat"   → 12%
"floor" →  8%
...
(All probabilities sum to 100%)
```
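Here's a tiny sketch of that final softmax step, using a made-up 5-word vocabulary and made-up scores instead of the real 50,000-word vocabulary:

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

vocab = ["the", "a", "on", "mat", "floor"]
scores = np.array([1.2, 0.7, 2.8, 2.5, 2.1])  # raw scores from the final linear layer

for word, prob in zip(vocab, softmax(scores)):
    print(f"{word!r:8} → {prob:.0%}")
# "on" gets the highest probability, so it's the most likely next word
```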
Step 6: Output
```
Pick the most likely word: "on"
Complete sentence so far: "The cat sat on"

Then repeat the whole process to predict the next word!
```
Now that you've seen the complete flow, here's how each model type uses it differently:
1. Encoder-Only (BERT):
2. Decoder-Only (GPT, Claude, Llama):
3. Encoder-Decoder (T5):
Uses: TWO stacks - one encoder (steps 1-4), one decoder (full steps 1-6)
Encoder: Bidirectional attention to understand input
Decoder: Causal attention to generate output, also attends to encoder
Training: Input→output mapping ("translate: Hello" → "Bonjour")
Purpose: Translation, summarization, transformation tasks
The key difference: Same architecture blocks, different attention patterns and how they're connected!
It's a loop: For generation, this process repeats. After predicting "on", the model adds it to the input and predicts again.
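Here's a minimal sketch of that loop. The predict_next_token function below is a toy stand-in for running the whole Transformer (Steps 1 to 6) once, and its lookup table is invented purely for illustration:

```python
# Toy stand-in for a full Transformer forward pass (invented for illustration)
continuations = {"sat": "on", "on": "the", "the": "mat"}

def predict_next_token(tokens):
    return continuations.get(tokens[-1], "<end>")

def generate(prompt_tokens, steps=3):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        tokens.append(predict_next_token(tokens))  # feed each prediction back in
    return tokens

print(generate(["The", "cat", "sat"]))  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```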
The "N" matters:
This is universal: Whether you're reading a research paper about a new model or trying to understand GPT-4, this diagram applies. The core architecture is the same!
Understanding the architecture helps you make better decisions:
The context window is not just a number—it is a hard architectural limit. A model trained on 4K context cannot magically understand 100K tokens without modifications (RoPE interpolation, fine-tuning, etc.).
Tokens at the beginning and end of context often get more attention (primacy and recency effects). If you have critical information, consider its placement in your prompt.
Early layers capture syntax and basic patterns. Later layers capture semantics and complex reasoning. This is why techniques like layer freezing during fine-tuning work—early layers transfer well across tasks.
Attention compute grows quadratically with prompt length, so every extra token adds more cost than the last. Be concise when you can.