Positional encoding
Telling the model where each token sits
In the last two sections we walked through attention and multi-head attention without ever telling the model which token came first. That was a deliberate fudge: the math you saw treats its input sequence as a bag. Permute the input tokens and you get the same outputs, permuted the same way. “The cat sat on the mat” and “the mat sat on the cat” would produce the same attention weights between each pair of tokens, just reordered, which is clearly not what you want.
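To make the "bag" claim concrete, here is a minimal sketch in plain Python. It assumes a stripped-down self-attention in which the query, key, and value are all just the token vector itself (no learned projections); the helper names are illustrative, not taken from any library. Real projections act on each token separately, so the same property should hold with them too.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(tokens):
    """Position-free self-attention; q = k = v = the token vector (a simplification)."""
    scale = math.sqrt(len(tokens[0]))
    outputs = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in tokens])
        # Weighted sum of the value vectors (here, the token vectors themselves).
        mixed = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(len(q))]
        outputs.append(mixed)
    return outputs

random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
perm = [3, 0, 4, 1, 2]  # an arbitrary reordering of the five tokens

out = self_attention(tokens)
out_perm = self_attention([tokens[i] for i in perm])

# Row i of the shuffled run should match row perm[i] of the original run.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(same)
```

Nothing in `self_attention` ever looks at an index, which is why shuffling the inputs can only shuffle the outputs.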
So before any real model starts running, we have to inject position information explicitly. That’s the job of positional encoding.
The high-level idea is the same in every variant: each position in the sequence (0, 1, 2, …) gets associated with some pattern of numbers, and that pattern is mixed into the token’s representation so the network can tell positions apart. There are a few ways to do this, and which one you pick affects how the model generalizes to longer sequences than it saw during training.
Approach 1: Sinusoidal (the original Transformer)
The “Attention Is All You Need” paper added a fixed, hand-crafted pattern to each token’s embedding:
Each position gets a vector of sines and cosines at exponentially-spaced frequencies. Position 0 looks one way, position 1 looks slightly different, position 100 looks very different. The vector is added to the token embedding before the first layer.
This worked, but it has a problem: it doesn’t generalize well past the training context length. If the model was trained on sequences of 2k tokens, feeding it a 4k-token sequence still works mechanically, but the model has never seen the position vectors beyond 2k and behaves unpredictably there.
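The sinusoidal pattern can be sketched in a few lines. This assumes the usual formulation from the paper (sine in even slots, cosine in odd slots, base 10000); the function names are illustrative.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Position vector of length d_model: sin in even slots, cos in odd slots."""
    vec = []
    for i in range(0, d_model, 2):
        # i is the even slot index, so the exponent i / d_model equals 2k / d_model
        # for pair number k, giving geometrically spaced frequencies.
        angle = position / (base ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

def add_position(token_embedding, position):
    """Add the position vector to a token embedding before the first layer."""
    pe = sinusoidal_encoding(position, len(token_embedding))
    return [e + p for e, p in zip(token_embedding, pe)]

pe0 = sinusoidal_encoding(0, 8)
pe1 = sinusoidal_encoding(1, 8)
pe100 = sinusoidal_encoding(100, 8)
print(pe0)
print(pe1)
print(pe100)
```

The function happily returns a vector for any position, including ones far past the training length. The trouble is not computing the vector; it is that the model never learned what to do with it.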
Approach 2: Learned positional embeddings
GPT-2 used a much simpler scheme: a second embedding matrix, one row per position, learned alongside everything else. Same problem, more so: the table has a fixed number of rows, so positions past the training length either have no row at all or, if extra rows were allocated, vectors that were never trained.
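A rough sketch of the lookup, with a toy table standing in for the trained matrix. `MAX_POSITIONS`, `D_MODEL`, and the function name are hypothetical values chosen for illustration, and the rows are only randomly initialised here rather than trained.

```python
import random

random.seed(0)
MAX_POSITIONS = 8  # hypothetical training context length
D_MODEL = 4        # hypothetical embedding width

# One row per position. In a real model these rows are learned during training.
position_table = [
    [random.gauss(0, 0.02) for _ in range(D_MODEL)]
    for _ in range(MAX_POSITIONS)
]

def add_learned_position(token_embedding, position):
    """Look up the row for `position` and add it to the token embedding."""
    if not 0 <= position < len(position_table):
        raise IndexError(
            f"position {position} has no row; table covers 0..{len(position_table) - 1}"
        )
    row = position_table[position]
    return [e + p for e, p in zip(token_embedding, row)]

print(add_learned_position([0.0] * D_MODEL, 3))
```

Asking for position 8 in this toy setup raises an `IndexError`, which is the "no row at all" case in its most literal form.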
Approach 3: RoPE (what almost everyone uses now)
The dominant scheme in modern open-weight LLMs (Llama, Mistral, Qwen, DeepSeek) is Rotary Position Embeddings (RoPE). It works differently from the first two approaches. The token embedding itself is not modified. Instead, the position information is mixed in later, inside the attention computation: specifically, by rotating the query and key vectors before they’re dot-producted.
Recall from section 4 that for every token, attention computes a query q and a key k, vectors of, say, 128 floats each per head. RoPE’s trick is to look at q as a sequence of pairs of consecutive floats: (q0, q1), (q2, q3), …, (q126, q127), so 64 pairs in total. Each pair is treated as a 2-D arrow lying in its own little plane, and that arrow gets rotated by an angle that depends on the token’s position in the prompt. The same rotation is applied to that token’s key.
Different pairs rotate at very different rates. The first pair spins fast — many full rotations across just a few hundred positions. The last pair barely budges, even across thousands of positions. The combined “fingerprint” of these fast and slow rotations is what carries position information.
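Here is a minimal sketch of the rotation, assuming the consecutive-pair layout described above and the common base of 10000. Some implementations pair element i with element i + d/2 instead, which changes the layout but not the idea. The function names are illustrative.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Rotation rate in radians per position for each consecutive pair of floats."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * frequency_i."""
    out = []
    for i, freq in enumerate(rope_frequencies(len(vec), base)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2-D rotation of the arrow (x, y).
        out.extend([x * c - y * s, x * s + y * c])
    return out

freqs = rope_frequencies(128)
# Full turns completed by the fastest and slowest pair after 2,000 positions.
fast_turns = 2000 * freqs[0] / (2 * math.pi)
slow_turns = 2000 * freqs[-1] / (2 * math.pi)
print(f"fastest pair: {fast_turns:.1f} turns, slowest pair: {slow_turns:.3f} turns")
```

Because every pair is only rotated, the length of the vector is unchanged; position is carried entirely in the angles.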
[Interactive figure: a slider sets the token’s position in the prompt, from 0 (the first token) to 2,000 (a token deep into a long context). Each circle is one query/key pair. The leftmost circles (fast pairs) sweep wildly as the position changes, while the rightmost (slow pairs) barely move.]
The payoff comes from how attention later uses these rotated vectors. Attention scores are dot products of queries and keys. After RoPE, position enters that dot product only through the difference in rotation angles, which is determined by the difference in the two tokens’ positions. So RoPE bakes in relative position information automatically, with no learned parameters and no stored “this is position 17” vectors anywhere.
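The relative-position property is easy to check numerically. The sketch below assumes the consecutive-pair rotation with base 10000 and uses random vectors as stand-ins for a real query and key; `score` is an illustrative helper, not a library function.

```python
import math
import random

def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of floats by an angle proportional to position."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = position * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, q_pos, k_pos):
    """Unnormalised attention score between a rotated query and a rotated key."""
    q_rot, k_rot = apply_rope(q, q_pos), apply_rope(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))

random.seed(0)
q = [random.gauss(0, 1) for _ in range(16)]
k = [random.gauss(0, 1) for _ in range(16)]

near = score(q, k, q_pos=105, k_pos=100)              # 5 apart, early in the prompt
far = score(q, k, q_pos=100_005, k_pos=100_000)       # 5 apart, deep into the prompt
other_gap = score(q, k, q_pos=107, k_pos=100)         # 7 apart
print(near, far, other_gap)
```

The first two scores should agree up to floating-point error, because both pairs of tokens are 5 positions apart. The third uses a different gap and should come out different.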
Three things fall out of this:
- It generalizes better to longer contexts. Because the dot product only sees the difference in rotation, the model isn’t memorizing “position 17 looks like this.” It’s reading “you are 5 tokens apart”, and that’s the same kind of signal whether you’re at position 100 or position 100,000.
- No new parameters. RoPE is fixed math; there’s nothing to learn. The model only has to learn to use it.
- Extending context is tractable. Several tricks (NTK-aware scaling, YaRN, position interpolation) let you take a model trained at 8k context and extend it to 128k or more by rescaling the positions or adjusting the frequency base of the rotations. This is why so many models now ship in “long context” variants without full retraining.
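Two of these tricks are simple enough to sketch. Position interpolation divides positions by the extension factor before rotating. For NTK-aware scaling, the rule used below (multiply the base by scale^(d / (d - 2))) is one commonly cited formulation and should be read as an assumption, not as what any particular model ships. YaRN adds further steps and is not shown. All names and constants are illustrative.

```python
import math

def rope_angles(position, head_dim, base=10000.0):
    """Rotation angle of every pair at the given position."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def interpolated_angles(position, head_dim, scale, base=10000.0):
    """Position interpolation: shrink positions so they land inside the trained range."""
    return rope_angles(position / scale, head_dim, base)

def ntk_scaled_base(head_dim, scale, base=10000.0):
    """Assumed NTK-aware rule: enlarge the base so the slowest pair is slowed by
    `scale` while the fastest pair keeps its original rate."""
    return base * scale ** (head_dim / (head_dim - 2))

HEAD_DIM = 128
TRAINED_LEN = 8192   # 8k training context
SCALE = 16           # target: 8k -> 128k

# With interpolation, the last position of the long context gets the same
# angles the model saw at the last position of its training context.
pi_angles = interpolated_angles(TRAINED_LEN * SCALE, HEAD_DIM, SCALE)
trained_angles = rope_angles(TRAINED_LEN, HEAD_DIM)

# With the scaled base, compare per-position rotation rates pair by pair.
plain_rates = rope_angles(1, HEAD_DIM)
ntk_rates = rope_angles(1, HEAD_DIM, ntk_scaled_base(HEAD_DIM, SCALE))
print(plain_rates[0], ntk_rates[0])
print(plain_rates[-1], ntk_rates[-1])
```

The contrast to look for: interpolation compresses every pair by the same factor, while the scaled base leaves the fastest pair alone and stretches the slowest pair by the full factor.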
Where positional encoding actually lives in the pipeline
The three approaches plug into the pipeline at different points.
Sinusoidal and learned PE happen once, at the very top, before the first transformer layer:
hidden = embed(token_ids) + positional_encoding(positions)
The position-augmented vectors then flow through every layer, and position is never injected again.
RoPE happens in every layer, inside attention, after queries and keys have been projected:
q = project_q(hidden); q = apply_rope(q, position)
k = project_k(hidden); k = apply_rope(k, position)
# v is not rotated
Why no rotation on values? Because the attention score (the part that decides who attends to whom) only involves q and k. The values are just the “content” that gets mixed in once the weights are decided, and there’s no need to position-encode them.
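Putting the RoPE placement together, here is a rough single-head sketch. It assumes small random matrices as stand-ins for the learned projections, no causal mask, and the consecutive-pair rotation with base 10000; every name is illustrative. The point is that position changes the output even though the values are never rotated.

```python
import math
import random

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def apply_rope(vec, position, base=10000.0):
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = position * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_attention(hidden, positions, w_q, w_k, w_v):
    """Single-head, non-causal attention with RoPE on queries and keys only."""
    qs = [apply_rope(matvec(w_q, h), p) for h, p in zip(hidden, positions)]
    ks = [apply_rope(matvec(w_k, h), p) for h, p in zip(hidden, positions)]
    vs = [matvec(w_v, h) for h in hidden]  # values are not rotated
    scale = math.sqrt(len(qs[0]))
    outputs = []
    for q in qs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
        outputs.append([sum(w * v[j] for w, v in zip(weights, vs)) for j in range(len(vs[0]))])
    return outputs

def rand_matrix(d):
    return [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]

random.seed(0)
D = 8
w_q, w_k, w_v = rand_matrix(D), rand_matrix(D), rand_matrix(D)
hidden = [[random.gauss(0, 1) for _ in range(D)] for _ in range(4)]

out = rope_attention(hidden, [0, 1, 2, 3], w_q, w_k, w_v)
out_shifted = rope_attention(hidden, [10, 11, 12, 13], w_q, w_k, w_v)  # same gaps
out_reversed = rope_attention(hidden, [3, 2, 1, 0], w_q, w_k, w_v)     # different gaps
print(out[0])
print(out_reversed[0])
```

Shifting every position by the same amount should leave the output unchanged up to rounding, since only the gaps matter. Reversing the positions changes the gaps, so the attention weights and therefore the outputs change, with the values themselves left untouched.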
We’ve now closed the loop on the cross-token mixing half of a transformer block: attention (with positional information now properly accounted for) lets every token gather information from every other relevant token. The other half of every block is much simpler — it just transforms each token’s vector independently, in a per-position feed-forward network called the MLP. That’s next.