Attention Mechanism

Why This Matters

Attention is the core innovation that makes modern LLMs work. It's the mechanism that lets a model look at all parts of the input simultaneously and decide which parts are most relevant to each other.

The Intuition

Imagine reading a long document and highlighting the most important parts for answering a specific question. Attention does this automatically — for every position in the text, it computes a "relevance score" against every other position.

In the sentence "The cat sat on the mat because it was tired", attention lets the model figure out that "it" refers to "cat" (not "mat") by computing high attention scores between "it" and "cat".

How It Works

The key equation: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Where:

Q (Query): "What am I looking for?"
K (Key): "What do I contain?"
V (Value): "What information do I provide?"

Think of it like a library search:

You have a query (your question)
Each book has a key (its catalog entry)
You match your query against all keys to find relevant books
You read the values (actual content) of the matched books

Multi-Head Attention

Instead of one attention computation, transformers use 8-96 "heads" in parallel. Each head learns to attend to different types of relationships:

Head 1 might track syntax (subject-verb agreement)
Head 2 might track coreference ("it" → "cat")
Head 3 might track positional patterns

Self-Attention vs Cross-Attention

Self-attention: Input attends to itself (used in encoders and decoders)
Cross-attention: Output attends to input (used in encoder-decoder models for translation)