Imagine you walk into a massive library looking for a book about black holes. You walk up to the front desk and describe what you want: “I need something about the physics of black holes.” The librarian listens to your description, compares it against every book’s title and summary in the catalog, and then hands you the three most relevant books.
That is exactly how the attention mechanism works in a Transformer:
The librarian compares your Query against every Key, scores the match, and then gives you a weighted blend of the most relevant Values. The stronger the match between your Query and a Key, the more of that book’s Value you receive.
This one mechanism — repeated billions of times inside models like GPT, Claude, and Gemini — is what lets them understand context, resolve pronouns, and reason about meaning.
The core formula for Scaled Dot-Product Attention is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Let us decode each piece:
| Symbol | Meaning | Library Analogy |
|---|---|---|
| $Q$ | Query matrix | Your search description |
| $K$ | Key matrix | Every book’s catalog entry |
| $V$ | Value matrix | The actual book contents |
| $K^T$ | Transpose of $K$ | Rearranging the catalog for comparison |
| $d_k$ | Dimension of the Key vectors | Length of each catalog entry |
| $\sqrt{d_k}$ | Square root of $d_k$ | A scaling factor to keep scores stable |
| softmax | Normalization function | Converting match scores into percentages |
Do not worry if this looks abstract. We will walk through every single operation with real numbers. By the end of this post, each symbol in that formula will feel intuitive.
The key idea: we compare what each word is looking for ($Q$) with what every other word offers ($K$), then collect the actual information ($V$) from the most relevant words.
Neural networks do not understand text. They understand numbers. So before attention can do anything, we need to convert words into vectors — lists of numbers that capture meaning.
Let us use a simple sentence: “I love AI”
In a real model, each word becomes a vector of 512 or 1024 numbers. For our walkthrough, we will use tiny 4-dimensional vectors so we can do the math by hand:
Think of each dimension as a different “feature detector.” Dimension 1 might respond to pronouns. Dimension 2 might respond to verbs. Dimension 3 might respond to technical terms. The exact interpretation does not matter — the model learns these representations during training.
We stack these three vectors into an input matrix $X$ of shape $3 \times 4$ (3 words, 4 dimensions each):
Each row is a word. Each column is a dimension. This matrix is our starting point.
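As a minimal sketch of this stacking step, the snippet below builds an input matrix from a lookup table. The embedding values are hypothetical placeholders chosen for illustration; they are not the vectors used in this walkthrough or in any trained model.

```python
# Hypothetical 4-dimensional embeddings (illustrative values only).
embeddings = {
    "I":    [1.0, 0.0, 1.0, 0.0],
    "love": [0.0, 2.0, 0.0, 2.0],
    "AI":   [1.0, 1.0, 1.0, 1.0],
}

tokens = "I love AI".split()

# One row per word, one column per dimension.
X = [embeddings[token] for token in tokens]

n_words, n_dims = len(X), len(X[0])
print(n_words, n_dims)  # 3 4
```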
In a real model, these embeddings are learned from massive datasets, not set by hand. But the principle is the same: every word becomes a row of numbers.
A single set of embeddings is not enough. The attention mechanism needs three different “views” of the input — one for searching, one for being searched, and one for delivering content.
We create these three views by multiplying $X$ with three separate weight matrices:

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

Think of each weight matrix as a lens that projects the input into a different space. $W_Q$ extracts “what am I looking for?” information. $W_K$ extracts “what do I contain?” information. $W_V$ extracts “what information should I share?” information.
In our example, we want $Q$, $K$, and $V$ vectors of size 3 (so $d_k = 3$). Each weight matrix has shape $4 \times 3$ (mapping from 4 input dimensions to 3 output dimensions). Here are our weight matrices:
Let us compute $Q = XW_Q$. For the first row (“I”):
Doing this for all three words:
Similarly, $K = XW_K$:
And $V = XW_V$:
In a real Transformer, these weight matrices are not set by hand. They are learned during training — the model discovers the best projections automatically by seeing billions of examples.
Notice how the same input produces three different representations ($Q$, $K$, and $V$) depending on which lens we look through.
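The projection step can be sketched in plain Python as three matrix multiplications. Every number below is an assumption made for illustration: the input `X` and the weight matrices `W_Q`, `W_K`, `W_V` are hypothetical, and `matmul` is a small helper written for this sketch rather than a library function.

```python
def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical input embeddings: 3 words x 4 dimensions.
X = [
    [1, 0, 1, 0],
    [0, 2, 0, 2],
    [1, 1, 1, 1],
]

# Hypothetical weight matrices, each 4 x 3 (4 input dimensions -> d_k = 3).
W_Q = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
W_K = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
W_V = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]

# The same input, seen through three different "lenses".
Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)

print(len(Q), len(Q[0]))  # 3 3
```

Each result has one row per word and $d_k = 3$ columns, which is the shape the scoring step expects.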
Now comes the heart of attention: comparing every word’s Query against every word’s Key. We do this by computing $QK^T$ (the Query matrix times the transpose of the Key matrix).
First, let us transpose $K$. Transposing means swapping rows and columns:
Rows become columns. The element at position (i,j) moves to (j,i).
The diagonal elements (0, 1, 1) stay put while the off-diagonal pairs swap places.
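A transpose is a one-liner in Python. In the sketch below, only the diagonal (0, 1, 1) comes from the text; the off-diagonal entries of `K` are hypothetical values picked so the swap is easy to see.

```python
def transpose(M):
    """Swap rows and columns: the element at (i, j) moves to (j, i)."""
    return [list(col) for col in zip(*M)]

# Hypothetical Key matrix; only its diagonal (0, 1, 1) is taken from the text.
K = [
    [0, 5, -2],
    [0, 1, 2],
    [1, 1, 1],
]

K_T = transpose(K)
for row in K_T:
    print(row)
# [0, 0, 1]
# [5, 1, 1]
# [-2, 2, 1]
```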
Now we multiply $Q$ ($3 \times 3$) by $K^T$ ($3 \times 3$) to get a $3 \times 3$ score matrix. Each cell tells us how much one word should “pay attention” to another.
For “I” attending to “love”: dot product of $Q$’s first row with $K^T$’s second column:
Multiplying the element pairs one by one and keeping a running total gives the final score of 5.
The word “I” scores “love” a 5 — the highest score in the entire matrix. This means “I” finds “love” the most relevant word in the sentence. That makes linguistic sense: “I” is a pronoun that needs context, and “love” tells us what “I” is doing.
Computing the full score matrix:
Key observations:
Each element of the Query vector pairs with the corresponding element of the Key vector, they multiply, and the results sum to produce a single attention score.
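The scoring step can be sketched as follows. The `Q` and `K` values are hypothetical: they were chosen so that “I” attending to “love” scores 5 and the “love” row is uniform, matching what the text describes, but the individual entries and the remaining scores are assumptions.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

words = ["I", "love", "AI"]

# Hypothetical Query and Key matrices (one row per word, d_k = 3).
Q = [[0, 1, 2], [1, 1, 1], [1, 0, 1]]
K = [[0, 5, -2], [0, 1, 2], [1, 1, 1]]

# scores[i][j] is the dot product of word i's Query with word j's Key.
scores = matmul(Q, transpose(K))

for word, row in zip(words, scores):
    print(word, row)
# I [1, 5, 3]
# love [3, 3, 3]
# AI [-2, 2, 2]
```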
Before we move on, let us visualize these scores as a heatmap. Each cell’s color intensity represents how strongly one word attends to another.
The heatmap makes the attention pattern immediately visible. The brightest cell is “I” attending to “love” — that score of 5 dominates. “love” is a uniform gray because it distributes attention equally. This visual pattern tells us something about the sentence structure: the pronoun reaches out to the verb, while the verb stands in the middle connecting everything.
Raw scores can grow very large. In our tiny example the maximum is 5, but in a real model with 512-dimensional vectors, dot products can reach into the hundreds. That causes a problem.
The next step in the formula is softmax, which converts scores into probabilities using exponentials. If the scores are too large, the exponential of the largest score is astronomically bigger than the exponentials of the others, and softmax pushes nearly all probability mass onto the highest score while assigning near-zero to everything else. The model cannot learn well from such extreme distributions, because the gradients through a saturated softmax become vanishingly small.
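To make the saturation concrete, here is a small sketch comparing softmax on modest scores and on the same scores multiplied by 10. Both score lists are hypothetical, and `softmax` is a helper written for this example (it subtracts the maximum first, a common trick to avoid overflow).

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

small = softmax([1.0, 5.0, 3.0])     # modest scores
large = softmax([10.0, 50.0, 30.0])  # same pattern, 10x the magnitude

print([f"{p:.3f}" for p in small])
print([f"{p:.3g}" for p in large])
```

With the larger scores, the middle entry takes essentially all of the probability mass and the others are pushed to values that are zero for practical purposes.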
The fix is simple: divide every score by $\sqrt{d_k}$. In our case, $d_k = 3$, so we divide by $\sqrt{3} \approx 1.732$:
The relative order is unchanged — “I” still attends most to “love” — but the magnitudes are smaller and more manageable for softmax.
Why $\sqrt{d_k}$ specifically? If the components of two vectors of dimension $d_k$ are independent with zero mean and unit variance, their dot product has variance $d_k$, so its typical magnitude grows like $\sqrt{d_k}$. Dividing by $\sqrt{d_k}$ normalizes the variance back to approximately 1, keeping the softmax outputs in a stable range regardless of vector dimension.
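This scaling argument can be checked empirically. The sketch below assumes vector components drawn independently from a standard normal distribution, which is an idealization rather than a property of any real model, and `dot_product_std` is a hypothetical helper written for this illustration.

```python
import random
import statistics

def dot_product_std(d_k, n_samples=2000, seed=0):
    """Estimate the standard deviation of q . k for random unit-variance vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

for d_k in (4, 16, 64):
    std = dot_product_std(d_k)
    # The last column should stay close to 1 once we divide by sqrt(d_k).
    print(d_k, round(std, 2), round(std / d_k ** 0.5, 2))
```

The unscaled standard deviation grows with the dimension, while the scaled value stays near 1, which is the behavior the division by $\sqrt{d_k}$ is meant to achieve.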
Softmax converts each row of scaled scores into a probability distribution. Every value becomes a number between 0 and 1, and each row sums to exactly 1.
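The formula for one value in a row, where $z_i$ is the $i$-th scaled score in that row:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$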
Let us work through the “I” row:
Dividing each by the sum:
The word “I” pays 70.7% of its attention to “love”, 22.3% to “AI”, and only 7.0% to itself. The softmax has amplified the differences: the original score for “love” was about 5x the score for “I”, but the probability gap is now about 10x.
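The “I” row can be reproduced in a few lines. The score of 5 for “love” and $d_k = 3$ come from the text; the raw scores of 1 (for “I”) and 3 (for “AI”) are inferred from the percentages above, so treat them as an assumption.

```python
import math

d_k = 3
# "I" row of raw scores: 5 is from the text; 1 and 3 are inferred values.
raw_scores = [1.0, 5.0, 3.0]

# Scale, exponentiate, then normalize so the row sums to 1.
scaled = [s / math.sqrt(d_k) for s in raw_scores]
exps = [math.exp(s) for s in scaled]
total = sum(exps)
weights = [e / total for e in exps]

print([round(s, 3) for s in scaled])   # [0.577, 2.887, 1.732]
print([round(w, 3) for w in weights])  # [0.07, 0.707, 0.223]
```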
For the “love” row, all scores are equal (1.732 each), so softmax produces uniform probabilities:

$$[0.333,\ 0.333,\ 0.333]$$
For the “AI” row:
The full Attention Weight Matrix:
Each row sums to 1.0. Softmax guarantees this.
The scores go through exponentiation (which stretches differences) and then normalization (which forces each row to sum to 1). The result is a set of clean probabilities that the model can use to mix information.
Two extremes are worth remembering. If all input scores are equal, the distribution flattens to uniform. If one score is much higher than the rest, softmax sharpens into a near-one-hot distribution.
The last step: we multiply the Attention Weight Matrix by the Value matrix $V$. This produces the final output: one new vector for each word.
For “I”:
Notice: the output for “I” is dominated by the Value of “love” (weight 0.707). The second dimension (1.637) comes almost entirely from “love”’s Value vector. This is the power of attention: “I” now carries information about “love” in its representation.
For “love”:
For “AI”:
Every word now has a new representation that is a weighted mixture of all words’ Values. The output for “I” is no longer just about “I” — it contains information from “love” and “AI”, weighted by how relevant they are. This is how the model builds context-aware representations.
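The weighted mixing can be sketched as below. The attention weights for “I” and “love” are the ones quoted in the text. The Value matrix `V` is hypothetical: its second column was picked so that the second output dimension for “I” matches the 1.637 mentioned above, and the other entries are placeholders.

```python
# Attention weights quoted in the text for "I" and "love".
weights = {
    "I":    [0.070, 0.707, 0.223],
    "love": [1 / 3, 1 / 3, 1 / 3],
}

# Hypothetical Value matrix (rows: "I", "love", "AI").
V = [
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 0.0],
]

def weighted_sum(w, V):
    """Blend the rows of V using the weights in w, one output per dimension."""
    n_dims = len(V[0])
    return [sum(w[j] * V[j][d] for j in range(len(V))) for d in range(n_dims)]

outputs = {word: weighted_sum(w, V) for word, w in weights.items()}
print([round(x, 3) for x in outputs["I"]])  # [0.293, 1.637, 0.07]
```

In this sketch, 1.414 of the 1.637 in the second dimension comes from the “love” row of `V`, which is what “dominated by the Value of love” means numerically.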
Let us review the full pipeline in one view:
Step 1: Start with input embeddings (each word becomes a vector).
Step 2: Project through three learned weight matrices to get $Q$ (queries), $K$ (keys), and $V$ (values).
Step 3: Compute attention scores by multiplying $Q$ with $K^T$. Each score measures how relevant one word is to another.
Step 4: Scale scores by dividing by $\sqrt{d_k}$ to prevent softmax from producing extreme probabilities.
Step 5: Apply softmax to convert scores into attention weights (probabilities that sum to 1).
Step 6: Multiply attention weights by $V$ to get the final output: each word’s new context-aware representation.
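Steps 3 through 6 fit in one short function. This is a minimal sketch in plain Python, not how production libraries implement attention. The `Q`, `K`, and `V` matrices are hypothetical values chosen so that the “I” row reproduces the weights and the 1.637 quoted earlier; all helper names are assumptions of this example.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                # Step 3
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # Step 4
    weights = [softmax(row) for row in scaled]                      # Step 5
    return weights, matmul(weights, V)                              # Step 6

# Hypothetical projections for "I", "love", "AI" (d_k = 3).
Q = [[0, 1, 2], [1, 1, 1], [1, 0, 1]]
K = [[0, 5, -2], [0, 1, 2], [1, 1, 1]]
V = [[1, 0, 1], [0, 2, 0], [1, 1, 0]]

weights, output = attention(Q, K, V)
print([round(w, 3) for w in weights[0]])  # [0.07, 0.707, 0.223]
print(round(output[0][1], 3))             # 1.637
```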
Every large language model — GPT, Claude, Gemini, Llama — runs this exact computation billions of times during inference. The numbers we used were tiny (3 words, 4 dimensions) for clarity, but the math is identical at scale. Understanding these six steps means understanding the core mechanism behind modern AI.