Self Attention in Transformer

🔶 What is Self-Attention?

Self-attention is the core mechanism in the Transformer architecture (Vaswani et al., 2017) that allows the model to weigh the importance of different words in a sequence when encoding a particular word.

In simpler terms: it helps the model figure out which other words to pay attention to when processing a specific word.

For example, in the sentence “The cat sat on the mat because it was tired,” the word “it” should pay more attention to “cat” (not “mat”) to understand what “it” refers to.

Let's make it even simpler: imagine you're reading a sentence and trying to understand what each word means in context.

For example:

“The cat sat on the mat because it was tired.”

When you read “it”, your brain automatically connects it to “cat” — you pay attention to the right part of the sentence to understand the meaning.

That’s basically what self-attention does inside a Transformer:
👉 it helps the model figure out which other words are important for understanding each word.

🔶 How Does Self-Attention Work?

For each word in the input sequence:

  1. Compute three vectors:
    • Query (Q)
    • Key (K)
    • Value (V)
    These are obtained by multiplying the word embedding with learned weight matrices.
  2. Calculate attention scores:
    • Compute the similarity between the Query of the current word and the Keys of all words.
    • This gives a score indicating how much focus to put on each word.
  3. Normalize scores:
    • Apply softmax to turn the scores into probabilities (attention weights).
  4. Compute weighted sum:
    • Multiply the attention weights by the Value vectors of all words and sum them up.
    • This gives the final representation of the current word, enriched by its context.
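
Here is a minimal NumPy sketch of these four steps; the sentence length, dimensions, weights, and embeddings below are random stand-ins for illustration, not values from a trained model:

```python
import numpy as np

np.random.seed(0)
seq_len, d_model, d_k = 6, 8, 8               # 6 tokens, toy embedding size

X = np.random.randn(seq_len, d_model)         # token embeddings, one row per word

# Step 1: learned weight matrices project embeddings into Q, K, V
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: attention scores = similarity between each Query and every Key
scores = Q @ K.T / np.sqrt(d_k)               # shape: (seq_len, seq_len)

# Step 3: softmax turns each row of scores into attention weights
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 4: weighted sum of Values gives context-enriched representations
output = weights @ V                          # shape: (seq_len, d_k)
print(weights.round(2))                       # row i = how much word i attends to each word
```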

Let's simplify it: for every word (or token) in a sentence:

  1. Look around → Each word “looks” at the other words in the sentence.
  2. Decide what’s important → The model figures out how strongly each word should be connected to the others.
  3. Mix the information → Each word updates its meaning by blending in information from the words it paid attention to.

So, when the Transformer processes “it”, it “realizes” that “cat” is important, not “mat”.

🔶 Formula (Scaled Dot-Product Attention)

Attention(Q, K, V) = softmax(QKᵀ / √dk) V

  • Q: Query matrix
  • K: Key matrix
  • V: Value matrix
  • dk: dimension of the Key vectors (used for scaling, so the dot products don't grow too large)

🔶 Benefits of Self-Attention

✅ Captures long-range dependencies (no matter how far apart words are)
✅ Enables parallel processing (unlike RNNs)
✅ Figures out each word's meaning dynamically, depending on the sentence

🔶 Visualization (conceptual)

```
Input:   The   cat   sat   on   the   mat
Weights: 0.1   0.5   0.2   0.05  0.1  0.05
```

For the word “cat”, it may pay more attention to itself and nearby words, while “sat” may look at both “cat” and “mat”.

🌟 Summary in one line

Self-attention helps each word in a sentence understand its meaning by looking at — and learning from — all the other words.


🌟 What is Query (Q)?

The Query vector is like the “question” that one word asks about how much it should care about other words in the sentence.

For each word:

  • The model generates a Query vector.
  • This vector is compared against the Key vectors of all words (including itself).
  • The comparison gives attention scores that tell which words matter most when understanding this word.

🗣 Simple example

Sentence:

“The cat sat on the mat.”

Let’s say we’re working on the word “cat”.

  • We create Query(cat) → this is like asking:
    “Who in this sentence is important for me to understand my meaning?”

We then compare Query(cat) with Key(The), Key(cat), Key(sat), Key(on), Key(the), Key(mat).
This gives scores like:

  • The → 0.1
  • cat → 0.4
  • sat → 0.3
  • on → 0.1
  • the → 0.05
  • mat → 0.05

So, “cat” pays the most attention to itself, followed by “sat.”
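
The softmax step behind weights like these can be sketched directly; the raw scores below are invented, chosen so they roughly reproduce the weights above:

```python
import numpy as np

# hypothetical Query(cat) · Key dot products for: The, cat, sat, on, the, mat
raw_scores = np.array([1.0, 2.4, 2.1, 1.0, 0.3, 0.3])

weights = np.exp(raw_scores) / np.exp(raw_scores).sum()
print(weights.round(2))   # ≈ [0.1, 0.4, 0.3, 0.1, 0.05, 0.05], summing to 1
```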

🛠️ How is Query created?

For each word: Query = Embedding × WQ

  • Embedding → the vector representing the word.
  • WQ → a learned weight matrix.
  • Result → Query vector.
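
In code, that projection is a single matrix multiplication (toy sizes, with random numbers standing in for a real embedding and a trained WQ):

```python
import numpy as np

d_model, d_k = 8, 8
embedding_cat = np.random.randn(d_model)   # stand-in for the embedding of "cat"
W_Q = np.random.randn(d_model, d_k)        # learned weight matrix (trained, not random, in a real model)

query_cat = embedding_cat @ W_Q            # Query(cat), ready to be compared against Keys
```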

🌟 What is Key (K)?

The Key vector is like a label or tag that tells other words:
👉 “Here’s the kind of information I carry.”

In the self-attention process, every word has:

  • A Query → what it wants to know
  • A Key → what it offers to others
  • A Value → the actual content it provides

🗣 Example in action

Sentence:

“The cat sat on the mat.”

Let’s say the model is working on “cat”:

  • It has a Query(cat) → asking: “Who is important for me?”

Then, it compares Query(cat) to the Keys of all words:

  • Key(The)
  • Key(cat)
  • Key(sat)
  • Key(on)
  • Key(the)
  • Key(mat)

These Keys are like ID cards saying what each word is about.
By comparing Query to Keys, the model decides which words deserve attention.

🛠️ How is Key created?

For each word: Key = Embedding × WK

  • Embedding → vector of the word.
  • WK → learned weight matrix.
  • Result → Key vector.

🚀 Intuition

✅ Query → What this word is looking for.
✅ Key → What each word offers as a “summary.”
✅ Together → The model measures how well the Query and Key match to decide attention weights.


🌟 What is Value (V)?

The Value vector carries the actual information that will be passed along once attention is decided.

In simpler words:

  • The Query figures out what to look for.
  • The Key explains what each word offers.
  • The Value provides the actual content that will be mixed into the final output.

So, after the model compares Queries and Keys and decides which words to focus on,
👉 it collects the Values (weighted by the attention scores) to build a richer meaning for each word.

🗣 Example

Sentence:

“The cat sat on the mat.”

Let’s say we’re focusing on “cat”:

  • We calculate:
    • Query(cat)
    • Compare Query(cat) with Key(The), Key(cat), Key(sat), Key(on), Key(the), Key(mat)
    • Get attention scores → say, 0.1, 0.45, 0.3, 0.05, 0.05, 0.05

Finally:

  • We take the Values from each word → Value(The), Value(cat), Value(sat), Value(on), Value(the), Value(mat)
  • Multiply them by the attention scores
  • Sum them → this gives the updated, context-aware meaning of “cat.”
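
That final weighted sum can be sketched like this, reusing the illustrative attention scores from above with toy Value vectors:

```python
import numpy as np

np.random.seed(2)
values = np.random.randn(6, 8)   # a toy Value vector for each of the 6 words

# illustrative attention weights for "cat": The, cat, sat, on, the, mat
weights = np.array([0.1, 0.45, 0.3, 0.05, 0.05, 0.05])

new_cat = weights @ values       # updated, context-aware representation of "cat"
```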

🛠️ How is Value created?

For each word: Value = Embedding × WV

  • Embedding → vector of the word.
  • WV → learned weight matrix.
  • Result → Value vector.

🚀 Intuition

✅ Query → “What am I looking for?”
✅ Key → “What can I offer to others?”
✅ Value → “Here’s my full info if you decide I matter.”

List of Popular Self-Attention Models

| Model | D (Embedding Size) | # Heads | Head Dim (D / heads) | Notes |
| --- | --- | --- | --- | --- |
| GPT-2 Small | 768 | 12 | 64 | 117M parameters |
| GPT-2 Medium | 1024 | 16 | 64 | 345M parameters |
| GPT-2 Large | 1280 | 20 | 64 | 762M parameters |
| GPT-2 XL | 1600 | 25 | 64 | 1.5B parameters |
| GPT-3 (175B) | 12288 | 96 | 128 | Huge, closed weights |
| GPT-4 (est.) | ~12288–32768 | ? | ? | Specs not fully disclosed |
| BERT Base | 768 | 12 | 64 | 110M parameters |
| BERT Large | 1024 | 16 | 64 | 340M parameters |
| DistilBERT | 768 | 12 | 64 | Smaller BERT |
| RoBERTa Base | 768 | 12 | 64 | BERT optimized |
| RoBERTa Large | 1024 | 16 | 64 | More training data |
| T5 Small | 512 | 8 | 64 | Text-to-text model |
| T5 Base | 768 | 12 | 64 | Encoder-decoder model |
| T5 Large | 1024 | 16 | 64 | |
| XLNet Base | 768 | 12 | 64 | Permutation-based |
| XLNet Large | 1024 | 16 | 64 | |
| ALBERT Base | 768 | 12 | 64 | Shared weights |
| ALBERT Large | 1024 | 16 | 64 | |
| TinyBERT | 312 | 12 | 26 | For mobile devices |
| MobileBERT | 512 | 4 | 128 | Highly efficient |
| Longformer | 768–1024 | 12–16 | 64 | Long sequence support |
| ViT-B/16 (Vision Transformer) | 768 | 12 | 64 | Used for image patches |
| ViT-L/32 | 1024 | 16 | 64 | Larger image model |

Observations

  • Most models keep the head dimension at 64 for stability and optimization.
  • D is typically divisible by # heads, since multi-head attention reshapes D into (heads × head dim) chunks (see the sketch below).
  • Bigger models (GPT-3, GPT-4) scale up both D and # heads heavily.
  • Lightweight models (MobileBERT, TinyBERT) trade off D and # heads for speed and memory.
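
A small sketch of why the divisibility matters, using GPT-2 Small's sizes from the table:

```python
import numpy as np

seq_len, d_model, n_heads = 10, 768, 12   # GPT-2 Small sizes
head_dim = d_model // n_heads             # 768 / 12 = 64

X = np.random.randn(seq_len, d_model)

# multi-head attention splits the embedding dimension across heads ...
X_heads = X.reshape(seq_len, n_heads, head_dim)   # (10, 12, 64)

# ... each head attends over its own 64-dim slice, then results are concatenated back
X_merged = X_heads.reshape(seq_len, d_model)      # (10, 768)

assert n_heads * head_dim == d_model   # the reshape only works when D divides evenly
```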

Q. What is a learned weight matrix?

A. A learned weight matrix (such as WQ, WK, or WV above) is a matrix of trainable parameters. It starts with random values and is updated during training by backpropagation, so the model gradually learns projections that turn embeddings into useful Queries, Keys, and Values.
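
As a small sketch of what “learned” means in practice, here WQ is held in a PyTorch linear layer whose weights are trainable parameters (the sizes are arbitrary):

```python
import torch.nn as nn

d_model, d_k = 8, 8
W_Q = nn.Linear(d_model, d_k, bias=False)  # holds a weight matrix of shape (d_k, d_model)

# The matrix starts random and requires_grad=True, so every training step's
# backpropagation nudges its values toward more useful Query projections.
print(W_Q.weight.shape)          # torch.Size([8, 8])
print(W_Q.weight.requires_grad)  # True
```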

Leave a Comment