Transformers
The Transformer is a deep learning architecture based entirely on attention mechanisms, dispensing with recurrence (RNNs) and convolutions (CNNs). Because every position attends to every other position directly, it allows for massive parallelization during training and achieves state-of-the-art performance in sequence modeling.
Scaled Dot-Product Attention
The “heart” of the transformer is the attention function mapping a query ($Q$), keys ($K$), and values ($V$) to an output:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimensionality of the keys.
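The formula above can be sketched directly in NumPy. This is a minimal illustration (single head, no masking, no batch dimension), not a production implementation; the function name and toy shapes are my own choices:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D Q, K, V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of the value vectors

# toy example: 2 queries against 3 key/value pairs, dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4): one output vector per query
```

Note that a zero query scores every key equally, so its output is just the mean of the values — a useful sanity check on the softmax weighting.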
The Search Analogy
Think of a transformer like a search engine:
- You have a Query (what you are looking for).
- You compare it against Keys (the labels or “indices” of all available info).
- You take a weighted sum of the Values (the actual content) based on how well their Keys matched your Query.
Key Components
- Multi-Head Attention: Runs multiple attention operations (“heads”) in parallel, each with its own learned projections, to capture different types of relationships (e.g., syntax vs. semantics).
- Positional Encoding: Since the model has no recurrence, it lacks an inherent sense of word order. Sinusoidal or learned encodings are added to inputs to inject position information.
- Encoder-Decoder Structure:
  - Encoder: Bi-directional context (e.g., BERT).
  - Decoder: Uni-directional/autoregressive context (e.g., GPT).
  - Combined: For translation or summarization (e.g., T5).
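The sinusoidal positional encoding mentioned above assigns each position a fixed pattern of sine and cosine values at geometrically spaced frequencies. A minimal sketch (function name and toy dimensions are my own; assumes an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)  # one rate per sin/cos pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)  # even columns: sine
    pe[:, 1::2] = np.cos(positions / div)  # odd columns: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8): added element-wise to the token embeddings
```

Because the encodings are deterministic, the model can generalize to sequence lengths not seen during training, which is one argument for sinusoidal over learned encodings.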
Connections
- Foundation for: BERT (used in COIL, ColBERT), T5 (used in monoT5, DocT5Query, DSI), and GPT.
- Replaced: LSTMs and GRUs in most NLP and Neural Reranking tasks.