Transformers
The Transformer is a deep learning architecture based entirely on attention mechanisms, dispensing with recurrence (RNNs) and convolutions (CNNs). Because every position attends to every other position directly, it allows for massive parallelization during training and achieves state-of-the-art performance in sequence modeling.
Scaled Dot-Product Attention
The “heart” of the transformer is the attention function mapping a query ($Q$), keys ($K$), and values ($V$) to an output:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimensionality of the keys.
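The formula above can be sketched directly in NumPy. This is a minimal illustration (single head, no masking, no batch dimension), not a production implementation; the function name and toy shapes are my own choices:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D Q, K, V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of the value vectors

# toy example: 2 queries against 3 key/value pairs, dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4): one output vector per query
```

Note that a zero query scores every key equally, so its output is just the mean of the values — a useful sanity check on the softmax weighting.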
The Search Analogy
Think of a transformer like a search engine:
- You have a Query (what you are looking for).
- You compare it against Keys (the labels or “indices” of all available info).
- You take a weighted sum of the Values (the actual content) based on how well their Keys matched your Query.
Key Components
- Multi-Head Attention: Runs multiple attention operations (“heads”) in parallel, each with its own learned projections, to capture different types of relationships (e.g., syntax vs. semantics).
- Positional Encoding: Since the model has no recurrence, it lacks an inherent sense of word order. Sinusoidal or learned encodings are added to inputs to inject position information.
- Encoder-Decoder Structure:
  - Encoder: Bi-directional context (e.g., BERT).
  - Decoder: Uni-directional/autoregressive context (e.g., GPT).
  - Combined: For translation or summarization (e.g., T5).
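The sinusoidal positional encoding mentioned above assigns each position a fixed pattern of sine and cosine values at geometrically spaced frequencies. A minimal sketch (function name and toy dimensions are my own; assumes an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)  # one rate per sin/cos pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)  # even columns: sine
    pe[:, 1::2] = np.cos(positions / div)  # odd columns: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8): added element-wise to the token embeddings
```

Because the encodings are deterministic, the model can generalize to sequence lengths not seen during training, which is one argument for sinusoidal over learned encodings.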
Connections
- Foundation for: BERT (used in COIL, ColBERT), T5 (used in monoT5, DocT5Query, DSI), and GPT.
- Replaced: LSTMs and GRUs in most NLP and Neural Reranking tasks.