Transformers

The Transformer is a deep learning architecture built entirely on attention mechanisms, dispensing with recurrence (RNNs) and convolutions (CNNs). This design allows massive parallelization and achieves state-of-the-art performance in sequence modeling.

Scaled Dot-Product Attention

The “heart” of the transformer is the attention function mapping a query ($Q$), keys ($K$), and values ($V$) to an output:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimensionality of the keys.
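The formula can be sketched directly in NumPy. This is a minimal illustration of the equation above (function name and shapes are my own choices, not from the note):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # weighted sum of values
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with dimensionality, which would otherwise push the softmax into regions with vanishing gradients.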

The Search Analogy

Think of attention as a search engine lookup:

  1. You have a Query (what you are looking for).
  2. You compare it against Keys (the labels or “indices” of all available info).
  3. You take a weighted sum of the Values (the actual content) based on how well their Keys matched your Query.
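The three steps above can be walked through with a toy example (the vectors here are made up purely for illustration):

```python
import numpy as np

query = np.array([1.0, 0.0])        # step 1: what we are looking for
keys = np.array([[1.0, 0.0],        # exact match for the query
                 [0.0, 1.0],        # orthogonal to the query
                 [0.7, 0.7]])       # partial match
values = np.array([10.0, 20.0, 30.0])  # the actual content behind each key

scores = keys @ query / np.sqrt(2)             # step 2: compare query to keys
weights = np.exp(scores) / np.exp(scores).sum()  # softmax → attention weights
output = weights @ values                      # step 3: weighted sum of values
```

The best-matching key receives the largest weight, so `output` is pulled toward that key's value, but every value still contributes a little; attention is a soft lookup, not a hard one.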

Key Components

  1. Multi-Head Attention: Runs several attention heads in parallel, each with its own learned projections, to capture different types of relationships (e.g., syntax vs. semantics).
  2. Positional Encoding: Since the model has no recurrence, it lacks an inherent sense of word order. Sinusoidal or learned encodings are added to inputs to inject position information.
  3. Encoder-Decoder Structure:
    • Encoder: Bi-directional context (e.g., BERT).
    • Decoder: Uni-directional/autoregressive context (e.g., GPT).
    • Combined: For translation or summarization (e.g., T5).
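The sinusoidal positional encoding mentioned above can be sketched as follows (a minimal version, assuming an even model dimension; the function name is my own):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angles = positions / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dims get sine
    pe[:, 1::2] = np.cos(angles)                     # odd dims get cosine
    return pe
```

Each position gets a unique pattern of sines and cosines at different frequencies, so the model can recover both absolute and relative order after the encoding is added to the token embeddings.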

Connections

Appears In