FiD

Definition

FiD (Fusion-in-Decoder)

FiD (Izacard & Grave, 2021) is a retrieval-augmented seq2seq architecture for open-domain QA that encodes each retrieved passage independently alongside the question, then fuses all passage representations in the decoder via cross-attention. By separating encoding (per-passage, parallel) from fusion (joint, in the decoder), FiD scales to 100+ passages where naive concatenation hits context limits and where RAG’s marginalization is slow.

Intuition

Split the work: encode locally, fuse globally

Naive RAG concatenates all passages into one prompt — the self-attention cost grows quadratically with the total number of tokens, so you can only afford a handful of passages. Lewis et al.’s RAG avoids this by marginalizing over documents, but inference is slow and each output is grounded in one document at a time.

FiD’s trick: run the encoder separately on each (question, passage) pair. Encoding $k$ passages of length $L$ costs $O(kL^2)$ instead of $O((kL)^2)$, which is linear rather than quadratic in the number of passages. The decoder then attends over the concatenation of all passage encodings, so cross-attention can synthesize evidence across passages to produce answers not stated verbatim in any single one.

Mathematical Formulation

Given question $q$ and retrieved passages $z_1, \dots, z_k$, FiD builds one encoder representation per passage and concatenates them for the decoder:

$$H_i = \mathrm{Enc}(x_i), \qquad H = [H_1; \dots; H_k], \qquad p(y_t \mid y_{<t}, q, z_{1:k}) = \mathrm{Dec}(y_t \mid y_{<t}, H)$$

where:

  • $q$: the input question; $z_i$: the $i$-th retrieved passage (with title $t_i$), prepended with special markers (question:, title:, context:) to form the encoder input $x_i$
  • $H_i$: encoder hidden states for passage $i$, computed independently (no attention across passages in the encoder)
  • $[\,\cdot\,;\,\cdot\,]$: concatenation; $H$: the full set of token representations the decoder attends to
  • $\mathrm{Dec}$: autoregressive T5 decoder; its cross-attention ranges over all of $H$, so a single answer can draw on multiple passages
  • $y$: the generated answer; $y_{<t}$: previously generated tokens
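
The encode-then-fuse step can be sketched with NumPy. This is only an illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in that returns random hidden states instead of running a real T5 encoder, and `cross_attention` is a single-head, single-query simplification of what one decoder layer does at one generation step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8            # hidden size (illustrative)
lengths = [5, 7, 4]    # token counts of the formatted inputs x_1..x_k
k = len(lengths)


def toy_encoder(n_tokens: int) -> np.ndarray:
    """Hypothetical stand-in for Enc(x_i): (n_tokens, d_model) hidden states.
    A real encoder would self-attend over the tokens of x_i only."""
    return rng.standard_normal((n_tokens, d_model))


# H_i = Enc(x_i): one independent encoder call per passage.
H_list = [toy_encoder(n) for n in lengths]

# H = [H_1; ...; H_k]: concatenate along the token axis.
H = np.concatenate(H_list, axis=0)          # shape (sum(lengths), d_model)


def cross_attention(query: np.ndarray, memory: np.ndarray):
    """Scaled dot-product attention of one decoder state over all of memory."""
    scores = memory @ query / np.sqrt(memory.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory, weights


decoder_state = rng.standard_normal(d_model)
context, weights = cross_attention(decoder_state, H)

# Attention mass that this decoder step assigns to each passage.
bounds = np.cumsum([0] + lengths)
mass_per_passage = [float(weights[bounds[i]:bounds[i + 1]].sum()) for i in range(k)]
print(H.shape, mass_per_passage)
```

Because the softmax runs over the tokens of every passage at once, a single decoder step can place weight on several passages, which is where the fusion happens.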

Cost: linear vs. quadratic in passages

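Independent encoding makes the dominant self-attention cost linear in the number of passages $k$: roughly $O(kL^2)$ for $k$ passages of length $L$, versus $O((kL)^2)$ for a single concatenated input. Only the (much cheaper) decoder cross-attention sees all tokens jointly.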
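
A quick back-of-the-envelope count makes the gap concrete. The sketch below counts token pairs scored by attention only (per layer and head, ignoring constants and feed-forward cost); the values of `k`, `L`, and the answer length `T` are illustrative assumptions, not measurements.

```python
def concat_self_attention_pairs(k: int, L: int) -> int:
    """One encoder pass over all k passages joined together: (k*L)^2 pairs."""
    return (k * L) ** 2


def fid_self_attention_pairs(k: int, L: int) -> int:
    """k independent encoder passes of length L: k * L^2 pairs."""
    return k * L ** 2


def fid_cross_attention_pairs(k: int, L: int, T: int) -> int:
    """Each of T generated tokens attends over all k*L encoded tokens."""
    return T * k * L


k, L, T = 100, 250, 20   # passages, tokens per passage, answer tokens (assumed)
concat = concat_self_attention_pairs(k, L)
fid = fid_self_attention_pairs(k, L)
cross = fid_cross_attention_pairs(k, L, T)

print(concat)          # 625000000
print(fid)             # 6250000
print(concat // fid)   # 100, i.e. the saving factor equals k
print(cross)           # 500000
```

Under these assumptions the encoder saving is exactly a factor of `k`, and the decoder cross-attention term stays small because answers are short.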

Key Properties / Variants

  • Scalability: Performance on Natural Questions keeps improving as the passage count grows to 100, whereas concatenation saturates or degrades at the context limit.
  • Cross-document synthesis: Decoder cross-attention fuses evidence; answers can be assembled from facts spread across several passages.
  • Architectural simplicity: Standard seq2seq (T5) with only modified input formatting — no new parameters, no marginalization machinery.
  • Disconnected retriever: Vanilla FiD uses a fixed retriever (BM25 or DPR); there is no end-to-end gradient flow to the retriever (contrast with end-to-end RAG). Atlas later closes this gap with a learnable, periodically reindexed dense Bi-Encoder.
  • Inference cost grows with $k$: Decoder cross-attention scans all passage tokens at each step; tractable but linear in $k$.
  • Non-adaptive: Always retrieves a fixed number of passages regardless of query difficulty (contrast with adaptive Self-RAG).
  • Mitigates “lost in the middle”: Independent encoding sidesteps the positional attention decay that plagues single-context concatenation (each passage is encoded as if at the start).

Algorithm: FiD (Fusion-in-Decoder) — Inference
────────────────────────────────────────────────
Input: question q, retriever R, corpus D, generator (Encoder, Decoder), k
  1. Z ← R.retrieve(q, D, top_k = k)          # fixed retriever (BM25 / DPR)
  2. for each z_i in Z:                         # independent, parallelizable
       x_i  ← "question: " + q +
              " title: " + title(z_i) +
              " context: " + text(z_i)
       H_i  ← Encoder(x_i)                      # per-passage hidden states
  3. H ← concat(H_1, ..., H_k)                  # fuse in decoder input
  4. y ← Decoder.generate(cross_attend_over = H)  # cross-attention over all passages
  return y
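
Steps 2 to 4 of the algorithm can be sketched in plain Python. The names `Passage`, `format_fid_input`, and `fid_answer` are hypothetical, and the encoder and decoder are passed in as callables so the control flow can run without a real model; in the demo, whitespace tokens stand in for hidden states.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Passage:
    """Hypothetical container for one retrieved passage."""
    title: str
    text: str


def format_fid_input(question: str, passage: Passage) -> str:
    """Build x_i with the question:/title:/context: markers from step 2."""
    return f"question: {question} title: {passage.title} context: {passage.text}"


def fid_answer(
    question: str,
    passages: Sequence[Passage],
    encode: Callable[[str], List],
    generate: Callable[[List], str],
) -> str:
    """Encode each input on its own, concatenate the states, decode once."""
    encoded = [encode(format_fid_input(question, p)) for p in passages]
    fused = [state for H_i in encoded for state in H_i]  # H = concat(H_1..H_k)
    return generate(fused)


# Toy stand-ins (assumptions): str.split plays the encoder, and the "decoder"
# only reports how many states it was allowed to attend over.
passages = [
    Passage("Hamlet", "Hamlet is a tragedy by William Shakespeare."),
    Passage("Macbeth", "Macbeth is a tragedy by William Shakespeare."),
]
first_input = format_fid_input("who wrote Hamlet", passages[0])
summary = fid_answer(
    "who wrote Hamlet",
    passages,
    encode=str.split,
    generate=lambda states: f"attended over {len(states)} states",
)
print(first_input)
print(summary)
```

Note that the question is repeated in every input: each encoder call sees the question with exactly one passage, and only the decoder sees everything together.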

FiD vs. RAG — what "fusion" means

In RAG (Lewis et al.), documents are latent variables: the model marginalizes over them, and gradients flow to the query encoder. In FiD there is no marginalization: all passage encodings are fed jointly to the decoder, and the retriever is frozen. FiD trades end-to-end learnability for a large gain in how many passages it can exploit (44.5 EM for RAG vs. up to 68.2 EM for FiD with gold passages on Natural Questions).
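
To make the contrast concrete, here is a toy version of sequence-level marginalization, p(y | q) = Σ_i p(z_i | q) · p(y | q, z_i). The probabilities are made-up illustrative numbers, and `rag_sequence_marginal` is a hypothetical helper, not code from either paper.

```python
from typing import Sequence


def rag_sequence_marginal(p_doc: Sequence[float], p_answer_given_doc: Sequence[float]) -> float:
    """Weight each document's answer probability by its retrieval probability."""
    return sum(pz * py for pz, py in zip(p_doc, p_answer_given_doc))


# Illustrative values (assumptions): three retrieved documents.
p_doc = [0.6, 0.3, 0.1]                # p(z_i | q) from the retriever
p_answer_given_doc = [0.2, 0.9, 0.05]  # p(y | q, z_i), one generator pass each

p_answer = rag_sequence_marginal(p_doc, p_answer_given_doc)
generator_passes = len(p_doc)          # one conditional pass per document

print(round(p_answer, 3))   # 0.395
print(generator_passes)     # 3
```

Each term conditions on a single document, so evidence is combined only by averaging probabilities. FiD instead runs one decoder pass whose cross-attention can read tokens from all passages in the same step.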

Connections

Appears In