IR Lecture 8: Generative Retrieval

Overview

Generative Information Retrieval (GenIR) represents a paradigm shift from the traditional “retrieve-then-rank” pipeline. Instead of using an external index (inverted or vector-based) to look up documents, GenIR encodes the entire corpus into the parameters of a single sequence-to-sequence model. Retrieval is framed as an autoregressive decoding task, where the model directly generates the identifier (DocID) of relevant documents given a query.

1. The Paradigms of Retrieval

| Dimension | Classical (Retrieve-then-Rank) | Generative IR |
| --- | --- | --- |
| Index | Inverted / vector index (external) | Model parameters (internal) |
| Retrieval step | Lookup + scoring (deterministic) | Autoregressive decoding |
| Differentiable | Partial (dense only) | Fully end-to-end |
| Corpus update | Re-index (fast) | Fine-tune (slow) |
| Interpretability | Document-level ranking | Token-level generation |
| Knowledge | Up-to-date (external) | Parametric memory (static/stale) |
| Scalability | Sub-linear (ANN index) | Linear in document count (model capacity) |

Simplification

GenIR replaces the large external index with an internal one. The model “memorizes” documents by associating their content with specific identifiers.


2. Core Operations in Generative Retrieval

2.1 Indexing (Memorization Phase)

The model learns to map document content to its corresponding document identifier (docid).

Indexing Loss

Given a corpus $\mathcal{D} = \{d_1, \dots, d_N\}$ and its docid set $\mathcal{I} = \{\mathrm{id}_1, \dots, \mathrm{id}_N\}$, the goal is to maximize the likelihood of each docid given its document:

$$\mathcal{L}_{\text{index}}(\theta) = \sum_{d_i \in \mathcal{D}} \log p(\mathrm{id}_i \mid d_i; \theta)$$

where $\theta$ are the model parameters.
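As a minimal illustration of this objective, the sketch below scores a docid token sequence under a fixed, hypothetical per-token probability table standing in for the seq2seq model (the probabilities and docid are made up for the example):

```python
import math

# Toy: the indexing objective scores the docid token sequence under the
# model's per-token distribution; here the "model" is a fixed lookup table.
def sequence_log_likelihood(token_probs, docid_tokens):
    """Sum of log p(token_t | prefix, document) over the docid tokens."""
    return sum(math.log(token_probs[t]) for t in docid_tokens)

# Hypothetical per-token probabilities for docid "5.6.7" given one document.
probs = {"5": 0.9, "6": 0.8, "7": 0.95}
ll = sequence_log_likelihood(probs, ["5", "6", "7"])
```

Maximizing the indexing loss pushes this sequence log-likelihood toward zero for every (document, docid) pair in the corpus.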

2.2 Retrieval (Inference Phase)

Given a query $q$, the model generates the most relevant docids.

Retrieval Loss

Given a query set $\mathcal{Q}$ with a relevant docid $\mathrm{id}_j$ for each query $q_j$:

$$\mathcal{L}_{\text{retrieval}}(\theta) = \sum_{(q_j, \mathrm{id}_j) \in \mathcal{Q}} \log p(\mathrm{id}_j \mid q_j; \theta)$$

2.3 Unified Training

The model is optimized end-to-end using a global objective that combines both indexing and retrieval:

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{index}}(\theta) + \mathcal{L}_{\text{retrieval}}(\theta)$$
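Concretely, unified training just means mixing both kinds of (input, target) pairs into one seq2seq training set. A minimal sketch, with hypothetical data and field names:

```python
# Sketch: building a DSI-style multi-task training set.
# Both tasks share one seq2seq format: input text -> docid string.

def build_training_examples(corpus, qrels):
    """corpus: {docid: document_text}; qrels: list of (query, docid) pairs."""
    examples = []
    # Indexing task: document text -> docid (memorization).
    for docid, text in corpus.items():
        examples.append({"input": text, "target": docid})
    # Retrieval task: query -> docid (association).
    for query, docid in qrels:
        examples.append({"input": query, "target": docid})
    return examples

corpus = {"doc_567": "How to roast pumpkin seeds ..."}
qrels = [("pumpkin seed storage", "doc_567")]
examples = build_training_examples(corpus, qrels)
```

Training a standard seq2seq model on this mixed set optimizes both terms of the global objective at once.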

2.4 Inference (Decoding)

To retrieve the top-k documents, the model uses constrained beam search to generate docid strings token-by-token:

$$p(\mathrm{id} \mid q; \theta) = \prod_{t=1}^{T} p(\mathrm{id}_t \mid \mathrm{id}_{<t}, q; \theta)$$

Generation stops when the <EOS> token is produced. Candidates are ranked by their joint (sequence) probability.
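The decoding step can be sketched as beam search over a prefix trie of valid docids, so that only real identifiers can ever be emitted. The toy "model" below is a hand-written probability table, not a trained network:

```python
import math

def build_trie(docids):
    """Prefix trie over docid token sequences; the None key marks end-of-sequence."""
    trie = {}
    for tokens in docids:
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[None] = {}  # <EOS> marker: this prefix is a complete docid
    return trie

def constrained_beam_search(step_logprob, trie, beam_size=2):
    """step_logprob(prefix, token) -> log p(token | prefix, query).
    Only tokens that keep the prefix inside the trie are expanded."""
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-prob, trie node)
    finished = []
    while beams:
        candidates = []
        for prefix, score, node in beams:
            for tok, child in node.items():
                if tok is None:  # reached a valid, complete docid
                    finished.append((prefix, score))
                else:
                    candidates.append(
                        (prefix + (tok,), score + step_logprob(prefix, tok), child))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = candidates[:beam_size]  # greedy local pruning happens here
    finished.sort(key=lambda b: b[1], reverse=True)
    return finished

# Toy distribution: prefers docid "1.2" over "1.3" and "2.1".
probs = {((), "1"): 0.7, ((), "2"): 0.3,
         (("1",), "2"): 0.6, (("1",), "3"): 0.4,
         (("2",), "1"): 1.0}
lp = lambda prefix, tok: math.log(probs[(prefix, tok)])
trie = build_trie([("1", "2"), ("1", "3"), ("2", "1")])
top = constrained_beam_search(lp, trie)
```

Note that with `beam_size=2` the candidate `1.3` is pruned even though its sequence probability (0.28) is close to that of `2.1` (0.30); this is exactly the local pruning behavior discussed later under reliability.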


3. Key Architectures & Models

3.1 Differentiable Search Index (DSI)

Introduced by [Tay et al., 2022], DSI is the seminal GenIR model.

  • Phase 1 Indexing: Maps document text/prefix to DocID.
  • Phase 2 Retrieval: Maps query to DocID.
  • Inference: Uses a trie-based constrained beam search to ensure only valid DocIDs are generated.

3.2 Neural Corpus Indexer (NCI)

Three major improvements over DSI:

  1. Prefix-aware weight-adaptive decoder: Uses different heads for different levels of the identifier hierarchy.
  2. Query Augmentation: Generates synthetic queries for documents and trains on (synthetic query, docid) pairs.
  3. Consistency Training: Ensures similar queries produce the same DocID.
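The query-augmentation step above can be sketched as follows. NCI uses a trained doc2query model to produce the synthetic queries; the trivial word-window generator here is a stand-in purely to show the data flow, and all names and data are illustrative:

```python
# Sketch: NCI-style query augmentation. A real system would call a trained
# doc2query model; this stand-in emits short word windows from the document.

def generate_synthetic_queries(doc_text, n=2):
    """Hypothetical stand-in for a doc2query model."""
    words = doc_text.lower().split()
    return [" ".join(words[i:i + 3]) for i in range(min(n, max(1, len(words) - 2)))]

def augment(corpus):
    """corpus: {docid: text} -> list of (synthetic_query, docid) training pairs."""
    pairs = []
    for docid, text in corpus.items():
        for q in generate_synthetic_queries(text):
            pairs.append((q, docid))
    return pairs

pairs = augment({"doc_567": "How to roast pumpkin seeds at home"})
```

The resulting (synthetic query, docid) pairs are simply appended to the retrieval portion of the training set, which narrows the gap between the indexing inputs (documents) and the inference-time inputs (queries).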

3.3 GENRE (Generative Entity Retrieval)

Focuses on entity linking by generating human-readable Wikipedia titles as identifiers.

  • Uses pre-trained language knowledge (BART).
  • Wikipedia titles act as a natural, structured ID space.

4. Document Identifier (DocID) Design

Choosing how to represent a document as a string is critical.

| ID Type | Example | Pros | Cons |
| --- | --- | --- | --- |
| Naive/Atomic | `1024`, `doc_1` | Simple | No semantic meaning; hard to learn |
| Semantic String | Title, URL | Leverages LM pre-training | Can be long, ambiguous, or use rare tokens |
| Hierarchical (Clusters) | `1.2.5.4` | Efficient decoding via tries | Sensitive to clustering quality |
| Semantic Numeric | Product of k-means | Structured, fixed length | Requires separate clustering phase |
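Hierarchical semantic IDs can be sketched as follows. DSI builds them with hierarchical k-means over document embeddings; to stay dependency-free, this toy version recursively splits embeddings at the median of one coordinate instead, with each split level contributing one digit of the docid (the embeddings are made up):

```python
# Sketch: assigning hierarchical semantic docids, in the spirit of DSI's
# hierarchical k-means, using recursive median splits as a simplified stand-in.

def assign_semantic_ids(embeddings, depth=2, dim=0):
    """embeddings: {key: vector}. Returns {key: 'a.b...' docid string}."""
    ids = {k: [] for k in embeddings}
    n_dims = len(next(iter(embeddings.values())))

    def split(keys, level, d):
        if level == depth or len(keys) <= 1:
            return
        vals = sorted(embeddings[k][d] for k in keys)
        median = vals[len(vals) // 2]
        left = [k for k in keys if embeddings[k][d] < median]
        right = [k for k in keys if embeddings[k][d] >= median]
        for cluster_id, group in enumerate((left, right)):
            for k in group:
                ids[k].append(str(cluster_id))  # one digit per hierarchy level
            split(group, level + 1, (d + 1) % n_dims)

    split(list(embeddings), 0, dim)
    return {k: ".".join(v) for k, v in ids.items()}

docids = assign_semantic_ids({
    "a": [0.1, 0.9], "b": [0.2, 0.1], "c": [0.8, 0.7], "d": [0.9, 0.2],
})
```

Because nearby embeddings share docid prefixes, the decoder can narrow down the relevant cluster digit by digit, which is what makes trie-constrained decoding over these IDs efficient.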

Finding

Text-based docids (like titles) generalize better to unseen documents in dynamic corpora because they align with the language model’s pre-training distribution.


5. Robustness & Recent Progress

Generative Retrieval is currently facing several research challenges categorized under “Robustness”:

5.1 Explainability

Mechanistic research shows the decoder passes through three stages:

  1. Priming: No query-specific info used.
  2. Bridging: Cross-attention transfers info from encoder to decoder.
  3. Interaction: MLPs process info to predict docids in final layers.

5.2 Accuracy & Relevance Alignment

A major issue is aligning token-level generation with document-level relevance.

  • Traditional beam search may prune relevant documents early, producing false negatives.
  • DRO (Direct Relevance Optimization): A proposal to optimize document relevance directly via pairwise ranking, eliminating the need for reinforcement learning.

5.3 Reliability (Autoregressive Limitations)

  • Autoregressive models can achieve perfect Top-1 precision but suffer in Top-k Recall due to local greedy pruning in beam search.
  • There is a lower bound on error related to the KL divergence between ground-truth and predicted step-wise marginal distributions.

5.4 Dynamic Corpora (Repeatability)

Most GenIR models are static. Handling additions/deletions/modifications is difficult:

  • Models trained with numeric IDs tend to stick to the IDs seen during training and struggle to generate identifiers for newly added documents.
  • Model Editing: parameter-editing techniques are being explored to integrate new documents without full re-training.

5.5 Safety & Machine Unlearning

How do we delete a document from a model’s parameters (e.g., for GDPR “right to be forgotten”)?

  • Requires specialized Machine Unlearning algorithms to remove training data traces without full retraining.

6. Summary: The Retrieval Evolution

| Stage | Mechanism | Space |
| --- | --- | --- |
| Sparse | Exact lexical overlap | Vocabulary terms |
| Dense | Semantic similarity | Latent embeddings |
| Generative | Conditional generation | Model parameters |

The GenIR Workflow

  1. Index: Feed “How to roast pumpkin seeds” → the model learns to output DocID_567.
  2. Query: User asks “Pumpkin seed storage” → the model decodes DocID_567 autoregressively using beam search.