Language Model for IR

Definition

The language-modeling (LM) approach to IR treats retrieval as a generative problem: estimate a probabilistic language model $M_d$ for each document $d$, then rank documents by how likely that model is to have produced the query $q$ (or by how close it is to a query model). It reframes the question from “is $d$ relevant to $q$?” to “could $d$'s model have generated $q$?”, which makes it a probabilistic alternative to the Vector Space Model and BM25.

The generative bridge (the urn)

Imagine each document is an urn filled with words in document-specific proportions. A user with a document in mind samples query words from its urn. Retrieval inverts this: find the urn most likely to have produced the observed query words.
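
The urn picture can be made concrete with a small sketch. It assumes whitespace tokenization, unsmoothed maximum-likelihood proportions, and two toy documents; the helper names `urn` and `query_probability` are illustrative and do not come from any library.

```python
from collections import Counter

def urn(text):
    """Maximum-likelihood unigram model: word -> proportion in the document."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

def query_probability(query, model):
    """Probability of drawing the query words, with replacement, from one urn."""
    p = 1.0
    for word in query.lower().split():
        p *= model.get(word, 0.0)  # a word absent from the urn has probability 0
    return p

# Toy documents (assumed for illustration only).
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
}
urns = {name: urn(text) for name, text in docs.items()}

# Retrieval inverts the sampling story: which urn best explains the query?
scores = {name: query_probability("cat mat", model) for name, model in urns.items()}
best = max(scores, key=scores.get)
```

Here `d1` wins because it contains both query words, while `d2` scores exactly zero for lacking "mat". That zero is the motivation for smoothing.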

The Family of LM Approaches

  • Query Likelihood Model — the dominant instance: rank by $P(q \mid M_d)$, with Smoothing against the collection model. (The full formula, Jelinek–Mercer and Dirichlet smoothing, and the intuition live in that note.)
  • Document likelihood — rank by $P(d \mid M_q)$, generating the document from a query model (less common, because a model estimated from a handful of query terms is sparse).
  • Model comparison (KL divergence / Relevance Models) — build a query model $M_q$ and a document model $M_d$ and rank by their negative divergence $-\mathrm{KL}(M_q \,\|\, M_d)$, so that a smaller divergence means a higher rank. This generalizes query likelihood and underlies pseudo-relevance-feedback relevance models.
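
As a minimal sketch of the query-likelihood ranking in the first bullet, the code below scores three toy documents with Jelinek–Mercer smoothing. The function name `query_log_likelihood`, the toy collection, and the convention that `lam` is the weight on the collection model are assumptions made here; the Query Likelihood Model note may define the interpolation weight the other way round.

```python
import math
from collections import Counter

# Toy collection (assumed for illustration only).
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats make good pets",
}
tokenized = {name: text.split() for name, text in docs.items()}

# Collection model: term counts pooled over all documents.
collection = Counter(t for tokens in tokenized.values() for t in tokens)
collection_len = sum(collection.values())

def query_log_likelihood(query_tokens, doc_tokens, lam=0.5):
    """log P(q | M_d) under Jelinek-Mercer smoothing.

    lam is the weight on the collection model (an assumed convention).
    Every query term must occur somewhere in the collection, otherwise
    the smoothed probability is 0 and math.log raises ValueError.
    """
    doc_counts = Counter(doc_tokens)
    score = 0.0
    for t in query_tokens:
        p_doc = doc_counts[t] / len(doc_tokens)
        p_col = collection[t] / collection_len
        score += math.log((1 - lam) * p_doc + lam * p_col)
    return score

query = "cat mat".split()
scores = {name: query_log_likelihood(query, tokens) for name, tokens in tokenized.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Summing logs instead of multiplying probabilities avoids underflow on long queries and leaves the ranking unchanged.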

Unifying view

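$$-\mathrm{KL}(M_q \,\|\, M_d) \;\overset{\text{rank}}{=}\; \sum_{t\in V} P(t\mid M_q)\log P(t\mid M_d) \quad\xrightarrow{\;M_q=\text{empirical } q\;}\quad \sum_{t\in q} P(t\mid q)\log P(t\mid M_d)$$
where $V$ is the vocabulary, and setting the query model $M_q$ to the raw query terms recovers (a rank-equivalent of) the Query Likelihood Model.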
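
The rank equivalence can be spelled out in two lines. This is a sketch using notation introduced here for illustration: $V$ is the vocabulary, $c(t,q)$ is the count of term $t$ in the query, and $|q|$ is the query length.

$$\begin{aligned}
-\mathrm{KL}(M_q \,\|\, M_d) &= \sum_{t\in V} P(t\mid M_q)\log P(t\mid M_d) \;-\; \sum_{t\in V} P(t\mid M_q)\log P(t\mid M_q),\\
\sum_{t\in q} \frac{c(t,q)}{|q|}\,\log P(t\mid M_d) &= \frac{1}{|q|}\,\log P(q\mid M_d).
\end{aligned}$$

The second sum in the first line depends only on the query model, so it is the same for every document and can be dropped without changing the ranking. The second line substitutes the empirical query model $P(t\mid q) = c(t,q)/|q|$ into what remains. The result is the unigram query log-likelihood scaled by the constant $1/|q|$, which again leaves the ranking unchanged.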

Key Properties

  • Per-document models — one distribution per document in the collection.
  • Smoothing is essential — without it, one missing query term zeroes the product; Smoothing borrows mass from the collection model.
  • Probabilistic footing — a principled alternative to heuristic term-weighting; the generative framing connects naturally to modern generative systems such as Retrieval-Augmented Generation.
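
The smoothing property can be illustrated with a short sketch. It assumes Dirichlet smoothing with a small pseudo-count `mu = 10.0`, chosen only to suit this toy collection; the helpers `p_ml` and `p_dirichlet` are hypothetical names.

```python
from collections import Counter

# Toy collection (assumed for illustration only).
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
collection = Counter(t for tokens in docs.values() for t in tokens)
collection_len = sum(collection.values())

def p_ml(term, tokens):
    """Unsmoothed maximum-likelihood estimate P(t | d)."""
    return tokens.count(term) / len(tokens)

def p_dirichlet(term, tokens, mu=10.0):
    """Dirichlet-smoothed estimate: document counts plus mu pseudo-counts
    distributed according to the collection model."""
    p_col = collection[term] / collection_len
    return (tokens.count(term) + mu * p_col) / (len(tokens) + mu)

query = "cat mat".split()
d2 = docs["d2"]  # contains "cat" but not "mat"

unsmoothed = 1.0
smoothed = 1.0
for t in query:
    unsmoothed *= p_ml(t, d2)      # the missing "mat" zeroes the product
    smoothed *= p_dirichlet(t, d2)  # "mat" borrows mass from the collection

# The smoothed model is still a proper distribution over the vocabulary.
total_mass = sum(p_dirichlet(t, d2) for t in collection)
```

Without smoothing, `d2` is indistinguishable from a document that shares no terms with the query. With smoothing, it keeps a small nonzero score that still reflects its match on "cat".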

Connections

Appears In