Language Model for IR
Definition
The language-modeling (LM) approach to IR treats retrieval as a generative problem: estimate a probabilistic language model $M_d$ for each document $d$, then rank documents by how likely that model is to have produced the query $q$ (or by how close it is to a query model $M_q$). It reframes the question from "is $d$ relevant to $q$?" to "could $d$'s model have generated $q$?", which makes it a probabilistic alternative to the Vector Space Model and BM25.
The generative bridge (the urn)
Imagine each document is an urn filled with words in document-specific proportions. A user with a document in mind samples query words from its urn. Retrieval inverts this: find the urn most likely to have produced the observed query words.
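The urn picture can be sketched in a few lines of Python. This is a minimal illustration, not a full retrieval model: the toy documents, the helper names `urn` and `query_probability`, and the whitespace tokenization are all assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter
from math import prod

def urn(text):
    """Unsmoothed unigram model: word -> share of the document's tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

def query_probability(query, model):
    """Probability of drawing the query words independently from the urn."""
    return prod(model.get(word, 0.0) for word in query.lower().split())

# Toy documents, invented for illustration.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "stock markets fell sharply",
}
urns = {name: urn(text) for name, text in docs.items()}

# Retrieval inverts the sampling story: score every urn against the query.
scores = {name: query_probability("the cat", model) for name, model in urns.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Here `d2` wins over `d1` because it is shorter, so "the" and "cat" make up a larger share of its urn.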
The Family of LM Approaches
- Query Likelihood Model: the dominant instance. Rank by $P(q \mid M_d)$, with Smoothing against the collection model. (Full formula, Jelinek–Mercer and Dirichlet smoothing, and intuition live in that note.)
- Document likelihood: rank by $P(d \mid M_q)$, generating the document from a query model (less common, because queries are short and query models are therefore sparse).
- Model comparison (KL divergence / Relevance Models): build a query model $M_q$ and a document model $M_d$ and rank by their divergence $\mathrm{KL}(M_q \,\|\, M_d)$, with smaller divergence ranking higher. This generalizes query likelihood and underlies pseudo-relevance-feedback relevance models.
Unifying view
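$$-\mathrm{KL}(M_q \,\|\, M_d) \;\stackrel{\text{rank}}{=}\; \sum_{t\in V} P(t\mid M_q)\log P(t\mid M_d) \quad\xrightarrow{\;M_q=\text{empirical } q\;}\quad \sum_{t\in q} P(t\mid q)\log P(t\mid M_d)$$ where $V$ is the vocabulary, the entropy of $M_q$ is dropped because it does not depend on $d$, and setting the query model $M_q$ to the raw query terms recovers (a rank-equivalent of) the [[Query Likelihood Model]].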
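The rank equivalence can be checked numerically: with an empirical query model, the sum of $P(t\mid q)\log P(t\mid M_d)$ over query terms equals the log query likelihood divided by the query length, so both produce the same ordering. The sketch below assumes toy documents, Jelinek–Mercer interpolation with an arbitrary weight `lam=0.5`, and hypothetical helper names; it is an illustration rather than a reference implementation.

```python
from collections import Counter
from math import log

# Toy collection, invented for illustration.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "the cat and the other cat".split(),
}
collection = [t for tokens in docs.values() for t in tokens]
coll_counts, coll_len = Counter(collection), len(collection)

def doc_model(tokens, lam=0.5):
    """Return P(t | M_d), interpolating document and collection estimates."""
    counts = Counter(tokens)
    def p(t):
        return (1 - lam) * counts[t] / len(tokens) + lam * coll_counts[t] / coll_len
    return p

def log_query_likelihood(query, p):
    """log P(q | M_d): one log-probability per query token."""
    return sum(log(p(t)) for t in query)

def neg_cross_entropy(query, p):
    """Sum over distinct query terms of P(t | q) * log P(t | M_d)."""
    q_counts = Counter(query)
    return sum(c / len(query) * log(p(t)) for t, c in q_counts.items())

query = "cat cat mat".split()
models = {name: doc_model(tokens) for name, tokens in docs.items()}
ql = {name: log_query_likelihood(query, p) for name, p in models.items()}
ce = {name: neg_cross_entropy(query, p) for name, p in models.items()}

rank_ql = sorted(docs, key=ql.get, reverse=True)
rank_ce = sorted(docs, key=ce.get, reverse=True)
print(rank_ql, rank_ce)
```

The two scores differ only by the constant factor `len(query)`, which is why the rankings coincide.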
Key Properties
- Per-document models — one distribution per document in the collection.
- Smoothing is essential — without it, one missing query term zeroes the product; Smoothing borrows mass from the collection model.
- Probabilistic footing: a principled alternative to heuristic term-weighting that connects naturally to modern generative pipelines such as Retrieval-Augmented Generation.
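The need for smoothing is easy to see on a toy example. The sketch below compares an unsmoothed estimate with Dirichlet smoothing, $P(t\mid M_d) = \frac{c(t,d) + \mu P(t\mid C)}{|d| + \mu}$; the documents, the function names, and the value `mu=10.0` (chosen small to suit the tiny collection) are assumptions for illustration only.

```python
from collections import Counter
from math import prod

# Toy collection, invented for illustration.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
collection = [t for tokens in docs.values() for t in tokens]
coll_counts, coll_len = Counter(collection), len(collection)

def unsmoothed(query, tokens):
    """Maximum-likelihood estimate: a single unseen term zeroes the product."""
    counts = Counter(tokens)
    return prod(counts[t] / len(tokens) for t in query)

def dirichlet(query, tokens, mu=10.0):
    """Dirichlet smoothing: add mu pseudo-counts spread by the collection model."""
    counts = Counter(tokens)
    return prod(
        (counts[t] + mu * coll_counts[t] / coll_len) / (len(tokens) + mu)
        for t in query
    )

query = "cat mat".split()
raw = {name: unsmoothed(query, tokens) for name, tokens in docs.items()}
smooth = {name: dirichlet(query, tokens) for name, tokens in docs.items()}
print(raw, smooth)
```

Without smoothing, `d2` scores exactly zero because it never mentions "mat"; with smoothing it receives a small non-zero score while `d1`, which contains both terms, still ranks first.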
Connections
- Primary instance: Query Likelihood Model
- Alternative to: BM25, Vector Space Model, Binary Independence Model
- Uses: Smoothing
- Modern extension: Retrieval-Augmented Generation