Language Model for IR
The Language Modeling (LM) approach to IR views retrieval as a generative process. Instead of calculating the probability of a document being relevant to a query, we estimate the probability that the query was generated by random sampling from the language model of a specific document.
Query Likelihood Model
The most common LM approach, the query likelihood model, ranks documents $D$ by the probability $P(Q \mid D)$ of generating the query $Q = q_1, \dots, q_n$:

$$P(Q \mid D) = \prod_{i=1}^{n} P(q_i \mid D)$$
To avoid zero probabilities for terms not in the document, smoothing is required (usually Jelinek-Mercer or Dirichlet). With Jelinek-Mercer smoothing:

$$P(t \mid D) = (1 - \lambda)\, P_{ML}(t \mid D) + \lambda\, P(t \mid C)$$
where:
- $P_{ML}(t \mid D)$ is the Maximum Likelihood Estimate of term $t$ in document $D$.
- $P(t \mid C)$ is the probability of the term in the entire collection (background model).
- $\lambda$ is the smoothing parameter.
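The smoothed query-likelihood scoring can be sketched as follows. This is a minimal illustration over a hypothetical two-document toy collection; the names `docs`, `score`, and the choice $\lambda = 0.5$ are assumptions, not from the original note.

```python
from collections import Counter
from math import log

# Hypothetical toy collection (illustrative only).
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}

# Background (collection) model: P(t | C)
collection = Counter()
for words in docs.values():
    collection.update(words)
total = sum(collection.values())

def score(query, doc_words, lam=0.5):
    """Log query likelihood with Jelinek-Mercer smoothing:
    P(t|D) = (1 - lam) * P_ML(t|D) + lam * P(t|C)."""
    tf = Counter(doc_words)
    dlen = len(doc_words)
    s = 0.0
    for t in query:
        p_ml = tf[t] / dlen          # Maximum Likelihood Estimate in D
        p_c = collection[t] / total  # background model
        s += log((1 - lam) * p_ml + lam * p_c)
    return s

query = "cat mat".split()
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
```

Log probabilities are summed rather than multiplied to avoid floating-point underflow on long queries; the ranking is unchanged because $\log$ is monotonic.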
The Generative Bridge
Imagine every document is an urn filled with words. When a user writes a query, they are essentially sampling words from an urn they have in mind. Retrieval is the process of figuring out which “urn” (document) was most likely to have produced those specific query words.
Key Components
- Document LMs: A separate probability distribution for every document in the collection.
- Smoothing: Critical because if a single query term is missing from a document, the product becomes zero. Smoothing “borrows” probability mass from the general collection.
- Generative Model of Relevance: Shifts the focus from “Is $D$ relevant to $Q$?” to “Could $D$ have $Q$ as a summary/description?”
Connections
- Alternative to: BM25, Vector Space Model, Binary Independence Model
- Extension: Retrieval-Augmented Generation (uses modern LMs for the generation part)
- Foundations: Probabilistic Information Retrieval.