Language Model for IR

The Language Modeling (LM) approach to IR views retrieval as a generative process. Instead of directly estimating the probability that a document is relevant to a query, we estimate the probability that the query was generated by sampling terms from the language model of a specific document.

Query Likelihood Model

The most common LM approach ranks documents by the probability $P(Q \mid D)$, the likelihood that query $Q$ was generated by the document's language model $M_D$:

$$P(Q \mid D) = \prod_{t \in Q} P(t \mid M_D)$$

To avoid zero probabilities for terms not in the document, smoothing is required (usually Jelinek-Mercer or Dirichlet). With Jelinek-Mercer smoothing:

$$P(t \mid D) = (1 - \lambda)\, P_{ml}(t \mid M_D) + \lambda\, P(t \mid C)$$

where:

  • $P_{ml}(t \mid M_D)$ is the Maximum Likelihood Estimate of term $t$ in document $D$.
  • $P(t \mid C)$ is the probability of the term in the entire collection (background model).
  • $\lambda$ is the smoothing parameter.
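The scoring above can be sketched in a few lines of Python. This is a minimal, illustrative implementation (all names are my own, not from any IR library); it works in log space so the product over query terms does not underflow:

```python
import math
from collections import Counter

def jm_score(query_terms, doc_terms, collection_terms, lam=0.5):
    """Log query likelihood of a document under Jelinek-Mercer smoothing:
    score(Q, D) = sum over t in Q of log((1 - lam) * P_ml(t|D) + lam * P(t|C)).
    """
    doc_tf = Counter(doc_terms)
    coll_tf = Counter(collection_terms)
    score = 0.0
    for t in query_terms:
        p_ml = doc_tf[t] / len(doc_terms)          # MLE within the document
        p_c = coll_tf[t] / len(collection_terms)   # background collection model
        score += math.log((1 - lam) * p_ml + lam * p_c)
    return score
```

Ranking then amounts to computing `jm_score` for every document and sorting in descending order; a document that actually contains the query terms gets a higher (less negative) log likelihood than one that relies entirely on the background model.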

The Generative Bridge

Imagine every document is an urn filled with words. When a user writes a query, they are essentially sampling words from an urn they have in mind. Retrieval is the process of figuring out which “urn” (document) was most likely to have produced those specific query words.
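The urn picture can be made concrete with a toy sampling step. This is purely illustrative (the words and the seed are made up); repeated words in the "urn" are drawn with proportionally higher probability:

```python
import random

random.seed(42)  # reproducible draws

# A toy document as an urn: each word is a ball; repeats raise its probability.
urn = ["language", "model", "language", "retrieval", "query"]

# A user "generates" a query by drawing words from the urn they have in mind.
query = random.choices(urn, k=2)
```

Retrieval inverts this process: given the drawn words, score each candidate urn by how likely it was to have produced them.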

Key Components

  • Document LMs: A separate probability distribution for every document in the collection.
  • Smoothing: Critical because if a single query term is missing from a document, the product becomes zero. Smoothing “borrows” probability mass from the general collection.
  • Generative Model of Relevance: Shifts the focus from "Is $D$ relevant to $Q$?" to "Could $D$ have $Q$ as a summary/description?".
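The smoothing point above is easy to demonstrate numerically. In this sketch (toy data, illustrative names), a single query term missing from the document zeroes the unsmoothed product, while Jelinek-Mercer smoothing keeps the score positive by borrowing mass from the collection:

```python
from collections import Counter

doc = ["language", "model", "retrieval"]
collection = ["language", "model", "retrieval", "ranking", "query", "ranking"]
query = ["language", "ranking"]  # "ranking" does not occur in doc

tf, cf = Counter(doc), Counter(collection)

# Unsmoothed query likelihood: one missing term zeroes the whole product.
unsmoothed = 1.0
for t in query:
    unsmoothed *= tf[t] / len(doc)

# Jelinek-Mercer smoothing mixes in the background collection model.
lam = 0.5
smoothed = 1.0
for t in query:
    smoothed *= (1 - lam) * tf[t] / len(doc) + lam * cf[t] / len(collection)
```

Here `unsmoothed` is exactly `0.0`, while `smoothed` stays strictly positive, so the document can still be ranked.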

Connections

Appears In