Smoothing

Smoothing is a technique used in language models for information retrieval to adjust probability estimates. Its primary goal is to prevent zero-probability estimates for terms that do not appear in a specific document but are present in the general collection, while also accounting for the document’s content.

Why Smooth?

Without smoothing, if a document is missing even one term from a multi-word query, the language model would assign the whole query a probability of zero (P(q|d) = 0), since the query likelihood is a product of per-term probabilities. Smoothing "steals" a small amount of probability mass from seen terms and redistributes it to unseen terms using the background collection model.
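A minimal sketch of the problem (function and variable names are illustrative, not from any particular library): with maximum-likelihood estimates, one unseen query term zeroes out the entire query likelihood.

```python
from collections import Counter
from math import prod

def unsmoothed_query_likelihood(query_tokens, doc_tokens):
    """P(q|d) = product of P_ml(w|d); zero if any query term is absent."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return prod(counts[w] / n for w in query_tokens)

doc = "the cat sat on the mat".split()
print(unsmoothed_query_likelihood(["cat"], doc))         # nonzero
print(unsmoothed_query_likelihood(["cat", "dog"], doc))  # 0.0: "dog" is unseen
```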

Common Smoothing Methods

1. Jelinek-Mercer (Linear Interpolation)

Mixes the document model with the collection model using a fixed weight λ:

P(w|d) = (1 − λ) · P_ml(w|d) + λ · P(w|C)

  • λ: Higher values (e.g., λ = 0.7) favor the collection (better for long queries), lower values (e.g., λ = 0.1) favor the document (better for short queries).
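
The interpolation above can be sketched as follows (function and argument names are my own, not from a specific toolkit):

```python
from collections import Counter

def jelinek_mercer(term, doc_tokens, collection_tokens, lam=0.5):
    """P(w|d) = (1 - lam) * P_ml(w|d) + lam * P(w|C)."""
    p_doc = Counter(doc_tokens)[term] / len(doc_tokens)
    p_coll = Counter(collection_tokens)[term] / len(collection_tokens)
    return (1 - lam) * p_doc + lam * p_coll

doc = "a b a".split()
coll = "a a b c".split()
# "c" never appears in the document, yet gets nonzero probability
print(jelinek_mercer("c", doc, coll, lam=0.5))  # 0.125
```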

2. Dirichlet Prior

Uses a pseudo-count μ from the collection:

P(w|d) = (c(w; d) + μ · P(w|C)) / (|d| + μ)

  • Intuition: As document length |d| increases, the fixed pseudo-count μ matters less, so the influence of the prior diminishes. This provides stronger length normalization than Jelinek-Mercer.
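
A sketch of the Dirichlet estimate under the same hypothetical naming conventions as above:

```python
from collections import Counter

def dirichlet_prior(term, doc_tokens, collection_tokens, mu=2000):
    """P(w|d) = (c(w;d) + mu * P(w|C)) / (|d| + mu)."""
    c = Counter(doc_tokens)[term]
    p_coll = Counter(collection_tokens)[term] / len(collection_tokens)
    return (c + mu * p_coll) / (len(doc_tokens) + mu)

doc = "a b a".split()
coll = "a b c d".split()
# Small mu here for readability; 1000-2000 is a common default in IR.
print(dirichlet_prior("a", doc, coll, mu=2))  # 0.5
print(dirichlet_prior("c", doc, coll, mu=2))  # 0.1 (unseen term, nonzero)
```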

3. Absolute Discounting

Subtracts a constant δ from the counts of seen terms and redistributes that mass via the collection model:

P(w|d) = max(c(w; d) − δ, 0) / |d| + (δ · |d|_u / |d|) · P(w|C)

where |d|_u is the number of unique terms in the document.
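
A sketch of the discounting formula above (names are illustrative); note that the discounted mass exactly funds the collection term, so the distribution still sums to 1 over the vocabulary:

```python
from collections import Counter

def absolute_discount(term, doc_tokens, collection_tokens, delta=0.7):
    """P(w|d) = max(c(w;d) - delta, 0)/|d| + (delta * |d|_u / |d|) * P(w|C)."""
    doc = Counter(doc_tokens)
    d_len = len(doc_tokens)
    d_unique = len(doc)  # |d|_u: number of unique terms in the document
    p_coll = Counter(collection_tokens)[term] / len(collection_tokens)
    return max(doc[term] - delta, 0) / d_len + (delta * d_unique / d_len) * p_coll

doc = "a b a".split()
coll = "a b c d".split()
vocab = set(coll)
# Probabilities over the vocabulary sum to 1
print(sum(absolute_discount(w, doc, coll, delta=0.5) for w in vocab))
```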
