Smoothing
Smoothing is a technique used in language models for information retrieval to adjust probability estimates. Its primary goal is to prevent zero-probability estimates for terms that do not appear in a specific document but are present in the general collection, while also accounting for the document’s content.
Why Smooth?
Without smoothing, if a document is missing even one term from a multi-word query, the language model assigns the whole query a probability of 0, since $P(q \mid d) = \prod_{w \in q} P_{ml}(w \mid d) = 0$ whenever any factor is 0. Smoothing “steals” a small amount of probability mass from seen terms and redistributes it to unseen terms using the background collection model.
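The zero-probability problem is easy to demonstrate. A minimal sketch with a hypothetical toy document and query (all data here is made up for illustration):

```python
from collections import Counter

# Hypothetical toy document and query (illustration only).
doc = "the cat sat on the mat".split()
query = ["cat", "dog"]  # "dog" does not occur in the document

counts = Counter(doc)
doc_len = len(doc)

# Unsmoothed maximum-likelihood query likelihood:
# a single missing term zeroes out the entire product.
score = 1.0
for term in query:
    score *= counts[term] / doc_len  # Counter returns 0 for unseen terms
print(score)  # 0.0, because P("dog" | d) = 0
```

The document would score 0 no matter how well it matches the rest of the query, which is exactly what smoothing prevents.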
Common Smoothing Methods
1. Jelinek-Mercer (Linear Interpolation)
Mixes the document model with the collection model using a fixed weight $\lambda \in [0, 1]$:

$P(w \mid d) = (1 - \lambda)\, P_{ml}(w \mid d) + \lambda\, P(w \mid C)$
- $\lambda$: Higher values (e.g., 0.7) favor the collection (better for long queries); lower values (e.g., 0.1) favor the document (better for short queries).
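The interpolation above can be sketched as follows, using a hypothetical two-document toy collection (function and variable names are illustrative, not from any library):

```python
from collections import Counter

def jelinek_mercer(term, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    """P(w|d) = (1 - lam) * P_ml(w|d) + lam * P(w|C).  (sketch)"""
    p_doc = doc_counts[term] / doc_len
    p_coll = coll_counts[term] / coll_len
    return (1 - lam) * p_doc + lam * p_coll

# Hypothetical two-document toy collection (illustration only).
docs = ["the cat sat on the mat".split(), "the dog barked".split()]
coll_counts = Counter(w for d in docs for w in d)
coll_len = sum(coll_counts.values())
dc, dl = Counter(docs[0]), len(docs[0])

# "dog" is unseen in docs[0] but still gets nonzero probability
# because the collection model contributes lam * P("dog" | C).
print(jelinek_mercer("dog", dc, dl, coll_counts, coll_len, lam=0.5))
```

With `lam=0.0` the score collapses back to the unsmoothed maximum-likelihood estimate; with `lam=1.0` every document scores identically.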
2. Dirichlet Prior
Uses a pseudo-count $\mu$ from the collection:

$P(w \mid d) = \dfrac{c(w; d) + \mu\, P(w \mid C)}{|d| + \mu}$

where $c(w; d)$ is the count of $w$ in $d$ and $|d|$ is the document length.
- Intuition: As document length increases, the influence of the prior diminishes. It provides stronger length normalization than Jelinek-Mercer.
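The length-normalization intuition can be sketched as follows, again on hypothetical toy data (a small $\mu$ is used so the effect is visible at this scale; values around 2000 are commonly reported for real collections):

```python
from collections import Counter

def dirichlet(term, doc_counts, doc_len, coll_counts, coll_len, mu=2000):
    """P(w|d) = (c(w;d) + mu * P(w|C)) / (|d| + mu).  (sketch)"""
    p_coll = coll_counts[term] / coll_len
    return (doc_counts[term] + mu * p_coll) / (doc_len + mu)

# Hypothetical two-document toy collection (illustration only).
docs = ["the cat sat on the mat".split(), "the dog barked".split()]
coll_counts = Counter(w for d in docs for w in d)
coll_len = sum(coll_counts.values())
dc, dl = Counter(docs[0]), len(docs[0])

# The effective weight on the collection model is mu / (|d| + mu):
# the longer the document, the less the prior matters.
print(dirichlet("dog", dc, dl, coll_counts, coll_len, mu=10))
```

Unlike Jelinek-Mercer, the amount of smoothing here adapts automatically to document length rather than being fixed by a global weight.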
3. Absolute Discounting
Subtracts a constant $\delta$ from seen term counts:

$P(w \mid d) = \dfrac{\max(c(w; d) - \delta,\, 0)}{|d|} + \dfrac{\delta\, |d|_u}{|d|}\, P(w \mid C)$

where $|d|_u$ is the number of unique terms in the document.
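A minimal sketch of discounting on the same kind of hypothetical toy data (the discounted mass $\delta |d|_u / |d|$ is exactly what gets redistributed to the collection model, so the result remains a probability distribution):

```python
from collections import Counter

def absolute_discount(term, doc_counts, doc_len, coll_counts, coll_len, delta=0.7):
    """P(w|d) = max(c(w;d) - delta, 0)/|d| + (delta * |d|_u / |d|) * P(w|C).  (sketch)"""
    n_unique = len(doc_counts)  # |d|_u: unique terms in the document
    p_coll = coll_counts[term] / coll_len
    return (max(doc_counts[term] - delta, 0) / doc_len
            + delta * n_unique / doc_len * p_coll)

# Hypothetical two-document toy collection (illustration only).
docs = ["the cat sat on the mat".split(), "the dog barked".split()]
coll_counts = Counter(w for d in docs for w in d)
coll_len = sum(coll_counts.values())
dc, dl = Counter(docs[0]), len(docs[0])

# Summing over the whole vocabulary shows the probabilities still sum to 1:
# the mass subtracted from seen terms equals the mass given to the prior.
print(sum(absolute_discount(w, dc, dl, coll_counts, coll_len) for w in coll_counts))
```

Note $\delta$ should not exceed the smallest seen count (the `max(..., 0)` guards against that), otherwise seen terms would be over-discounted.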
Connections
- Foundation: Query Likelihood Model
- Part of: Language Models for IR
- Relates to: TF-IDF (smoothing behaves similarly to IDF by downweighting common terms)