BM25
BM25 (Best Matching 25)
BM25 is a probabilistic ranking function derived from the Binary Independence Model. It is the most widely used unsupervised retrieval model and the standard baseline in information retrieval.
BM25 Scoring Function
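$$
\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}
$$

where:
- $f(t, d)$: term frequency of term $t$ in document $d$
- $|d|$: length of document $d$ (in words)
- $\text{avgdl}$: average document length in the collection
- $k_1$: term frequency saturation parameter (typically 1.2-2.0)
- $b$: length normalization parameter (typically 0.75)
- $\text{IDF}(t)$: inverse document frequency of term $t$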
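
The scoring function and parameters listed above could be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `bm25_scores`, the whitespace tokenization, and the toy documents are all hypothetical, and the IDF uses a common smoothed, non-negative variant, which is an assumption because the note does not say which IDF formula is intended.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization factor: 1 when the document has average length.
        length_norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for term in query:
            f = tf[term]
            if f == 0:
                continue  # terms absent from the document contribute nothing
            # Assumed IDF variant (smoothed so it never goes negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


# Toy collection, tokenized by whitespace purely for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the garden".split(),
    "quantum computing with trapped ions".split(),
]
scores = bm25_scores(["cat", "mat"], docs)
```

With this toy collection, the first document matches both query terms and should rank highest, the second matches only `cat`, and the third matches nothing and scores zero.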
What Each Part Does
- IDF: Rare terms are more informative → higher weight
- TF saturation: First occurrence matters most; additional occurrences have diminishing returns (controlled by $k_1$). As $k_1 \to \infty$, the TF component approaches raw TF; as $k_1 \to 0$, it reduces to binary presence/absence.
- Length normalization: Longer documents naturally have higher TF. Parameter $b$ controls how much to penalize length. $b = 1$: full normalization; $b = 0$: no normalization.
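
The limiting behaviour described in the list above can be checked with a short derivation. It assumes the standard form of the TF component with a $(k_1 + 1)$ numerator, and uses the shorthand $f = f(t, d)$ and $K = 1 - b + b \cdot |d| / \text{avgdl}$, which is introduced here only for brevity:

$$
\lim_{k_1 \to 0} \frac{f \cdot (k_1 + 1)}{f + k_1 K} = \frac{f}{f} = 1 \quad (f > 0)
$$

$$
\lim_{k_1 \to \infty} \frac{f \cdot (k_1 + 1)}{f + k_1 K} = \frac{f}{K}
$$

$$
\lim_{f \to \infty} \frac{f \cdot (k_1 + 1)}{f + k_1 K} = k_1 + 1
$$

So a very small $k_1$ gives a binary signal, a very large $k_1$ gives raw TF divided by the length factor $K$, and for any fixed $k_1$ the contribution of a single term is bounded by $k_1 + 1$. For the length factor, $b = 0$ gives $K = 1$ (length ignored) and $b = 1$ gives $K = |d| / \text{avgdl}$ (TF scaled by relative document length).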
Parameter Effects
| Parameter | Low | High |
|---|---|---|
| $k_1$ | Binary (term present/absent) | Raw term frequency |
| $b$ | No length normalization | Full length normalization |
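
The extremes in the table can be illustrated numerically. The sketch below isolates the TF part of the score for a single term; the helper name `tf_component` and the chosen document lengths are hypothetical, and the formula assumes the standard BM25 form with a `(k1 + 1)` numerator.

```python
def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75):
    """TF part of BM25 for one term: saturating in f, normalized by length."""
    length_norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * length_norm)


counts = (1, 5, 50)

# Low k1: every positive count gets the same weight (binary behaviour).
binary = [tf_component(f, 100, 100, k1=0.0) for f in counts]

# High k1: the weight grows almost linearly with the raw count.
near_raw = [tf_component(f, 100, 100, k1=1e6) for f in counts]

# b = 0: document length is ignored.
short_b0 = tf_component(3, 50, 100, b=0.0)
long_b0 = tf_component(3, 200, 100, b=0.0)

# b = 1: the length ratio is applied in full, so short documents are
# boosted and long documents are penalized for the same count.
short_b1 = tf_component(3, 50, 100, b=1.0)
long_b1 = tf_component(3, 200, 100, b=1.0)
```

Intermediate settings such as the typical `k1=1.2` and `b=0.75` fall between these extremes, which is why they are usually treated as tunable defaults rather than fixed constants.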
Connections
- Derived from: Binary Independence Model, probabilistic retrieval
- Compared with: TF-IDF, Query Likelihood Model
- Neural extensions: BM25 provides first-stage retrieval for Neural Reranking and Multi-Stage Ranking
- Sparse baselines: Dense Retrieval and Learned Sparse Retrieval are often benchmarked against BM25