BM25

BM25 (Best Matching 25)

BM25 is a probabilistic ranking function derived from the binary independence model. It is among the most widely used unsupervised retrieval models and a standard baseline in information retrieval.

BM25 Scoring Function

$BM25(q, d) = \sum_{t \in q} IDF(t) \cdot \frac{f(t, d) \cdot (k_{1} + 1)}{f(t, d) + k_{1} \cdot \left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}$

where:

$f (t, d)$ — term frequency of term $t$ in document $d$

$|d|$ — length of document $d$ (in words)

$avgdl$ — average document length in the collection

$k_{1}$ — term frequency saturation parameter (typically 1.2-2.0)

$b$ — length normalization parameter (typically 0.75)

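$IDF(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5}$ — inverse document frequency, where $N$ is the number of documents in the collection and $n(t)$ is the number of documents containing $t$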
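
The scoring function can be sketched directly from these definitions. The following is a minimal illustration, not a reference implementation: the function name `bm25_scores`, the whitespace tokenization, and the toy documents are assumptions made for this example, and the defaults $k_{1} = 1.2$, $b = 0.75$ are taken from the typical ranges listed above.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; it follows the formula above term by term.
    """
    N = len(docs)                              # number of documents
    avgdl = sum(len(d) for d in docs) / N      # average document length

    # n(t): number of documents that contain term t at least once
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)                        # f(t, d) for every term in d
        norm = 1 - b + b * len(d) / avgdl      # length-normalization factor
        score = 0.0
        for t in query:
            f = tf[t]
            if f == 0:
                continue                       # absent terms contribute nothing
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append(score)
    return scores


# Toy collection (assumed for this example), tokenized by whitespace
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
    "the stock market fell today".split(),
    "rain is expected this weekend".split(),
]
scores = bm25_scores("cat mat".split(), docs)
for doc, s in zip(docs, scores):
    print(f"{s:.3f}  {' '.join(doc)}")
```

The first document matches both query terms and ranks highest; the second matches only "cat", which appears in two of the five documents and therefore carries a lower IDF than "mat". Note that with this IDF formula a term occurring in more than half of the documents receives a negative weight, which is why some implementations use an adjusted variant.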

What Each Part Does

IDF: Rare terms are more informative → higher weight

TF saturation: First occurrence matters most; additional occurrences have diminishing returns (controlled by $k_{1}$). As $k_{1} \to \infty$, the term weight approaches raw (length-normalized) TF; as $k_{1} \to 0$, it reduces to binary presence/absence.

Length normalization: Longer documents tend to have higher term frequencies simply because they contain more words. Parameter $b$ controls how strongly length is penalized. $b = 1$: full normalization; $b = 0$: no normalization.
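
The limiting behavior can be checked directly from the scoring function. As a shorthand introduced here (not part of the standard notation), let $B = 1 - b + b \cdot \frac{|d|}{avgdl}$ denote the length-normalization factor, and write $f$ for $f(t, d) > 0$:

$\lim_{k_{1} \to 0} \frac{f \cdot (k_{1} + 1)}{f + k_{1} \cdot B} = \frac{f}{f} = 1$ (binary presence)

$\lim_{k_{1} \to \infty} \frac{f \cdot (k_{1} + 1)}{f + k_{1} \cdot B} = \frac{f}{B}$ (TF scaled only by length)

$\lim_{f \to \infty} \frac{f \cdot (k_{1} + 1)}{f + k_{1} \cdot B} = k_{1} + 1$ (saturation ceiling for fixed $k_{1}$)

For $b = 0$ the factor is $B = 1$ regardless of document length, and for $b = 1$ it is $B = \frac{|d|}{avgdl}$, so a document twice the average length has its effective $k_{1}$ doubled in the denominator.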

Parameter Effects

Parameter	Low	High
$k_{1}$	Binary (term present/absent)	Raw term frequency
$b$	No length normalization	Full length normalization
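
The table can be checked numerically. The sketch below evaluates only the TF part of the formula (IDF omitted); the helper name `tf_component` and the specific frequencies and document lengths are assumptions chosen for illustration.

```python
def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75):
    """TF part of the BM25 term score, without the IDF factor (hypothetical helper)."""
    norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * norm)


# Effect of k1, using a document of average length so the normalization factor is 1
low_k1 = [tf_component(f, 100, 100, k1=0.001) for f in (1, 5, 20)]
high_k1 = [tf_component(f, 100, 100, k1=1000.0) for f in (1, 5, 20)]

# Effect of b: same term frequency in a short (50 words) and a long (400 words) document
b0_short = tf_component(3, 50, 100, b=0.0)
b0_long = tf_component(3, 400, 100, b=0.0)
b1_short = tf_component(3, 50, 100, b=1.0)
b1_long = tf_component(3, 400, 100, b=1.0)

print("k1 near 0:  ", [round(x, 3) for x in low_k1])    # nearly constant
print("k1 large:   ", [round(x, 3) for x in high_k1])   # grows roughly with f
print("b = 0:      ", round(b0_short, 3), round(b0_long, 3))
print("b = 1:      ", round(b1_short, 3), round(b1_long, 3))
```

With $k_{1}$ near zero every positive term frequency yields almost the same value, while a very large $k_{1}$ gives values roughly proportional to the frequency. With $b = 0$ the short and long documents score identically; with $b = 1$ the long document is penalized.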

Connections

Derived from: Binary Independence Model, probabilistic retrieval
Compared with: TF-IDF, Query Likelihood Model
Neural extensions: provides first-stage retrieval for Neural Reranking, Multi-Stage Ranking
Sparse baselines: Dense Retrieval and Learned Sparse Retrieval are often benchmarked against BM25
