Binary Independence Model (BIM)
The Binary Independence Model is a classic probabilistic model for IR. It makes two fundamental assumptions:
- Binary: Documents and queries are represented as binary incidence vectors (a term is either present or absent).
- Independence: The presence of one term is independent of the presence of any other term, given the relevance or non-relevance of the document.
Retrieval Status Value (RSV)
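The BIM ranks documents using a score that is rank-equivalent to the log-odds of relevance. It is a sum over the query terms $t$ that occur in document $d$:

$$\mathrm{RSV}_d = \sum_{t \in q \cap d} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}$$

where:
- $p_t$: probability that term $t$ is present in a relevant document.
- $u_t$: probability that term $t$ is present in a non-relevant document.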
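As a minimal sketch of this scoring rule, the following Python sums, over the query terms that also occur in the document, the log odds ratio built from the two probabilities defined above. The function names `term_weight` and `rsv`, the dictionaries `p` and `u`, and all probability values are hypothetical and chosen only for illustration. In practice these probabilities would have to be estimated.

```python
import math

def term_weight(p_t: float, u_t: float) -> float:
    """Log odds ratio for one term: log[p_t (1 - u_t) / (u_t (1 - p_t))]."""
    return math.log(p_t * (1.0 - u_t) / (u_t * (1.0 - p_t)))

def rsv(query_terms, doc_terms, p, u):
    """Sum term weights over query terms that also occur in the document."""
    matching = set(query_terms) & set(doc_terms)
    return sum(term_weight(p[t], u[t]) for t in matching)

# Hypothetical estimates, for illustration only.
p = {"probabilistic": 0.8, "retrieval": 0.6}  # P(term present | relevant)
u = {"probabilistic": 0.1, "retrieval": 0.3}  # P(term present | non-relevant)

query = ["probabilistic", "retrieval"]
doc_a = ["probabilistic", "retrieval", "model"]  # matches both query terms
doc_b = ["retrieval", "evaluation"]              # matches one query term

score_a = rsv(query, doc_a, p, u)
score_b = rsv(query, doc_b, p, u)
print(score_a, score_b)
```

Under these assumed values, `doc_a` scores higher than `doc_b` because it matches an additional term whose weight is positive.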
Counting Evidence
BIM treats terms as clues. If a term is very likely to appear in relevant (“good”) documents and very unlikely to appear in non-relevant (“bad”) ones, seeing that term in a document is strong evidence for relevance. Because the model assumes independence, the “weight of evidence” for every matching term can simply be added up to get the final score.
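A small worked example may make the "weight of evidence" idea concrete. Write $p$ for the probability that a term appears in a relevant document and $u$ for the probability that it appears in a non-relevant one, and take the weight of a term to be the natural log of its odds ratio. The three probability pairs below are made up for illustration and do not come from any collection.

$$
\begin{aligned}
w_1 &= \log\frac{0.8 \times (1 - 0.1)}{0.1 \times (1 - 0.8)} = \log 36 \approx 3.58 && (p = 0.8,\ u = 0.1)\\
w_2 &= \log\frac{0.3 \times (1 - 0.3)}{0.3 \times (1 - 0.3)} = \log 1 = 0 && (p = 0.3,\ u = 0.3)\\
w_3 &= \log\frac{0.1 \times (1 - 0.4)}{0.4 \times (1 - 0.1)} = \log\tfrac{1}{6} \approx -1.79 && (p = 0.1,\ u = 0.4)
\end{aligned}
$$

A document matching all three terms would score $w_1 + w_2 + w_3 = \log 6 \approx 1.79$. The first term is a strong clue for relevance, the second carries no information because it is equally likely in both kinds of document, and the third counts against relevance.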
Key Properties
- Simple but foundational: It ignores term frequency (TF) and document length, which makes it less effective than BM25 on its own.
- Basis for BM25: BM25 was created by extending BIM with TF and length normalization.
- Probability Ranking Principle: BIM is a direct implementation of the principle that a system should rank documents by their probability of relevance.
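The first property above can be illustrated with a short sketch: because the model only sees whether a term is present, a two-word document and a long document that repeats the same terms many times receive the same score. The function `bim_score` and the values in `weights` are hypothetical. The weights stand in for precomputed per-term log odds ratios.

```python
def bim_score(query_terms, doc_tokens, weights):
    """Score from binary presence only: repeated tokens are collapsed by set()."""
    present = set(doc_tokens)
    return sum(weights[t] for t in set(query_terms) if t in present)

# Hypothetical precomputed term weights, for illustration only.
weights = {"neural": 2.0, "ranking": 1.5}
query = ["neural", "ranking"]

short_doc = ["neural", "ranking"]
long_doc = ["neural"] * 10 + ["ranking"] * 5 + ["filler"] * 200

short_score = bim_score(query, short_doc, weights)
long_score = bim_score(query, long_doc, weights)
same = short_score == long_score
print(short_score, long_score, same)
```

Term counts and document length never enter the computation, which is the gap that BM25's TF and length normalization components are meant to fill.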
Connections
- Evolved into: BM25
- Contrast: Vector Space Model (geometric), Language Model for IR (generative)
- Assumptions: Conditional independence of terms given relevance (the same assumption Naive Bayes makes about features given the class).