Term Weighting
Term weighting is the process of assigning a numerical value to a term in a document or query to represent its importance for retrieval. In an effective weighting scheme, terms that are descriptive of the document's content receive high weights, while common or "noisy" terms receive low weights.
Classic TF-IDF Weighting
The most fundamental weighting scheme is TF-IDF:

$$w_{t,d} = \text{tf}_{t,d} \times \log \frac{N}{\text{df}_t}$$

where:
- $\text{tf}_{t,d}$ (Term Frequency): how many times term $t$ appears in document $d$.
- $\text{idf}_t$ (Inverse Document Frequency): $\log \frac{N}{\text{df}_t}$, where $N$ is the total number of docs and $\text{df}_t$ is the number of docs containing $t$.
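As a concrete sketch, TF-IDF can be computed directly from raw counts. The corpus, terms, and function name below are illustrative, not drawn from any specific library:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF weight of `term` in `doc`, given `corpus` (a list of token lists)."""
    tf = doc.count(term)                         # local signal: occurrences in this doc
    df = sum(1 for d in corpus if term in d)     # global signal: docs containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "cat"],
]
print(tf_idf("cat", corpus[2], corpus))  # repeated, moderately rare term: high weight
print(tf_idf("the", corpus[2], corpus))  # appears in every doc: idf = log(1) = 0
```

Note that "the" scores exactly zero here: its document frequency equals the corpus size, so the IDF factor wipes out even a high term frequency.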
The Local vs. Global Balance
Good weighting balances two signals:
- Local importance (TF): If a word appears many times in this document, it likely describes what this document is about.
- Global rarity (IDF): If a word appears in every document (like “the” or “is”), it’s useless for distinguishing between them. We want words that are specific to a few documents.
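To see why the global signal eliminates stopwords, plug a term that occurs in every document into the standard $\log(N/\text{df})$ form of IDF:

$$\text{idf}_{\text{“the”}} = \log\frac{N}{\text{df}_{\text{“the”}}} = \log\frac{N}{N} = \log 1 = 0$$

Whatever its term frequency, such a word contributes nothing to the final weight.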
Main Weighting Components
| Method | Intuition |
|---|---|
| TF (Term Frequency) | More occurrences = more relevance. |
| IDF (Inv. Doc Frequency) | Rare words are better signals than common ones. |
| Length Normalization | Prevents long documents from winning just by having more words. |
| Learned Weights | Modern neural methods (e.g., DeepCT) use contextual encoders such as BERT to predict term weights from context. |
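A minimal sketch of the length-normalization idea from the table above is to divide raw counts by document length, turning them into relative frequencies. This is illustrative only (real systems use more refined schemes, such as BM25's length penalty):

```python
def normalized_tf(term, doc):
    """Relative frequency: a long document no longer wins just by repeating terms."""
    return doc.count(term) / len(doc) if doc else 0.0

short = ["cat", "sat"]
long_doc = ["cat", "sat"] * 50 + ["filler"] * 100

# Raw counts favor the long document (50 occurrences of "cat" vs. 1),
# but relative frequency favors the short, more focused one:
print(normalized_tf("cat", short))     # 0.5
print(normalized_tf("cat", long_doc))  # 0.25
```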
Connections
- Core models: BM25 (standard modern weighting), TF-IDF (classic)
- Neural variants: DeepCT, DeepImpact, uniCOIL
- Usage: Fundamental step in building an Inverted Index.