TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) is a term weighting scheme that reflects how important a word is to a document in a collection. It combines two intuitions: terms that appear frequently in a document are important (TF), and terms that appear in few documents are more discriminative (IDF).

TF-IDF Weight

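$$w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$

where:

  • $\mathrm{tf}_{t,d}$: term frequency (count of term $t$ in document $d$, or log-scaled: $1 + \log \mathrm{tf}_{t,d}$)
  • $\mathrm{idf}_t = \log(N / \mathrm{df}_t)$: inverse document frequency ($N$ = total docs, $\mathrm{df}_t$ = docs containing $t$)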
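As an illustration, the weighting can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: it assumes raw counts for TF, the unsmoothed form log(N / df) with the natural logarithm for IDF, and documents that are already tokenized into lists of strings. The function name `tf_idf` and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes raw-count TF and unsmoothed IDF = ln(N / df).
    """
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append(
            {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        )
    return weights

# Toy corpus (hypothetical data, for illustration only)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "ran"],
]
w = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes everywhere, while "ran" occurs twice in a single document and receives the largest weight there.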

TF Variants

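| Variant | Formula | Behavior |
|---|---|---|
| Raw | $\mathrm{tf}_{t,d}$ | Linear with count |
| Log-scaled | $1 + \log \mathrm{tf}_{t,d}$ | Sublinear (diminishing returns) |
| Boolean | $1$ if $\mathrm{tf}_{t,d} > 0$, else $0$ | Presence/absence only |
| Augmented | $0.5 + 0.5 \cdot \mathrm{tf}_{t,d} / \max_{t'} \mathrm{tf}_{t',d}$ | Normalized by max TF |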
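The variants can be compared side by side with a short sketch. The function names are hypothetical, the logarithm is natural, and the augmented form assumes the commonly used 0.5 smoothing constant; other sources may use a different constant or base.

```python
import math

def raw_tf(count):
    # Grows linearly with the number of occurrences
    return count

def log_tf(count):
    # Sublinear: 1 + ln(count) for present terms, 0 for absent ones
    return 1 + math.log(count) if count > 0 else 0.0

def boolean_tf(count):
    # Only records whether the term occurs at all
    return 1 if count > 0 else 0

def augmented_tf(count, max_count):
    # Scales by the most frequent term in the same document
    # (assumed 0.5 smoothing; intended for terms present in the document)
    return 0.5 + 0.5 * count / max_count

# A term seen 4 times versus a term seen once, in a document whose max count is 4
frequent, rare, max_count = 4, 1, 4
ratios = {
    "raw": raw_tf(frequent) / raw_tf(rare),
    "log": log_tf(frequent) / log_tf(rare),
    "boolean": boolean_tf(frequent) / boolean_tf(rare),
    "augmented": augmented_tf(frequent, max_count) / augmented_tf(rare, max_count),
}
```

The `ratios` dictionary shows how strongly each variant favors the frequent term over the rare one: the raw count favors it most, the Boolean variant not at all, and the log-scaled and augmented variants fall in between.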

Scoring with TF-IDF

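Documents and queries are represented as TF-IDF weighted vectors in the Vector Space Model. Similarity is computed via cosine:

$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert} = \frac{\sum_t w_{t,q} \, w_{t,d}}{\sqrt{\sum_t w_{t,q}^2} \, \sqrt{\sum_t w_{t,d}^2}}$$

where $w_{t,q}$ and $w_{t,d}$ are the TF-IDF weights of term $t$ in the query and in the document.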
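A minimal sketch of cosine scoring over sparse vectors follows. It assumes each vector is stored as a `{term: weight}` dictionary and returns 0.0 for an all-zero vector, which is a convention chosen for this example. The weights in the sample vectors are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for all-zero vectors (an assumption of this sketch)
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights
query = {"cat": 1.0, "ran": 2.0}
doc_a = {"cat": 2.0, "ran": 4.0}   # same direction as the query, twice as long
doc_b = {"dog": 3.0}               # shares no terms with the query

score_a = cosine(query, doc_a)
score_b = cosine(query, doc_b)
```

Because cosine depends only on direction, `doc_a` scores essentially 1 despite having larger weights than the query, and `doc_b` scores 0 because it shares no terms with it. This length invariance is why longer documents are not favored simply for containing more words.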
