Tokenization

Tokenization is the process of breaking a stream of text into smaller units called tokens (e.g., words, phrases, symbols). In IR, tokens are the candidates for becoming terms in the Inverted Index.
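
To make the link between tokens and index terms concrete, here is a minimal sketch of an inverted index built from naively tokenized text. The function name build_inverted_index and the sample documents are hypothetical, chosen only for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization: every token becomes a candidate term.
        for token in text.split():
            index[token].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical two-document collection.
docs = {1: "new home sales", 2: "home prices rise"}
index = build_inverted_index(docs)
print(index["home"])   # [1, 2]
print(index["sales"])  # [1]
```

Whatever the tokenizer emits here is exactly what can later be looked up, so any tokenization decision directly shapes which queries can match.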

Challenges in Tokenization

Splitting by whitespace is rarely enough. Key issues include:

  • Punctuation: “O’Neill” → one token (O’Neill) or two (O and Neill)?
  • Hyphenation: “state-of-the-art” → one token or four?
  • Compounds: “database” (English) vs. “Datenbank” (German) vs. “San Francisco” (multi-word expression).
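  • Numbers/Dates: Keeping values such as 2024-02-24 or $1,000.50 intact rather than splitting them at punctuation.
  • Case Folding: Reducing everything to lowercase conflates distinct words (e.g., “Apple” the company vs. “apple” the fruit).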
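
The issues above can be illustrated with a small regex-based tokenizer, compared against plain whitespace splitting. This is a sketch under stated assumptions, not a production tokenizer: the pattern TOKEN_RE and the choices it encodes (keep hyphenated words, apostrophe names, dates, and currency amounts whole) are hypothetical policy decisions, and other systems may reasonably decide differently.

```python
import re

# Hypothetical token pattern. Alternatives are tried left to right,
# so the more specific shapes (dates, amounts) come first.
TOKEN_RE = re.compile(
    r"""
    \d{4}-\d{2}-\d{2}               # ISO-style dates, kept whole
  | \$?\d+(?:,\d{3})*(?:\.\d+)?     # numbers and currency amounts, kept whole
  | \w+(?:['’-]\w+)*                # words, keeping internal apostrophes and hyphens
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return all non-overlapping matches of TOKEN_RE, in order of appearance."""
    return TOKEN_RE.findall(text)


text = "O'Neill paid $1,000.50 on 2024-02-24 for state-of-the-art tools."

print(text.split()[-1])  # tools.   (whitespace splitting keeps the period)
print(tokenize(text))
# ["O'Neill", 'paid', '$1,000.50', 'on', '2024-02-24', 'for', 'state-of-the-art', 'tools']
```

Multi-word expressions such as “San Francisco” are not handled here; recognizing them usually needs a phrase list or a statistical model on top of the tokenizer.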

Normalization Steps

After splitting, tokens often undergo further normalization:

  • Lowercasing: Standardizing case.
  • Accents: Stripping diacritics (e.g., résumé → resume).
  • Standardization: U.K. → UK.
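
The three steps can be sketched as one function using only the Python standard library. The names strip_accents, normalize, and ACRONYM_RE are hypothetical, and the acronym rule (two or more ASCII letters, each followed by a period) is an assumption made for illustration. Because lowercasing is applied last, U.K. ends up as uk rather than UK.

```python
import re
import unicodedata

# Assumed acronym shape: two or more ASCII letters, each followed by a period.
ACRONYM_RE = re.compile(r"(?:[A-Za-z]\.){2,}")


def strip_accents(token):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Apply acronym standardization, accent stripping, then lowercasing."""
    if ACRONYM_RE.fullmatch(token):
        token = token.replace(".", "")
    token = strip_accents(token)
    return token.lower()


print(normalize("Résumé"))  # resume
print(normalize("U.K."))    # uk
print(normalize("Apple"))   # apple
```

The same function should be applied to both documents and queries; otherwise a normalized index term will never match an unnormalized query token.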
