Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. It is a heuristic process that “chops off” the ends of words.

Porter Stemmer

The Porter stemmer is the most widely used stemming algorithm for English. It applies a fixed sequence of rule phases, each of which strips or rewrites suffixes on the output of the previous phase.

  • Example:
    • connect, connected, connecting, connections → connect
    • running → run (Note: some stemmers might result in run, others in runn)
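
The suffix stripping in the examples above can be sketched with a toy stemmer. This is not the Porter algorithm: the suffix list, the minimum stem length, and the `undouble` flag are assumptions made for illustration, chosen only so that the examples from this section come out as shown.

```python
VOWELS = set("aeiou")

# Hypothetical suffix list, tried in order; longer suffixes come first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def toy_stem(word: str, undouble: bool = True) -> str:
    """Strip the first applicable suffix; optionally collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        # Skip this rule if the remainder is too short or contains no vowel.
        if len(stem) < 3 or not any(c in VOWELS for c in stem):
            continue
        # "runn" -> "run": drop one of two identical trailing consonants.
        if undouble and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
            stem = stem[:-1]
        return stem
    return word


for w in ["connect", "connected", "connecting", "connections", "running"]:
    print(w, "->", toy_stem(w))
print("running ->", toy_stem("running", undouble=False), "(without undoubling)")
```

With `undouble=False` the sketch returns runn for running, which is the difference between stemmers mentioned in the note above.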

Improving Recall

Stemming helps the retrieval system recognize that a user searching for “stems” might also be interested in a document containing “stemming.” Because different word forms are mapped to the same root at both indexing and query time, more relevant documents match the query, which raises recall.
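
A minimal sketch of this effect, assuming a tiny inverted index and a crude stand-in stemmer (`crude_stem`, `build_index`, `search`, and the three documents are all hypothetical, not part of any library):

```python
from collections import defaultdict


def crude_stem(token: str) -> str:
    """Stand-in stemmer: strip "ing" or "s", then collapse a doubled final consonant."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    if len(token) > 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]  # "stemm" -> "stem"
    return token


def identity(token: str) -> str:
    return token


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    # The query must be normalized the same way as the indexed tokens.
    return index.get(normalize(query.lower()), set())


docs = {
    1: "stemming reduces words to a common root",
    2: "plant stems carry water",
    3: "lemmatization uses a dictionary",
}

exact_index = build_index(docs, identity)
stemmed_index = build_index(docs, crude_stem)

print(search(exact_index, "stems", identity))      # only the literal match
print(search(stemmed_index, "stems", crude_stem))  # also matches "stemming"
```

In this toy corpus the exact index returns only document 2, while the stemmed index returns documents 1 and 2.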

Stemming vs. Lemmatization

Feature  | Stemming                          | Lemmatization
---------|-----------------------------------|------------------------------------------
Approach | Heuristic (chopping)              | Morphological analysis (lookup)
Output   | May not be a real word (comput)   | Always a valid word (compute)
Speed    | Very fast                         | Slower (needs dictionary)
Context  | Ignores context                   | Uses POS tags (e.g., saw as noun vs verb)
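
The contrast in the table can be illustrated with two toy functions. Both are assumptions for the sake of the example: `chop` is a bare suffix stripper, and `toy_lemmatize` looks words up in a small hand-written table keyed by (word, POS tag). A real lemmatizer would use a full morphological dictionary and a POS tagger.

```python
def chop(word: str) -> str:
    """Heuristic stemming: strip the first matching suffix, ignoring context."""
    for suffix in ("ing", "ed", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary keyed by (surface form, part of speech).
LEMMAS = {
    ("computing", "VERB"): "compute",
    ("computed", "VERB"): "compute",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Dictionary lookup; falls back to the lowercased word if no entry exists."""
    return LEMMAS.get((word.lower(), pos), word.lower())


print(chop("computing"), toy_lemmatize("computing", "VERB"))
print(chop("saw"), toy_lemmatize("saw", "VERB"), toy_lemmatize("saw", "NOUN"))
```

The stemmer returns comput, which is not a word, and cannot tell the two senses of saw apart. The lookup returns compute, and see or saw depending on the POS tag it is given.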

Trade-offs in IR

  • Increases Recall: More documents match the query terms.
  • Can Decrease Precision: “Over-stemming” conflates unrelated words (e.g., organization and organ might both stem to organ), so a query for one also retrieves documents about the other.
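
The trade-off can be made concrete on a toy corpus. In this sketch the stems are hard-coded in a lookup table (assumed to be what a Porter-style stemmer would output), and the documents and relevance judgments are invented for illustration.

```python
# Assumed stemmer output, hard-coded to keep the example self-contained.
STEMS = {"organization": "organ", "organizations": "organ", "organ": "organ"}


def stem(token: str) -> str:
    return STEMS.get(token, token)


def identity(token: str) -> str:
    return token


def retrieve(docs, query, normalize):
    """Return ids of documents containing a token that normalizes like the query."""
    q = normalize(query)
    return {
        doc_id
        for doc_id, text in docs.items()
        if q in {normalize(t) for t in text.split()}
    }


def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


docs = {
    1: "the organization hired staff",
    2: "organizations publish reports",
    3: "the organ donor registry",
}
relevant = {1, 2}  # invented judgments for the query "organization"

exact = precision_recall(retrieve(docs, "organization", identity), relevant)
stemmed = precision_recall(retrieve(docs, "organization", stem), relevant)
print("exact:  ", exact)
print("stemmed:", stemmed)
```

Exact matching finds only document 1 (precision 1.0, recall 0.5). With stemming all three documents match, so recall rises to 1.0 while precision falls to 2/3 because the organ document is retrieved as well.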
