Stemming

Notes:

Stemming

  • Reduce word endings to find common stem:
    • housing, houses, house → hous
  • Heuristic approach, not linguistic
  • May find common stem for different concepts:
    • organic → organ
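
The bullets above can be illustrated with a deliberately naive suffix stripper. This is only a sketch: the suffix list, the minimum stem length, and the name `naive_stem` are assumptions picked to reproduce the slide's examples, and they are not the rules of any real stemmer.

```python
# Toy suffix list, chosen only to reproduce the slide's examples
# (an assumption, not the rule set of a real stemmer).
# Longer suffixes come first so they win over shorter ones.
SUFFIXES = ("ing", "es", "ic", "e", "s")
MIN_STEM_LENGTH = 3  # hypothetical guard against stripping very short words


def naive_stem(word: str) -> str:
    """Strip the first matching suffix; no linguistic knowledge involved."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


for w in ("housing", "houses", "house", "organic", "organ"):
    print(w, "->", naive_stem(w))
```

The three "house" forms collapse to `hous`, which is the intended effect. "organic" and "organ" also collapse to the same stem, although they describe different concepts.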

Notes:

Porter Stemmer

Implemented in Snowball Stemming DSL

| Match | Replace | Example |
|-------|---------|---------|
| SSES | SS | caresses → caress |
| IES | I | ponies → poni, ties → ti |
| SS | SS | caress → caress |
| S | (empty) | cats → cat |

… and so on

Purely algorithmic, with no linguistic knowledge, but usually good enough.
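
The four match/replace rules listed above can be sketched as a small Python function. It covers only these rules, not the rest of the Porter stemmer, and the names `STEP_RULES` and `apply_plural_rules` are hypothetical. The sketch assumes the rules are tried in the listed order and that only the first match is applied.

```python
# The four suffix rules from the slide, in the order they are tried.
# Only the first matching rule is applied.
STEP_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def apply_plural_rules(word: str) -> str:
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    word = word.lower()
    for suffix, replacement in STEP_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "ties", "caress", "cats"):
    print(w, "->", apply_plural_rules(w))
```

The `SS → SS` rule looks like a no-op, but it has to come before `S → (empty)`. Without it, "caress" would lose its final "s".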

Notes:

Lemmatization

  • Determine root based on linguistic rules
  • Keeps the type of word:
    • A saw → saw, I saw → see
  • Can generate inflections, e.g., what's the plural of house?
  • Benefits of lemmatization over stemming doubtful
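
A minimal sketch of the idea, assuming a tiny hand-written dictionary keyed on word form and part of speech. The tables, the tag names `NOUN` and `VERB`, and the functions `lemmatize` and `plural_of` are made up for illustration; a real lemmatizer would rely on a full lexicon and morphological rules.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma.
LEMMAS = {
    ("saw", "NOUN"): "saw",
    ("saw", "VERB"): "see",
    ("houses", "NOUN"): "house",
}

# Hypothetical inflection table: lemma -> plural form.
PLURALS = {
    "house": "houses",
    "saw": "saws",
}


def lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word given its part of speech."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


def plural_of(lemma: str) -> str:
    """Generate an inflection from a lemma (here, only the plural)."""
    return PLURALS[lemma]


print(lemmatize("saw", "NOUN"))  # "A saw"
print(lemmatize("saw", "VERB"))  # "I saw"
print(plural_of("house"))
```

Unlike the stemming rules, the result depends on the part of speech, so the same surface form "saw" maps to two different lemmas.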

Notes: