The earliest algorithms for automatically assigning parts of speech were based
on a two-stage architecture (Harris, 1962; Klein and Simmons, 1963; Greene
and Rubin, 1971). The first stage used a dictionary to assign each word a list
of potential parts of speech. The second stage used large lists of hand-written
disambiguation rules to winnow down this list to a single part-of-speech for
each word.
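
The two-stage architecture described above can be illustrated with a minimal sketch. Everything in it is a toy assumption made for illustration: the lexicon entries, the tag names (DET, TO, NOUN, VERB, AUX), the two rules, and the helper names (`lookup`, `discard`, `disambiguate`, `tag`) are hypothetical and are not taken from any of the cited systems, whose rule sets were far larger.

```python
# Stage 1 resource: a toy dictionary mapping each word to its candidate tags.
# The entries and tag names are illustrative assumptions.
LEXICON = {
    "the": {"DET"},
    "to": {"TO"},
    "dogs": {"NOUN", "VERB"},
    "hit": {"NOUN", "VERB"},
    "can": {"AUX", "NOUN", "VERB"},
}


def lookup(words):
    """Stage 1: assign every word its full set of candidate tags."""
    # Unknown words default to NOUN, purely as a simplifying assumption.
    return [set(LEXICON.get(w.lower(), {"NOUN"})) for w in words]


def discard(cands, i, tag):
    """Remove a candidate tag from word i, but never the last remaining one."""
    if tag in cands[i] and len(cands[i]) > 1:
        cands[i].remove(tag)
        return True
    return False


def rule_after_determiner(cands, i):
    """Hypothetical rule: no verb or auxiliary right after an unambiguous DET."""
    changed = False
    if i > 0 and cands[i - 1] == {"DET"}:
        for tag in ("VERB", "AUX"):
            changed = discard(cands, i, tag) or changed
    return changed


def rule_after_to(cands, i):
    """Hypothetical rule: no noun or auxiliary right after an unambiguous TO."""
    changed = False
    if i > 0 and cands[i - 1] == {"TO"}:
        for tag in ("NOUN", "AUX"):
            changed = discard(cands, i, tag) or changed
    return changed


RULES = [rule_after_determiner, rule_after_to]


def disambiguate(cands):
    """Stage 2: apply the rules repeatedly until no rule removes a tag."""
    changed = True
    while changed:
        changed = False
        for i in range(len(cands)):
            for rule in RULES:
                if rule(cands, i):
                    changed = True
    return cands


def tag(words):
    """Run both stages and return one candidate set per word."""
    return disambiguate(lookup(words))
```

Under these assumptions, `tag(["the", "hit"])` keeps only the noun reading of hit, while `tag(["to", "hit"])` keeps only the verb reading. A word whose context triggers no rule, as in `tag(["dogs", "hit"])`, stays ambiguous, which is why real systems needed large rule lists.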
The ENGTWOL tagger (Voutilainen, 1995) is based on the same two-stage
architecture, although both the lexicon and the disambiguation rules
are much more sophisticated than those of the early algorithms. The ENGTWOL
lexicon is based on the two-level morphology described in Chapter 3, and
has about 56,000 entries for English word stems (Heikkilä, 1995), counting
a word with multiple parts of speech (e.g., nominal and verbal senses of hit)
as separate entries, and not counting inflected forms or many derived
forms. Each entry is annotated with a set of morphological and syntactic
features. Figure 8.8 shows some selected words, together with a slightly
simplified listing of their features.
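
One way to picture this counting convention is a lexicon stored as a flat list of (stem, part of speech, features) entries, so that a stem with two parts of speech contributes two entries. The sketch below is an assumption-laden illustration, not ENGTWOL's actual representation: the `LexEntry` class, the `build_index` helper, and the feature labels are hypothetical placeholders rather than the features listed in Figure 8.8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexEntry:
    """One lexicon entry: a stem, one part of speech, and its features."""
    stem: str
    pos: str
    features: frozenset = frozenset()


# Toy entries. The feature labels are placeholders chosen for illustration.
ENTRIES = [
    LexEntry("hit", "N", frozenset({"SG"})),       # nominal sense of "hit"
    LexEntry("hit", "V", frozenset({"PRESENT"})),  # verbal sense of "hit"
    LexEntry("dog", "N", frozenset({"SG"})),
]


def build_index(entries):
    """Group entries by stem so a lookup returns every reading of a word."""
    index = {}
    for entry in entries:
        index.setdefault(entry.stem, []).append(entry)
    return index


INDEX = build_index(ENTRIES)
```

Here `len(ENTRIES)` counts entries in the sense used above (hit is counted twice), whereas `len(INDEX)` counts distinct stems. Inflected forms such as hits would not be listed; in this picture they would be recovered by morphological analysis of the stem.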