Introduction
Part-of-speech (POS) tagging assigns grammatical categories (noun, verb, adjective, etc.) to words in sentences. Foundational NLP task: enables downstream parsing, semantic analysis, machine translation. Most NLP systems assume POS tags available.
Challenge: words often ambiguous. "Bank" noun (financial institution) or verb (to tilt, as an aircraft banks). "Read" verb (present tense) or past tense? Context disambiguates: "I read the book" (past), "I read every day" (present).
Historical evolution: manual tagging (precise, expensive), rule-based (fast, error-prone), statistical models (HMM, CRF), neural models (state-of-the-art). Modern systems: 96-97% accuracy on English, lower for morphologically complex languages.
"Part-of-speech tagging is the gateway to linguistic structure. Once you know what part of speech each word is, parsing and understanding follow." -- Christopher Manning, Stanford NLP
POS Categories and Tagsets
Universal POS Tags
12 universal categories across languages: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, X, PUNCT. Enables cross-lingual comparison. Language-independent layer.
Language-Specific Tagsets
Penn Treebank (English): 45 tags. Includes fine distinctions: NN (noun singular), NNS (noun plural), VB (verb base), VBD (verb past), VBG (verb gerund). Richer than universal (captures morphology), but language-specific.
Examples (Penn Treebank):
Noun tags: NN, NNS, NNP, NNPS (singular, plural, proper singular, proper plural)
Verb tags: VB, VBD, VBG, VBN, VBP, VBZ
Adjective: JJ, JJR, JJS (base, comparative, superlative)
Morphological Tagsets
Languages with rich morphology (Czech, German, Arabic) require fine-grained tags. Czech PDT: 1,500+ tag combinations (POS + case + number + gender + tense + aspect + mood + voice). Captures inflectional information.
Tag Granularity Trade-off
| Granularity | Examples | Pros | Cons |
|---|---|---|---|
| Universal (12) | NOUN, VERB, ADJ | Cross-lingual, simple | Loses morphology |
| Language (45) | NN, VBD, JJR | Captures inflection | Language-specific |
| Morphological (1000+) | NOUN.NOM.SING, VERB.PAST.3 | Full morphology | Data sparsity, complex |
The Tagging Ambiguity Problem
Structural Ambiguity
"Bank" → noun (financial) or verb (tilt/recline). Many of the most frequent English words are ambiguous: "run", "book", "play", "read".
Ambiguous words:
"book" - noun (I read a book) or verb (Book a flight)
"run" - noun (morning run) or verb (Run away)
"play" - noun (a Shakespeare play) or verb (play music)
Context Resolution
Context disambiguates. "I book a flight" (verb). "A book on NLP" (noun). Requires understanding surrounding words, syntax, semantics.
Ambiguity Statistics
English: only ~15% of word types ambiguous (appear as multiple POS in a corpus), but these include the most frequent words, so ~40% of word tokens are ambiguous in context. Rare words rarely ambiguous (seldom observed with multiple tags).
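The type-vs-token distinction above can be measured directly on any tagged corpus. A minimal sketch (the toy corpus and function name are illustrative):

```python
from collections import defaultdict

def ambiguity_stats(tagged_corpus):
    """Share of word types and of word tokens that occur with more
    than one POS tag in a tagged corpus of (word, tag) pairs."""
    tags_per_type = defaultdict(set)
    for word, tag in tagged_corpus:
        tags_per_type[word.lower()].add(tag)
    ambiguous = {w for w, tags in tags_per_type.items() if len(tags) > 1}
    ambiguous_tokens = sum(1 for w, _ in tagged_corpus if w.lower() in ambiguous)
    return (len(ambiguous) / len(tags_per_type),
            ambiguous_tokens / len(tagged_corpus))

# Toy corpus: "book" occurs as both VERB and NOUN.
corpus = [("I", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN"),
          ("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("long", "ADJ")]
type_rate, token_rate = ambiguity_stats(corpus)  # 1 of 7 types, 2 of 8 tokens
```

On real corpora the token rate far exceeds the type rate, because the ambiguous types are the frequent ones.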
Cascading Errors
Incorrect POS tag damages downstream tasks: parsing (syntax rules assume correct tags), NER (context affected), semantic role labeling. Early errors compound.
Rule-Based Tagging
Morphological Heuristics
Suffix patterns indicate POS: "-ly" → adverb (quickly), "-ing" → gerund or participle, "-ed" → past tense. Simple rule set covers common cases. ENGTWOL (English Two-Level) morphological analyzer uses 2,000+ rules.
Contextual Rules
Patterns disambiguate: article (the, a) precedes noun. Preposition (in, on) precedes noun phrase. "the X" → X noun. "V_ly" (adverb directly after verb) → adverb.
Rule examples:
IF previous word = DET THEN current = NOUN (with high probability)
IF word ends with -ly THEN = ADV
IF word is all caps THEN = NOUN (acronym)
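The rules above can be sketched as a tiny tagger. This is a hedged illustration (the rule set and priority order are hypothetical, not from any production system):

```python
def rule_tag(words):
    """Minimal rule-based tagger: one contextual rule (word after a
    determiner is a noun) plus morphological suffix rules, with a
    noun default. Illustrative rule set only."""
    determiners = {"the", "a", "an"}
    tags = []
    for i, w in enumerate(words):
        lw = w.lower()
        if lw in determiners:
            tags.append("DET")
        elif i > 0 and words[i - 1].lower() in determiners:
            tags.append("NOUN")          # "the X" -> X is a noun
        elif lw.endswith("ly"):
            tags.append("ADV")           # -ly -> adverb
        elif lw.endswith("ed") or lw.endswith("ing"):
            tags.append("VERB")          # -ed / -ing -> verb form
        elif w.isupper() and len(w) > 1:
            tags.append("NOUN")          # all caps -> acronym
        else:
            tags.append("NOUN")          # default: most common open class
    return tags

rule_tag(["The", "dog", "barked", "loudly"])
```

Note the brittleness: irregular forms like "ran" match no suffix rule and fall through to the noun default, which is exactly why hand-written rules need constant patching.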
Brill Tagging
Transformation-based learning: start with a baseline (most-frequent tag per word), then apply learned transformation rules. Rules are scored by error reduction: which transformation most improves training accuracy?
Example: "If previous tag = VERB and current = NN, change to VB". Learns transformations automatically from errors.
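One iteration of this loop can be sketched as follows: build the most-frequent-tag baseline, then score every candidate rule of the (hypothetical) template "change tag A to B when the previous predicted tag is P" by net error reduction. A simplified sketch of the Brill idea, not the full algorithm:

```python
from collections import Counter, defaultdict

def most_frequent_baseline(train):
    """Baseline: tag every word with its most frequent training tag."""
    counts = defaultdict(Counter)
    for word, tag in train:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def best_transformation(train, baseline):
    """Score each rule (from_tag -> to_tag when previous predicted tag
    is prev) by errors fixed minus correct tags broken; return the best."""
    preds = [baseline[w] for w, _ in train]
    tagset = {t for _, t in train}
    scores = Counter()
    for i, (word, gold) in enumerate(train):
        prev = preds[i - 1] if i > 0 else "<s>"
        for to_tag in tagset - {preds[i]}:
            fixes = 1 if to_tag == gold else 0
            breaks = 1 if preds[i] == gold else 0
            scores[(preds[i], to_tag, prev)] += fixes - breaks
    return scores.most_common(1)[0][0]

train = [("I", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN"),
         ("the", "DET"), ("book", "NOUN"), ("I", "PRON"), ("book", "VERB"),
         ("a", "DET"), ("room", "NOUN")]
baseline = most_frequent_baseline(train)
rule = best_transformation(train, baseline)   # (from_tag, to_tag, prev_tag)
```

Here the baseline tags every "book" as VERB; the learned rule corrects it to NOUN after a determiner. Full Brill tagging iterates this step, re-applying each accepted rule before learning the next.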
Limitations
Manual rules: labor-intensive, brittle. Domain-specific rules needed. Hybrid approaches (rules + statistics) often used in practice.
Statistical Foundations
Bayesian Framework
Find most likely tag sequence T = t_1, t_2, ..., t_n given word sequence W = w_1, ..., w_n:
T* = argmax_T P(T | W) = argmax_T P(W | T) * P(T) / P(W)
= argmax_T P(W | T) * P(T)    (P(W) constant over T, so it drops out)
P(W | T): likelihood (which words generated by tag sequence)
P(T): prior (how likely tag sequence)
Independence Assumptions
Naive Bayes assumes independence: P(W | T) = product P(w_i | t_i). Unrealistic (words not independent), but approximation works. Markov chain assumes limited tag history: P(t_i | t_1,...,t_{i-1}) ≈ P(t_i | t_{i-1}).
Estimating Probabilities
Maximum Likelihood: P(w | t) = count(w, t) / count(t). Intuitive but zero-probability problem: unseen word-tag pairs get probability 0.
Smoothing: add a pseudocount (Laplace smoothing) or back off to lower-order n-grams.
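The MLE estimate with add-alpha smoothing can be written directly from counts. A minimal sketch (function name and toy data are illustrative):

```python
from collections import Counter

def emission_probs(tagged, alpha=1.0):
    """Smoothed emission estimates:
    P(w | t) = (count(w, t) + alpha) / (count(t) + alpha * |V|).
    alpha=1 gives Laplace (add-one) smoothing."""
    pair = Counter(tagged)                  # (word, tag) counts
    tag = Counter(t for _, t in tagged)     # tag counts
    vocab_size = len({w for w, _ in tagged})
    def p(word, t):
        return (pair[(word, t)] + alpha) / (tag[t] + alpha * vocab_size)
    return p

tagged = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
          ("the", "DET"), ("cat", "NOUN")]
p = emission_probs(tagged)
# Unseen pairs like ("runs", "NOUN") now get small nonzero probability.
```

Without smoothing, a single unseen word-tag pair zeroes out an entire tag sequence's probability, which is why it is essential in practice.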
Hidden Markov Model Tagging
HMM Framework
Hidden states: POS tags. Observable: words. Two probability distributions:
Transition probability: P(t_i | t_{i-1}) likelihood of tag sequence
Emission probability: P(w_i | t_i) likelihood of observing word given tag
P(W, T) = P(T) * P(W | T)
= product P(t_i | t_{i-1}) * P(w_i | t_i)
Viterbi Decoding
Find highest-probability tag sequence. Dynamic programming: compute maximum probability path to each (state, time) pair, backtrack to recover sequence. Efficient: O(N * |tags|^2) where N = sequence length.
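The dynamic program fits in a few lines. A sketch with log probabilities (the toy transition/emission tables are illustrative; a tiny floor constant stands in for proper smoothing):

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Viterbi decoding: best[i][t] = log probability of the best tag
    path ending in tag t at position i; back[i][t] = its predecessor.
    trans[(prev, t)] and emit[(t, word)] are probabilities."""
    FLOOR = 1e-12  # stand-in for smoothing of unseen events
    best = [{} for _ in words]
    back = [{} for _ in words]
    for t in tags:
        best[0][t] = (math.log(trans.get((start, t), FLOOR)) +
                      math.log(emit.get((t, words[0]), FLOOR)))
    for i in range(1, len(words)):
        for t in tags:
            score, arg = max(
                (best[i - 1][p] + math.log(trans.get((p, t), FLOOR)), p)
                for p in tags)
            best[i][t] = score + math.log(emit.get((t, words[i]), FLOOR))
            back[i][t] = arg
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])       # follow backpointers
    return list(reversed(path))

tags = ["DET", "NOUN", "VERB"]
trans = {("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.2,
         ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.8, ("VERB", "DET"): 0.5}
emit = {("DET", "the"): 0.9, ("NOUN", "dog"): 0.4,
        ("VERB", "barks"): 0.3, ("NOUN", "barks"): 0.01}
viterbi(["the", "dog", "barks"], tags, trans, emit)
```

The inner max over previous tags is what gives the N * |tags|^2 cost; log space avoids underflow on long sentences.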
Training
Maximum likelihood: count tag transitions and emissions in training data. Smoothing essential (rare transitions get very low probability). Typical: add-one smoothing or backoff.
Performance
HMM baseline: ~95-96% on English (Penn Treebank). Simple, efficient, competitive for many tasks. Limitations: no long-range dependencies (Markov assumption), independence of emissions.
Trigram HMM
Extend: P(t_i | t_{i-2}, t_{i-1}). More context, better performance (~97%). Increased complexity and data sparsity. Rare trigrams smoothed by backing off to bigram and unigram estimates.
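One common way to combine the three orders is linear interpolation rather than strict backoff. A sketch (the interpolation weights are illustrative; in practice they are tuned, e.g. by deleted interpolation, on held-out data):

```python
from collections import Counter

def trigram_estimator(tag_seqs, l3=0.6, l2=0.3, l1=0.1):
    """Interpolated trigram transition model:
    P(t3 | t1, t2) ~= l3*P_ML(t3|t1,t2) + l2*P_ML(t3|t2) + l1*P_ML(t3)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    hist1, hist2 = Counter(), Counter()
    total = 0
    for seq in tag_seqs:
        padded = ["<s>", "<s>"] + seq
        for i in range(2, len(padded)):
            t1, t2, t3 = padded[i - 2], padded[i - 1], padded[i]
            tri[(t1, t2, t3)] += 1; hist2[(t1, t2)] += 1
            bi[(t2, t3)] += 1;      hist1[t2] += 1
            uni[t3] += 1;           total += 1
    def p(t1, t2, t3):
        p3 = tri[(t1, t2, t3)] / hist2[(t1, t2)] if hist2[(t1, t2)] else 0.0
        p2 = bi[(t2, t3)] / hist1[t2] if hist1[t2] else 0.0
        p1 = uni[t3] / total
        return l3 * p3 + l2 * p2 + l1 * p1
    return p

p = trigram_estimator([["DET", "NOUN", "VERB"], ["DET", "NOUN", "NOUN"]])
# Unseen trigram histories still receive probability via the unigram term.
```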
Conditional Random Fields for POS
CRF vs. HMM
HMM generative: models P(W, T). CRF discriminative: models P(T | W). CRF conditions on observations, avoids modeling word distribution (unnecessary). More flexible feature engineering.
CRF Features
Transition features: t_{i-1} = VB, t_i = NN
Emission features: w_i = "quickly", t_i = ADV
Enriched features: w_i ends with -ly, t_i = ADV
Complex features: (w_{i-1} = "run", t_{i-1} = NN, t_i = VB)
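The feature templates above correspond to a feature function evaluated at each position. A minimal sketch (the feature names and templates are illustrative, not a fixed standard):

```python
def crf_features(words, i, prev_tag):
    """Feature function for a linear-chain CRF at position i:
    returns a set of binary feature names, mirroring the
    transition/emission/enriched templates listed above."""
    w = words[i].lower()
    feats = {
        f"word={w}",                               # emission feature
        f"prev_tag={prev_tag}",                    # transition feature
        f"suffix3={w[-3:]}",                       # enriched: suffix
        f"is_capitalized={words[i][0].isupper()}", # enriched: shape
    }
    if w.endswith("ly"):
        feats.add("ends_ly")                       # -ly -> likely adverb
    if i > 0:
        feats.add(f"prev_word={words[i - 1].lower()}")  # complex feature
    return feats

crf_features(["She", "runs", "quickly"], 2, "VERB")
```

Each such binary feature gets its own learned weight; this freedom to condition on arbitrary properties of the whole word sequence is exactly what the HMM's generative emission model lacks.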
Training and Inference
Training: maximize conditional likelihood (L-BFGS optimization, L2 regularization). Inference: Viterbi algorithm (same as HMM). CRF training is far slower than HMM's simple counting; inference is somewhat slower (more features to compute).
Performance
CRF: ~96.5-97% English. Better than HMM. More features allow finer modeling. Standard approach pre-neural era. Competitive today when engineered features provided.
Hybrid Systems
Practical systems: CRF backbone + hand-crafted rules (disambiguate rare cases). Rules override CRF for special cases (acronyms, numbers). Hybrid achieves 97%+ accuracy.
Neural Approaches
Bidirectional LSTM
Process word sequence both directions. At each position, hidden state h_i combines forward context (w_1...w_i) and backward context (w_i...w_n). Better than unidirectional.
BiLSTM:
Forward LSTM: w_1 → hf_1, w_2 → hf_2, ..., w_n → hf_n
Backward LSTM: w_n → hb_n, ..., w_1 → hb_1
Output: h_i = [hf_i ; hb_i] (concatenation)
Tagging: softmax(W * h_i)
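The bidirectional concatenation scheme can be shown with a tiny vanilla RNN (tanh recurrence) in place of an LSTM, to keep the sketch short. Untrained random weights, illustrative only:

```python
import numpy as np

def birnn_hidden_states(X, W_in, W_rec):
    """Run a vanilla tanh RNN over the sequence in both directions
    (sharing weights between directions for brevity) and return
    h_i = [hf_i ; hb_i] for each position."""
    def run(seq):
        h = np.zeros(W_rec.shape[0])
        states = []
        for x in seq:
            h = np.tanh(W_in @ x + W_rec @ h)
            states.append(h)
        return states
    fwd = run(X)                 # left-to-right pass
    bwd = run(X[::-1])[::-1]     # right-to-left pass, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
X = [rng.normal(size=4) for _ in range(3)]   # 3 word embeddings, dim 4
W_in = rng.normal(size=(5, 4))               # hidden dim 5
W_rec = rng.normal(size=(5, 5))
H = birnn_hidden_states(X, W_in, W_rec)      # each h_i has dim 10
```

After re-alignment, position i's backward state has seen words i..n, so the concatenated h_i summarizes the entire sentence around that word; the tagger then applies softmax(W * h_i) per position.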
Character Embeddings
Words encoded as character sequences. CNN or LSTM processes characters → word embedding. Captures morphology: "quickly" = "quick" + "ly". Handles OOV gracefully.
Multi-Task Learning
Joint training on multiple tasks (POS + chunking + NER). Shared representations learn linguistic structure. Improves performance especially for low-resource languages.
Transformer Models
BERT fine-tuned for POS achieves 97.5%+ English. Massive pre-training on unlabeled text provides strong initialization. Contextualized embeddings capture word sense from context.
Performance Comparison
| Model | English Accuracy | Speed | Data Required |
|---|---|---|---|
| Rule-Based | ~90% | Very fast | None (hand-crafted) |
| HMM | 95-96% | Fast | Minimal (counts) |
| CRF | 96-97% | Moderate | Modest (training) |
| BiLSTM | 97-97.5% | Slow | Moderate |
| BERT | 97.5%+ | Slow (inference) | Minimal (fine-tuning) |
Morphological Tagging and Inflection
Morphologically Rich Languages
Languages like Czech, Hungarian, Turkish, Arabic have complex inflection. Single word encodes multiple attributes (case, number, gender, tense, aspect, mood). Word form ambiguity high.
Czech example: "kniha" (book nominative singular), "knihy" (books nominative plural or genitive singular). POS tags include morphological features: NOUN;NOM;SING vs. NOUN;NOM;PL.
Morphological Analysis
Pre-process: decompose word into morphemes. "Unkindly" → "un-" + "kind" + "-ly". Increases feature space, helps models understand word structure.
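The decomposition can be sketched as greedy affix stripping. Real analyzers use lexicons and two-level rules; the affix lists here are a tiny illustrative subset:

```python
def split_morphemes(word,
                    prefixes=("un", "re", "dis"),
                    suffixes=("ly", "ing", "ed", "ness")):
    """Peel one known prefix and one known suffix off a word to expose
    the stem: 'unkindly' -> ['un-', 'kind', '-ly']. Length guards keep
    short words like 'red' from being split."""
    morphs, tail = [], []
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p) + 2:
            morphs.append(p + "-")
            word = word[len(p):]
            break
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s) + 2:
            tail.append("-" + s)
            word = word[:-len(s)]
            break
    return morphs + [word] + tail

split_morphemes("unkindly")
```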
Joint Tagging and Lemmatization
Jointly predict POS tag and lemma (base form). "Books" → tag: NOUN, lemma: "book". Richer output than POS alone, useful for downstream tasks.
Cross-Lingual Transfer
Train on high-resource language (English), test on low-resource (rare language). Challenges: different morphology, writing systems. mBERT enables cross-lingual: single model for 100+ languages.
Multilingual and Cross-Lingual POS
Universal Dependencies
Annotation scheme covering 100+ languages consistently. Universal POS tags (NOUN, VERB, etc.) + language-specific features. Enables cross-lingual comparison and zero-shot transfer.
Zero-Shot POS Tagging
Train on English, test on unseen language (e.g., Telugu). Transfer via shared embeddings. Performance degrades but non-trivial transfer occurs. mBERT enables this: 40-70% accuracy on zero-shot (vs. 95% with in-language data).
Multilingual Models
mBERT, XLM-R: pre-trained on 100+ languages jointly. Single model tags any language. Cross-lingual regularization: similar words in different languages share representations. Improves low-resource performance.
Language-Specific Challenges
- Agglutination (Turkish, Finnish): Single word multiple morphemes. Segmentation challenge.
- No spaces (Chinese, Japanese): Word segmentation first, then POS tagging.
- Rich case system (Czech, German): Many fine-grained POS tags.
- Morphological ambiguity: Same form multiple tags depending on context.
Applications
Syntactic Parsing
POS tags essential for parsing. Parsers assume correct tags. Tag errors propagate. "Bank/NN" vs. "Bank/VB" affects parse structure (noun vs. verb predicate).
Information Extraction
NER uses POS: capitalized noun likely person. Verb-object patterns identify relations: "acquire company" (acquirer=subject, company=object).
Speech Synthesis
Pronunciation depends on POS: "read" (present verb) pronounced /riːd/, "read" (past) pronounced /rɛd/. TTS systems use POS for correct pronunciation.
Lemmatization
Reducing to base form requires POS: "running" → "run" (verb form). "Running" alone ambiguous (noun or verb).
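The dependence on POS shows up directly in rule-based lemmatization: the same suffix rule fires only when the tag licenses it. A sketch (the exception table and rules are a tiny illustrative subset; real lemmatizers use full lexicons):

```python
def lemmatize(word, pos):
    """POS-aware lemmatization: the noun plural rule and the verb
    past-tense rule are gated by the tag."""
    exceptions = {("ran", "VERB"): "run", ("geese", "NOUN"): "goose"}
    w = word.lower()
    if (w, pos) in exceptions:
        return exceptions[(w, pos)]
    if pos == "NOUN" and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                       # books -> book
    if pos == "VERB" and w.endswith("ed") and len(w) > 4:
        return w[:-2]                       # walked -> walk
    return w

lemmatize("walked", "VERB")
```

Note how the tag changes the output: "walked" lemmatizes to "walk" as a VERB but is left untouched under any other tag, which is exactly why lemmatization without POS is ill-defined.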
Machine Translation
Word order in translation depends on POS. Noun phrases organized differently in SOV vs. SVO languages. Improves alignment.
Corpus Linguistics
Search by POS pattern: find all adjective-noun sequences, compare frequencies across languages. Understanding language structure, variation.
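POS-pattern search reduces to regex matching over the tag sequence. A minimal sketch (function name and toy corpus are illustrative):

```python
import re

def find_tag_pattern(tagged, pattern):
    """Find token spans whose tag sequence matches a regex, e.g. all
    adjective-noun pairs: match over the space-joined tag string and
    map character offsets back to token indices."""
    tagstr = " ".join(t for _, t in tagged)
    hits = []
    for m in re.finditer(pattern, tagstr):
        start = tagstr[:m.start()].count(" ")        # tokens before match
        end = start + m.group().count(" ") + 1       # tokens inside match
        hits.append([w for w, _ in tagged[start:end]])
    return hits

tagged = [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB"),
          ("a", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]
find_tag_pattern(tagged, r"\bADJ NOUN\b")
```

Corpus query tools generalize this idea, letting linguists search millions of tokens by tag pattern rather than by literal words.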
References
- Manning, C. D., and Schütze, H. "Foundations of Statistical Natural Language Processing." MIT Press, 1999.
- Brill, E. "A Simple Rule-Based Part of Speech Tagger." ANLC, 1992.
- Liang, P. "Semi-Supervised Learning for Natural Language." Master's thesis, MIT, 2005.
- Huang, Z., Xu, W., and Yu, K. "Bidirectional LSTM-CRF Models for Sequence Tagging." arXiv:1508.01991, 2015.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019.