Introduction
Part-of-speech (POS) tagging assigns grammatical categories (noun, verb, adjective, etc.) to words in sentences. Foundational NLP task: enables downstream parsing, semantic analysis, machine translation. Most NLP systems assume POS tags available.
Challenge: words often ambiguous. "Bank" noun (financial institution) or verb (to tilt, as an aircraft banks). "Read" verb (present tense) or past tense? Context disambiguates: "I read the book" (past), "I read every day" (present).
Historical evolution: manual tagging (precise, expensive), rule-based (fast, error-prone), statistical models (HMM, CRF), neural models (state-of-the-art). Modern systems: 96-97% accuracy on English, lower for morphologically complex languages.
"Part-of-speech tagging is the gateway to linguistic structure. Once you know what part of speech each word is, parsing and understanding follow." -- Christopher Manning, Stanford NLP
POS Categories and Tagsets
Universal POS Tags
12 universal categories across languages: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, X, PUNCT. Enables cross-lingual comparison. Language-independent layer.
Language-Specific Tagsets
Penn Treebank (English): 45 tags. Includes fine distinctions: NN (noun singular), NNS (noun plural), VB (verb base), VBD (verb past), VBG (verb gerund). Richer than universal (captures morphology), but language-specific.
Examples (Penn Treebank):
Noun tags: NN, NNS, NNP, NNPS (singular, plural, proper singular, proper plural)
Verb tags: VB, VBD, VBG, VBN, VBP, VBZ
Adjective: JJ, JJR, JJS (base, comparative, superlative)
Morphological Tagsets
Languages with rich morphology (Czech, German, Arabic) require fine-grained tags. Czech PDT: 1,500+ tag combinations (POS + case + number + gender + tense + aspect + mood + voice). Captures inflectional information.
Tag Granularity Trade-off
| Granularity | Examples | Pros | Cons |
|---|---|---|---|
| Universal (12) | NOUN, VERB, ADJ | Cross-lingual, simple | Loses morphology |
| Language (45) | NN, VBD, JJR | Captures inflection | Language-specific |
| Morphological (1000+) | NOUN.NOM.SING, VERB.PAST.3 | Full morphology | Data sparsity, complex |
The Tagging Ambiguity Problem
Structural Ambiguity
"Bank" → noun (financial) or verb (tilt/recline). Many of the most frequent English words are ambiguous: "run", "book", "play", "read".
Ambiguous words:
"book" - noun (I read a book) or verb (Book a flight)
"run" - noun (morning run) or verb (Run away)
"play" - noun (a Shakespeare play) or verb (play music)
Context Resolution
Context disambiguates. "I book a flight" (verb). "A book on NLP" (noun). Requires understanding surrounding words, syntax, semantics.
Ambiguity Statistics
English: only ~15% of word types ambiguous (appear as multiple POS in a corpus), but these include the most frequent words, so ~40% of word tokens are ambiguous in context. Rare words rarely ambiguous (seldom observed with multiple tags).
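The type-vs-token distinction above can be measured directly on any tagged corpus. A minimal sketch (the toy corpus and function name are illustrative):

```python
from collections import defaultdict

def ambiguity_stats(tagged_corpus):
    """Share of word types and of word tokens that occur with more
    than one POS tag in a tagged corpus of (word, tag) pairs."""
    tags_per_type = defaultdict(set)
    for word, tag in tagged_corpus:
        tags_per_type[word.lower()].add(tag)
    ambiguous = {w for w, tags in tags_per_type.items() if len(tags) > 1}
    ambiguous_tokens = sum(1 for w, _ in tagged_corpus if w.lower() in ambiguous)
    return (len(ambiguous) / len(tags_per_type),
            ambiguous_tokens / len(tagged_corpus))

# Toy corpus: "book" occurs as both VERB and NOUN.
corpus = [("I", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN"),
          ("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("long", "ADJ")]
type_rate, token_rate = ambiguity_stats(corpus)  # 1 of 7 types, 2 of 8 tokens
```

On real corpora the token rate far exceeds the type rate, because the ambiguous types are the frequent ones.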
Cascading Errors
Incorrect POS tag damages downstream tasks: parsing (syntax rules assume correct tags), NER (context affected), semantic role labeling. Early errors compound.
Rule-Based Tagging
Morphological Heuristics
Suffix patterns indicate POS: "-ly" → adverb (quickly), "-ing" → gerund or participle, "-ed" → past tense. Simple rule set covers common cases. ENGTWOL (English Two-Level) morphological analyzer uses 2,000+ rules.
Contextual Rules
Patterns disambiguate: article (the, a) precedes noun. Preposition (in, on) precedes noun phrase. "the X" → X noun. "V_ly" (adverb directly after verb) → adverb.
Rule examples:
IF previous word = DET THEN current = NOUN (with high probability)
IF word ends with -ly THEN = ADV
IF word is all caps THEN = NOUN (acronym)
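The rules above can be sketched as a tiny tagger. This is a hedged illustration (the rule set and priority order are hypothetical, not from any production system):

```python
def rule_tag(words):
    """Minimal rule-based tagger: one contextual rule (word after a
    determiner is a noun) plus morphological suffix rules, with a
    noun default. Illustrative rule set only."""
    determiners = {"the", "a", "an"}
    tags = []
    for i, w in enumerate(words):
        lw = w.lower()
        if lw in determiners:
            tags.append("DET")
        elif i > 0 and words[i - 1].lower() in determiners:
            tags.append("NOUN")          # "the X" -> X is a noun
        elif lw.endswith("ly"):
            tags.append("ADV")           # -ly -> adverb
        elif lw.endswith("ed") or lw.endswith("ing"):
            tags.append("VERB")          # -ed / -ing -> verb form
        elif w.isupper() and len(w) > 1:
            tags.append("NOUN")          # all caps -> acronym
        else:
            tags.append("NOUN")          # default: most common open class
    return tags

rule_tag(["The", "dog", "barked", "loudly"])
```

Note the brittleness: irregular forms like "ran" match no suffix rule and fall through to the noun default, which is exactly why hand-written rules need constant patching.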
Brill Tagging
Transformation-based learning: start with a baseline (most-frequent tag per word), then apply learned transformation rules. Rules are scored by error reduction: which transformation most improves training accuracy?
Example: "If previous tag = VERB and current = NN, change to VB". Learns transformations automatically from errors.
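One iteration of this loop can be sketched as follows: build the most-frequent-tag baseline, then score every candidate rule of the (hypothetical) template "change tag A to B when the previous predicted tag is P" by net error reduction. A simplified sketch of the Brill idea, not the full algorithm:

```python
from collections import Counter, defaultdict

def most_frequent_baseline(train):
    """Baseline: tag every word with its most frequent training tag."""
    counts = defaultdict(Counter)
    for word, tag in train:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def best_transformation(train, baseline):
    """Score each rule (from_tag -> to_tag when previous predicted tag
    is prev) by errors fixed minus correct tags broken; return the best."""
    preds = [baseline[w] for w, _ in train]
    tagset = {t for _, t in train}
    scores = Counter()
    for i, (word, gold) in enumerate(train):
        prev = preds[i - 1] if i > 0 else "<s>"
        for to_tag in tagset - {preds[i]}:
            fixes = 1 if to_tag == gold else 0
            breaks = 1 if preds[i] == gold else 0
            scores[(preds[i], to_tag, prev)] += fixes - breaks
    return scores.most_common(1)[0][0]

train = [("I", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN"),
         ("the", "DET"), ("book", "NOUN"), ("I", "PRON"), ("book", "VERB"),
         ("a", "DET"), ("room", "NOUN")]
baseline = most_frequent_baseline(train)
rule = best_transformation(train, baseline)   # (from_tag, to_tag, prev_tag)
```

Here the baseline tags every "book" as VERB; the learned rule corrects it to NOUN after a determiner. Full Brill tagging iterates this step, re-applying each accepted rule before learning the next.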
Limitations
Manual rules: labor-intensive, brittle. Domain-specific rules needed. Hybrid approaches (rules + statistics) often used in practice.
Statistical Foundations
Bayesian Framework
Find most likely tag sequence T = t_1, t_2, ..., t_n given word sequence W = w_1, ..., w_n:
T* = argmax_T P(T | W) = argmax_T P(W | T) * P(T) / P(W)
= argmax_T P(W | T) * P(T)    (P(W) constant over T, so it drops out)
P(W | T): likelihood (which words generated by tag sequence)
P(T): prior (how likely tag sequence)
Independence Assumptions
Naive Bayes assumes independence: P(W | T) = product P(w_i | t_i). Unrealistic (words not independent), but approximation works. Markov chain assumes limited tag history: P(t_i | t_1,...,t_{i-1}) ≈ P(t_i | t_{i-1}).
Estimating Probabilities
Maximum Likelihood: P(w | t) = count(w, t) / count(t). Intuitive but zero-probability problem: unseen word-tag pairs get probability 0.
Smoothing: add a pseudocount (Laplace smoothing) or back off to lower-order n-grams.
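The MLE estimate with add-alpha smoothing can be written directly from counts. A minimal sketch (function name and toy data are illustrative):

```python
from collections import Counter

def emission_probs(tagged, alpha=1.0):
    """Smoothed emission estimates:
    P(w | t) = (count(w, t) + alpha) / (count(t) + alpha * |V|).
    alpha=1 gives Laplace (add-one) smoothing."""
    pair = Counter(tagged)                  # (word, tag) counts
    tag = Counter(t for _, t in tagged)     # tag counts
    vocab_size = len({w for w, _ in tagged})
    def p(word, t):
        return (pair[(word, t)] + alpha) / (tag[t] + alpha * vocab_size)
    return p

tagged = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
          ("the", "DET"), ("cat", "NOUN")]
p = emission_probs(tagged)
# Unseen pairs like ("runs", "NOUN") now get small nonzero probability.
```

Without smoothing, a single unseen word-tag pair zeroes out an entire tag sequence's probability, which is why it is essential in practice.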
Hidden Markov Model Tagging
HMM Framework
Hidden states: POS tags. Observable: words. Two probability distributions:
Transition probability: P(t_i | t_{i-1}) likelihood of tag sequence
Emission probability: P(w_i | t_i) likelihood of observing word given tag
P(W, T) = P(T) * P(W | T)
= product P(t_i | t_{i-1}) * P(w_i | t_i)
Viterbi Decoding
Find highest-probability tag sequence. Dynamic programming: compute maximum probability path to each (state, time) pair, backtrack to recover sequence. Efficient: O(N * |tags|^2) where N = sequence length.
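The dynamic program fits in a few lines. A sketch with log probabilities (the toy transition/emission tables are illustrative; a tiny floor constant stands in for proper smoothing):

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Viterbi decoding: best[i][t] = log probability of the best tag
    path ending in tag t at position i; back[i][t] = its predecessor.
    trans[(prev, t)] and emit[(t, word)] are probabilities."""
    FLOOR = 1e-12  # stand-in for smoothing of unseen events
    best = [{} for _ in words]
    back = [{} for _ in words]
    for t in tags:
        best[0][t] = (math.log(trans.get((start, t), FLOOR)) +
                      math.log(emit.get((t, words[0]), FLOOR)))
    for i in range(1, len(words)):
        for t in tags:
            score, arg = max(
                (best[i - 1][p] + math.log(trans.get((p, t), FLOOR)), p)
                for p in tags)
            best[i][t] = score + math.log(emit.get((t, words[i]), FLOOR))
            back[i][t] = arg
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])       # follow backpointers
    return list(reversed(path))

tags = ["DET", "NOUN", "VERB"]
trans = {("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.2,
         ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.8, ("VERB", "DET"): 0.5}
emit = {("DET", "the"): 0.9, ("NOUN", "dog"): 0.4,
        ("VERB", "barks"): 0.3, ("NOUN", "barks"): 0.01}
viterbi(["the", "dog", "barks"], tags, trans, emit)
```

The inner max over previous tags is what gives the N * |tags|^2 cost; log space avoids underflow on long sentences.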
Training
Maximum likelihood: count tag transitions and emissions in training data. Smoothing essential (rare transitions get very low probability). Typical: add-one smoothing or backoff.
Performance
HMM baseline: ~95-96% on English (Penn Treebank). Simple, efficient, competitive for many tasks. Limitations: no long-range dependencies (Markov assumption), independence of emissions.
Trigram HMM
Extend: P(t_i | t_{i-2}, t_{i-1}). More context, better performance (~97%). Increased complexity and data sparsity. Rare trigrams smoothed by backing off to bigram and unigram estimates.
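One common way to combine the three orders is linear interpolation rather than strict backoff. A sketch (the interpolation weights are illustrative; in practice they are tuned, e.g. by deleted interpolation, on held-out data):

```python
from collections import Counter

def trigram_estimator(tag_seqs, l3=0.6, l2=0.3, l1=0.1):
    """Interpolated trigram transition model:
    P(t3 | t1, t2) ~= l3*P_ML(t3|t1,t2) + l2*P_ML(t3|t2) + l1*P_ML(t3)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    hist1, hist2 = Counter(), Counter()
    total = 0
    for seq in tag_seqs:
        padded = ["<s>", "<s>"] + seq
        for i in range(2, len(padded)):
            t1, t2, t3 = padded[i - 2], padded[i - 1], padded[i]
            tri[(t1, t2, t3)] += 1; hist2[(t1, t2)] += 1
            bi[(t2, t3)] += 1;      hist1[t2] += 1
            uni[t3] += 1;           total += 1
    def p(t1, t2, t3):
        p3 = tri[(t1, t2, t3)] / hist2[(t1, t2)] if hist2[(t1, t2)] else 0.0
        p2 = bi[(t2, t3)] / hist1[t2] if hist1[t2] else 0.0
        p1 = uni[t3] / total
        return l3 * p3 + l2 * p2 + l1 * p1
    return p

p = trigram_estimator([["DET", "NOUN", "VERB"], ["DET", "NOUN", "NOUN"]])
# Unseen trigram histories still receive probability via the unigram term.
```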
Conditional Random Fields for POS
CRF vs. HMM
HMM generative: models P(W, T). CRF discriminative: models P(T | W). CRF conditions on observations, avoids modeling word distribution (unnecessary). More flexible feature engineering.
CRF Features
Transition features: t_{i-1} = VB, t_i = NN
Emission features: w_i = "quickly", t_i = ADV
Enriched features: w_i ends with -ly, t_i = ADV
Complex features: (w_{i-1} = "run", t_{i-1} = NN, t_i = VB)
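The feature templates above correspond to a feature function evaluated at each position. A minimal sketch (the feature names and templates are illustrative, not a fixed standard):

```python
def crf_features(words, i, prev_tag):
    """Feature function for a linear-chain CRF at position i:
    returns a set of binary feature names, mirroring the
    transition/emission/enriched templates listed above."""
    w = words[i].lower()
    feats = {
        f"word={w}",                               # emission feature
        f"prev_tag={prev_tag}",                    # transition feature
        f"suffix3={w[-3:]}",                       # enriched: suffix
        f"is_capitalized={words[i][0].isupper()}", # enriched: shape
    }
    if w.endswith("ly"):
        feats.add("ends_ly")                       # -ly -> likely adverb
    if i > 0:
        feats.add(f"prev_word={words[i - 1].lower()}")  # complex feature
    return feats

crf_features(["She", "runs", "quickly"], 2, "VERB")
```

Each such binary feature gets its own learned weight; this freedom to condition on arbitrary properties of the whole word sequence is exactly what the HMM's generative emission model lacks.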
Training and Inference
Training: maximize conditional likelihood (L-BFGS optimization, L2 regularization). Inference: Viterbi algorithm (same as HMM). CRF training is far slower than HMM's simple counting; inference is somewhat slower (more features to compute).
Performance
CRF: ~96.5-97% English. Better than HMM. More features allow finer modeling. Standard approach pre-neural era. Competitive today when engineered features provided.
Hybrid Systems
Practical systems: CRF backbone + hand-crafted rules (disambiguate rare cases). Rules override CRF for special cases (acronyms, numbers). Hybrid achieves 97%+ accuracy.
Neural Approaches
Bidirectional LSTM
Process word sequence both directions. At each position, hidden state h_i combines forward context (w_1...w_i) and backward context (w_i...w_n). Better than unidirectional.
BiLSTM:
Forward LSTM: w_1 → hf_1, w_2 → hf_2, ..., w_n → hf_n
Backward LSTM: w_n → hb_n, ..., w_1 → hb_1
Output: h_i = [hf_i ; hb_i] (concatenation)
Tagging: softmax(W * h_i)
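The bidirectional concatenation scheme can be shown with a tiny vanilla RNN (tanh recurrence) in place of an LSTM, to keep the sketch short. Untrained random weights, illustrative only:

```python
import numpy as np

def birnn_hidden_states(X, W_in, W_rec):
    """Run a vanilla tanh RNN over the sequence in both directions
    (sharing weights between directions for brevity) and return
    h_i = [hf_i ; hb_i] for each position."""
    def run(seq):
        h = np.zeros(W_rec.shape[0])
        states = []
        for x in seq:
            h = np.tanh(W_in @ x + W_rec @ h)
            states.append(h)
        return states
    fwd = run(X)                 # left-to-right pass
    bwd = run(X[::-1])[::-1]     # right-to-left pass, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
X = [rng.normal(size=4) for _ in range(3)]   # 3 word embeddings, dim 4
W_in = rng.normal(size=(5, 4))               # hidden dim 5
W_rec = rng.normal(size=(5, 5))
H = birnn_hidden_states(X, W_in, W_rec)      # each h_i has dim 10
```

After re-alignment, position i's backward state has seen words i..n, so the concatenated h_i summarizes the entire sentence around that word; the tagger then applies softmax(W * h_i) per position.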
Character Embeddings
Words encoded as character sequences. CNN or LSTM processes characters → word embedding. Captures morphology: "quickly" = "quick" + "ly". Handles OOV gracefully.
Multi-Task Learning
Joint training on multiple tasks (POS + chunking + NER). Shared representations learn linguistic structure. Improves performance especially for low-resource languages.
Transformer Models
BERT fine-tuned for POS achieves 97.5%+ English. Massive pre-training on unlabeled text provides strong initialization. Contextualized embeddings capture word sense from context.
Performance Comparison
| Model | English Accuracy | Speed | Data Required |
|---|---|---|---|
| Rule-Based | ~90% | Very fast | None (hand-crafted) |
| HMM | 95-96% | Fast | Minimal (counts) |
| CRF | 96-97% | Moderate | Modest (training) |
| BiLSTM | 97-97.5% | Slow | Moderate |
| BERT | 97.5%+ | Slow (inference) | Minimal (fine-tuning) |
Morphological Tagging and Inflection
Morphologically Rich Languages
Languages like Czech, Hungarian, Turkish, Arabic have complex inflection. Single word encodes multiple attributes (case, number, gender, tense, aspect, mood). Word form ambiguity high.
Czech example: "kniha" (book nominative singular), "knihy" (books nominative plural or genitive singular). POS tags include morphological features: NOUN;NOM;SING vs. NOUN;NOM;PL.
Morphological Analysis
Pre-process: decompose word into morphemes. "Unkindly" → "un-" + "kind" + "-ly". Increases feature space, helps models understand word structure.
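The decomposition can be sketched as greedy affix stripping. Real analyzers use lexicons and two-level rules; the affix lists here are a tiny illustrative subset:

```python
def split_morphemes(word,
                    prefixes=("un", "re", "dis"),
                    suffixes=("ly", "ing", "ed", "ness")):
    """Peel one known prefix and one known suffix off a word to expose
    the stem: 'unkindly' -> ['un-', 'kind', '-ly']. Length guards keep
    short words like 'red' from being split."""
    morphs, tail = [], []
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p) + 2:
            morphs.append(p + "-")
            word = word[len(p):]
            break
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s) + 2:
            tail.append("-" + s)
            word = word[:-len(s)]
            break
    return morphs + [word] + tail

split_morphemes("unkindly")
```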
Joint Tagging and Lemmatization
Jointly predict POS tag and lemma (base form). "Books" → tag: NOUN, lemma: "book". Richer output than POS alone, useful for downstream tasks.
Cross-Lingual Transfer
Train on high-resource language (English), test on low-resource (rare language). Challenges: different morphology, writing systems. mBERT enables cross-lingual: single model for 100+ languages.
Multilingual and Cross-Lingual POS
Universal Dependencies
Annotation scheme covering 100+ languages consistently. Universal POS tags (NOUN, VERB, etc.) + language-specific features. Enables cross-lingual comparison and zero-shot transfer.
Zero-Shot POS Tagging
Train on English, test on unseen language (e.g., Telugu). Transfer via shared embeddings. Performance degrades but non-trivial transfer occurs. mBERT enables this: 40-70% accuracy on zero-shot (vs. 95% with in-language data).
Multilingual Models
mBERT, XLM-R: pre-trained on 100+ languages jointly. Single model tags any language. Cross-lingual regularization: similar words in different languages share representations. Improves low-resource performance.
Language-Specific Challenges
- Agglutination (Turkish, Finnish): Single word multiple morphemes. Segmentation challenge.
- No spaces (Chinese, Japanese): Word segmentation first, then POS tagging.
- Rich case system (Czech, German): Many fine-grained POS tags.
- Morphological ambiguity: Same form multiple tags depending on context.
Applications
Syntactic Parsing
POS tags essential for parsing. Parsers assume correct tags. Tag errors propagate. "Bank/NN" vs. "Bank/VB" affects parse structure (noun vs. verb predicate).
Information Extraction
NER uses POS: capitalized noun likely person. Verb-object patterns identify relations: "acquire company" (acquirer=subject, company=object).
Speech Synthesis
Pronunciation depends on POS: "read" (present verb) pronounced /riːd/, "read" (past) pronounced /rɛd/. TTS systems use POS for correct pronunciation.
Lemmatization
Reducing to base form requires POS: "running" → "run" (verb form). "Running" alone ambiguous (noun or verb).
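The dependence on POS shows up directly in rule-based lemmatization: the same suffix rule fires only when the tag licenses it. A sketch (the exception table and rules are a tiny illustrative subset; real lemmatizers use full lexicons):

```python
def lemmatize(word, pos):
    """POS-aware lemmatization: the noun plural rule and the verb
    past-tense rule are gated by the tag."""
    exceptions = {("ran", "VERB"): "run", ("geese", "NOUN"): "goose"}
    w = word.lower()
    if (w, pos) in exceptions:
        return exceptions[(w, pos)]
    if pos == "NOUN" and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                       # books -> book
    if pos == "VERB" and w.endswith("ed") and len(w) > 4:
        return w[:-2]                       # walked -> walk
    return w

lemmatize("walked", "VERB")
```

Note how the tag changes the output: "walked" lemmatizes to "walk" as a VERB but is left untouched under any other tag, which is exactly why lemmatization without POS is ill-defined.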
Machine Translation
Word order in translation depends on POS. Noun phrases organized differently in SOV vs. SVO languages. Improves alignment.
Corpus Linguistics
Search by POS pattern: find all adjective-noun sequences, compare frequencies across languages. Understanding language structure, variation.
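POS-pattern search reduces to regex matching over the tag sequence. A minimal sketch (function name and toy corpus are illustrative):

```python
import re

def find_tag_pattern(tagged, pattern):
    """Find token spans whose tag sequence matches a regex, e.g. all
    adjective-noun pairs: match over the space-joined tag string and
    map character offsets back to token indices."""
    tagstr = " ".join(t for _, t in tagged)
    hits = []
    for m in re.finditer(pattern, tagstr):
        start = tagstr[:m.start()].count(" ")        # tokens before match
        end = start + m.group().count(" ") + 1       # tokens inside match
        hits.append([w for w, _ in tagged[start:end]])
    return hits

tagged = [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB"),
          ("a", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]
find_tag_pattern(tagged, r"\bADJ NOUN\b")
```

Corpus query tools generalize this idea, letting linguists search millions of tokens by tag pattern rather than by literal words.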
References
- Manning, C. D., and Schütze, H. "Foundations of Statistical Natural Language Processing." MIT Press, 1999.
- Brill, E. "A Simple Rule-Based Part of Speech Tagger." ANLC, 1992.
- Liang, P. "Semi-Supervised Learning for Natural Language." Master's thesis, MIT, 2005.
- Huang, Z., Xu, W., and Yu, K. "Bidirectional LSTM-CRF Models for Sequence Tagging." arXiv:1508.01991, 2015.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019.