Introduction
Named Entity Recognition (NER) identifies and classifies named entities (proper nouns referring to specific entities). Examples: persons (John, Obama), organizations (Google, WHO), locations (Paris, Nigeria), dates, times, monetary values, percentages. Foundational for information extraction: automatically structure unstructured text.
Applications: news article processing ("President Biden visited Paris"), clinical records ("Patient John Smith, diagnosed with diabetes"), biomedical research (gene/protein name extraction), resume parsing (identify candidate skills, experience), question answering (identify entities mentioned in documents to answer "Who is...?" queries).
NER is challenging: ambiguity ("Washington" could be a person or a location), nested entities (entity mentions inside other mentions), domain variation (medical entities differ from news entities), multilingual settings (different writing systems, morphology).
"Named entity recognition is the gateway to information extraction. Identify entities precisely, and you unlock structured knowledge from unstructured text." -- Ellen Riloff, NLP researcher
Task Definition and Entity Types
Entity Categories
| Category | Examples | Challenge |
|---|---|---|
| Person (PER) | John, Maria, President Obama | Titles (President), nicknames (Bob for Robert) |
| Organization (ORG) | Google, WHO, Apple Inc. | Acronyms (WHO), affixes (Inc., Ltd.) |
| Location (LOC) | Paris, Nigeria, Mount Everest | Nested (Paris is in France), common words (Mount) |
| Miscellaneous (MISC) | Monday, 5%, $1000, COVID-19 | Domain-dependent (what counts as entity) |
Fine-Grained vs. Coarse NER
Coarse: person, organization, location, miscellaneous (4 categories). Fine-grained: product, event, artifact, disease, drug (10+ categories). Fine-grained more useful but harder (few training examples per class).
Nested Entities
"New York University" could be: (ORG: "New York University") or nested (LOC: "New York", ORG: "University"). Standard BIO tagging can't represent nesting. Requires special schemes or multi-layer annotations.
Entity Boundaries
Where does entity start/end? "President Obama" includes title? "Apple Inc." includes "Inc."? Annotation guidelines necessary, but still subjective.
Tagging Schemes (BIO, BIOES, IOBES)
BIO Tagging
Each word tagged with entity type and position: B-TYPE (beginning), I-TYPE (inside), O (outside). Enables boundary detection and entity classification simultaneously.
Sentence: "John works at Google in Paris"
Words: John works at Google in Paris
Tags: B-PER O O B-ORG O B-LOC
Entities identified: John (PER), Google (ORG), Paris (LOC)
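The decoding step — turning a BIO tag sequence back into entity spans — can be sketched in a few lines (a minimal illustration, not from a specific library):

```python
def decode_bio(words, tags):
    """Return (entity_text, entity_type) pairs from parallel word/tag lists."""
    entities, current, etype = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current:                      # close any open entity first
                entities.append((" ".join(current), etype))
            current, etype = [word], tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            current.append(word)             # continue the open entity
        else:                                # "O" (or a stray I-) ends it
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

words = "John works at Google in Paris".split()
tags = ["B-PER", "O", "O", "B-ORG", "O", "B-LOC"]
print(decode_bio(words, tags))
# → [('John', 'PER'), ('Google', 'ORG'), ('Paris', 'LOC')]
```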
BIOES Tagging (IOBES)
Extended scheme: B (beginning), I (inside), O (outside), E (end), S (singleton). More explicit: E and S reduce ambiguity. "San Francisco" tagged as "San" (B-LOC), "Francisco" (E-LOC). Two-token entity clear.
"John Smith works at Google"
B-PER (John)
E-PER (Smith) [two-token person]
O (works, at)
S-ORG (Google) [single-token entity]
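BIOES tags are fully determined by the BIO tags plus one token of lookahead, so the conversion is mechanical. A small sketch (tag-name conventions assumed):

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES using one token of lookahead."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # B- stays B- only if the entity continues; else it is a singleton S-
            out.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        elif tag.startswith("I-"):
            # I- stays I- only if the entity continues; else it is the end E-
            out.append(tag if nxt == "I-" + tag[2:] else "E-" + tag[2:])
        else:
            out.append("O")
    return out

print(bio_to_bioes(["B-PER", "I-PER", "O", "O", "B-ORG"]))
# → ['B-PER', 'E-PER', 'O', 'O', 'S-ORG']
```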
IOBES for Disambiguation
BIO ambiguous: "I-PER" could be continuation or error. IOBES explicit: E marks entity end. Decoder knows entity boundaries exactly. Slightly improved performance (1-2 percentage points).
Pros and Cons
BIO: simpler, fewer tags, standard. BIOES: clearer boundaries, better error detection, slightly more tags. Most systems use BIO (standard, benchmarks use it). BIOES gaining adoption.
Sequence Labeling as Structured Prediction
Structured Prediction
Unlike classification (independent labels), NER predicts a label sequence with constraints. Valid sequences: B-PER followed by I-PER (same entity continues) or by O (entity ends). Invalid: I-PER following O (an inside tag cannot start an entity) or I-LOC following B-PER (type mismatch within one entity).
Sequence Models
Models capture label dependencies. Hidden Markov Model: generative, P(x, y) = product of emission and transition probabilities. CRF: discriminative sequence model. Neural: RNN, transformer.
Why Sequence Matters
Local prediction: "Google" could be company (ORG) or verb (O). Context helps: "works at Google" → ORG likely. Sequence models exploit dependencies: previous labels inform current decision.
Local (word-level): P(label_t | word_t, features)
Sequence: P(label_t | label_{t-1}, label_{t-2}, ..., word_t, context)
Feature Engineering for NER
Lexical Features
Word properties: capitalization (indicates named entity), word shape (number, letters, mixed), length, presence in gazetteers (lists of known entities).
Features for "Google":
- Capitalized: yes
- All caps: no
- Title case: yes
- In company list: yes
- In person list: no
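A feature extractor along these lines can be sketched in pure Python (the gazetteer names and the word-shape encoding are illustrative choices, not a standard):

```python
import re

def lexical_features(word, gazetteers):
    """Hand-built lexical features; gazetteers maps list name -> set of entries."""
    # Collapse character classes into a "shape": Google -> Xxxxxx, COVID-19 -> XXXXX-dd
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return {
        "capitalized": word[:1].isupper(),
        "all_caps": word.isupper(),
        "title_case": word.istitle(),
        "shape": shape,
        "length": len(word),
        **{f"in_{name}": word in entries for name, entries in gazetteers.items()},
    }

gaz = {"company_list": {"Google", "Apple"}, "person_list": {"John", "Maria"}}
print(lexical_features("Google", gaz))
```

In a linear model or CRF, each of these key/value pairs becomes a binary or real-valued feature with its own learned weight.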
Gazetteers
Lists of known entities: person names, company names, cities. Feature: word in gazetteer → likely entity. Gazetteer membership signals an entity but is no guarantee (ambiguity: "Bank" is both a company name and a common noun).
Contextual Features
Surrounding words: POS tags, word embeddings, syntactic context. "Works at X" pattern suggests X is location/organization. "Is a Y" suggests Y is profession/entity type.
Character-Level Features
Affixes: "Smith" ends with "-man" or "-son" (person name pattern). Morphology: "Googling" contains "Google".
External Knowledge
Wikipedia links: "John" article links to "John (name)". WordNet: synsets. Word2vec: semantic similarity (John near John, Mary near Jane).
Conditional Random Fields (CRF)
CRF Formulation
Probabilistic undirected graphical model. Scores label sequence y given observation x. Linear-chain CRF:
P(y | x) = (1/Z(x)) * exp(sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x, t))
f_k: feature functions (e.g., "word is Google and label is ORG")
lambda_k: learned weights
Z(x): normalization constant
Advantages Over HMM
CRF discriminative: models P(y|x) directly (vs. HMM generative P(x,y)). Integrates multiple features naturally. No independence assumptions. Feature weights learned jointly.
Feature Functions
Transition features: y_{t-1}=I-PER, y_t=B-LOC (invalid transition, low weight). Emission features: word="Google", y_t=B-ORG (high weight).
Training and Inference
Training: maximize likelihood with L2 regularization (prevent overfitting). Inference: Viterbi algorithm finds highest-scoring label sequence. O(T * |S|^2) where T = sequence length, |S| = state space.
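Viterbi decoding can be sketched in pure Python over log-scores (the tiny emission/transition tables below are invented for illustration; a real CRF learns them):

```python
def viterbi(emissions, transitions, labels):
    """Viterbi sketch: emissions[t][y] and transitions[(y_prev, y)] are
    log-scores; returns the highest-scoring label path in O(T * |S|^2)."""
    score = {y: emissions[0][y] for y in labels}
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda p: score[p] + transitions[(p, y)])
            new_score[y] = score[best_prev] + transitions[(best_prev, y)] + emissions[t][y]
            ptr[y] = best_prev
        score = new_score
        backptr.append(ptr)
    best = max(labels, key=lambda y: score[y])   # best final state
    path = [best]
    for ptr in reversed(backptr):                # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

labels = ["O", "B-PER", "I-PER"]
transitions = {(p, y): 0.0 for p in labels for y in labels}
transitions[("O", "I-PER")] = float("-inf")      # hard constraint: I- cannot follow O
emissions = [                                    # made-up log-scores for "John Smith works"
    {"O": 0.1, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 0.5, "B-PER": 0.4, "I-PER": 1.5},
    {"O": 2.0, "B-PER": 0.0, "I-PER": 0.2},
]
print(viterbi(emissions, transitions, labels))
# → ['B-PER', 'I-PER', 'O']
```

The -inf transition is how the CRF's transition weights enforce tagging-scheme constraints during decoding.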
Performance
CRF standard approach (pre-neural): ~90% F1 on CoNLL 2003 English NER. Still competitive with neural models when good features provided.
Neural NER Models
Feedforward Neural Network
Window-based: context words → embedding → concatenate → hidden layer → softmax. Simple: "John", "works" → predict label for "works". Window typically 3-5 words. Ignores long-range dependencies.
RNN for NER
Process word sequence sequentially. At each position, hidden state carries context. BiRNN (bidirectional): forward pass captures left context, backward captures right. Better than unidirectional.
BiLSTM:
Forward LSTM: word1 → h1_fwd, word2 → h2_fwd, ...
Backward LSTM: word_n → h_n_bwd, ..., word1 → h1_bwd
Combine: h_t = [h_t_fwd; h_t_bwd]
Classification: linear layer on h_t
Character-Level Embeddings
Words represented as character sequence. CNN or LSTM over characters produces word embedding. Captures morphology: "running" = "run" + "-ing". Handles OOV: rare words still have embedding (character composition).
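The OOV-handling idea can be illustrated with a toy fastText-style composition: hash character trigrams to pseudo-embeddings and average them. Everything here (dimension, hashing trick) is an invented stand-in for a learned character encoder:

```python
import hashlib

DIM = 8

def char_ngram_vector(ngram):
    """Deterministic pseudo-embedding for a character n-gram (hashing trick)."""
    h = hashlib.md5(ngram.encode()).digest()
    return [b / 255.0 - 0.5 for b in h[:DIM]]

def trigrams(word, n=3):
    padded = "<" + word + ">"                    # boundary markers
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_vector(word):
    """Compose a word vector from its trigrams; OOV words still get one."""
    vecs = [char_ngram_vector(g) for g in trigrams(word)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Morphologically related words share trigrams, hence similar vectors:
print(sorted(set(trigrams("Google")) & set(trigrams("Googling"))))
# → ['<Go', 'Goo', 'ogl', 'oog']
```

A learned CNN/LSTM encoder replaces the hash with trained parameters, but the payoff is the same: any character string maps to a vector.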
Word Embeddings
Pre-trained (Word2vec, GloVe) or learned. Pre-trained transfer: embeddings from large corpus help. Learned: task-specific, context-aware. Typical: combine pre-trained + character.
BiLSTM-CRF Architecture
Architecture
Input: word sequence [w1, w2, ..., wn]
Word embeddings: [e1, e2, ..., en]
BiLSTM: forward + backward → [h1, h2, ..., hn]
CRF layer: scores label sequences considering transitions
Output: y = argmax P(y | h)
BiLSTM Component
LSTM processes words, maintains hidden state. Bidirectional: captures context from both directions. At position t, h_t = [LSTM_forward(t); LSTM_backward(t)]. Combines past and future context.
CRF Component
Replaces simple softmax. Instead of independent label probabilities, scores entire sequence. Enforces consistency: B-PER → I-PER valid; O → I-PER invalid (an inside tag cannot start an entity).
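The hard constraint the CRF transition matrix encodes for BIO can be stated as a one-line predicate (a sketch of the rule, not library code):

```python
def is_valid_bio_transition(prev, curr):
    """BIO constraint: I-X may only follow B-X or I-X of the same type X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True  # O and B-X may follow any tag

print(is_valid_bio_transition("B-PER", "I-PER"))  # → True
print(is_valid_bio_transition("O", "I-PER"))      # → False
print(is_valid_bio_transition("B-PER", "I-LOC"))  # → False
```

In training, the CRF learns large negative weights for these transitions; some implementations additionally mask them to -inf at decode time.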
Decoding
Viterbi algorithm finds best label sequence. Unlike softmax (greedy label-by-label), Viterbi considers full sequence optimality.
Performance
BiLSTM-CRF state-of-the-art pre-transformer: ~90-91% F1 CoNLL 2003. Combines neural expressiveness (LSTM) with structured prediction (CRF). Standard baseline.
Transformer-Based NER
BERT for NER
Fine-tune pre-trained BERT. Feed the tokenized input to BERT → contextual representations. Linear classifier on top predicts label per token.
Input: [CLS] John works at Google [SEP]
BERT: contextual embeddings for each token
NER head: linear layer on each token embedding → label
Advantages Over BiLSTM
- Pre-training: BERT learned on massive corpus. Transfer learning powerful.
- Long-Range Context: Self-attention captures dependencies across the entire input. BiLSTMs' effective context degrades with distance in practice; transformers attend directly over the full sequence (typically up to 512 tokens for BERT).
- Parallelization: Transformers parallel (all tokens attend simultaneously). RNNs sequential. Faster training.
- Performance: 92-94% F1 (BERT-based) vs. 90-91% (BiLSTM-CRF). Significant improvement.
Challenges
Subword tokenization: transformers split rare words into subword pieces, e.g. "Washington" → "Wash", "##ington". An entity spanning multiple subword tokens requires merging predictions. Standard approach: the first subword's label is taken as the word's label.
Multilingual NER
mBERT, XLM-R pre-trained on 100+ languages. Single model handles multiple languages. Zero-shot: train on English, test on Korean (transfers). Cross-lingual transfer works!
Entity Linking and Disambiguation
NER vs. Entity Linking
NER: identify entity mentions in text. Entity Linking: link mentions to knowledge base entities (Wikipedia articles). "Paris" NER identifies mention. Entity Linking: which Paris (city, person, mythology)?
Entity Linking Process
1. NER identifies "Paris" as location
2. Candidate generation: find Wikipedia articles about Paris
- Paris, France (city)
- Paris, Texas (city)
- Paris (mythology, Trojan prince)
3. Ranking: context "visited Paris last summer" → Paris, France likely
4. Link: Wikipedia page ID
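The ranking step (3) can be caricatured as bag-of-words overlap between the mention's context and each candidate's description. All data here is invented; real linkers use entity embeddings, priors from link statistics, and coherence with other mentions:

```python
def rank_candidates(context, candidates):
    """Toy disambiguation: pick the candidate whose description shares the
    most words with the mention's context. candidates: name -> description."""
    ctx = set(context.lower().split())
    scored = [(len(ctx & set(desc.lower().split())), name)
              for name, desc in candidates.items()]
    return max(scored)[1]              # highest overlap wins

candidates = {
    "Paris, France": "capital city of France tourism summer travel",
    "Paris, Texas": "small city in Texas United States",
    "Paris (mythology)": "Trojan prince in Greek mythology",
}
print(rank_candidates("visited Paris last summer for travel", candidates))
# → 'Paris, France'
```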
Disambiguation
Multiple candidate entities. Resolve via context: surrounding entities, topic, common sense. "John visited Paris" → person John, location Paris less ambiguous than standalone "Paris".
Knowledge Bases
Wikipedia: comprehensive, open, freely available. Wikidata: structured knowledge (RDF). Google Knowledge Graph: proprietary, comprehensive. Choice affects link coverage and quality.
Applications
Question answering: "Who wrote Harry Potter?" → extract NER "J.K. Rowling" → link to Wikipedia → answer. Information extraction: extract (subject, relation, object) triples for knowledge base.
Challenges and Limitations
Domain Adaptation
Model trained on news entities (CoNLL 2003) fails on biomedical, social media. Different entity types, distribution. Few-shot transfer emerging but still challenging.
Nested Entities
Standard BIO can't represent nesting. "New York University" (ORG) contains the nested LOC "New York". Solutions: multiple tagging layers, boundary-based approaches (identify entity boundaries separately).
Rare Entities
New entities, misspellings, novel compounds. Embedding-based approaches help but still struggle. Data augmentation, pre-training on larger corpora help.
Cross-Lingual Transfer
mBERT enables zero-shot, but performance gaps exist. High-resource English models don't transfer perfectly to low-resource languages (different morphology, scripts).
Entity Ambiguity
"Apple" company or fruit? Context helps but sometimes insufficient. Requires external knowledge (gazetteer, Wikipedia links).
Annotation Quality
NER requires manual annotation. Inter-annotator agreement often 85-95% (disagreement even among humans). Subjective boundaries (should title be included?).
Applications
Information Extraction
Extract structured information: events (who, when, where), relations (person works at organization). Enables knowledge base construction.
Question Answering
Identify relevant entities answering question. "When was Einstein born?" → extract person entity, find birth date.
Biomedical NER
Extract drug names, diseases, genes from medical literature. BioNER more fine-grained (10+ entity types). Enables drug discovery, disease tracking.
Social Media Monitoring
Extract mentions of entities (brands, competitors, products). Track sentiment toward specific entities.
Resume Parsing
Extract names, companies, skills, dates from resumes. Populate structured candidate profiles.
Content Recommendation
Extract user interests (entities mentioned in browsing history). Recommend content on related entities.
References
- Nadeau, D., and Sekine, S. "A Survey of Named Entity Recognition and Classification." Lingvisticae Investigationes, vol. 30, 2007.
- Lafferty, J., McCallum, A., and Pereira, F. C. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." ICML, 2001.
- Huang, Z., Xu, W., and Yu, K. "Bidirectional LSTM-CRF Models for Sequence Tagging." arXiv:1508.01991, 2015.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers." NAACL, 2019.
- Tjong Kim Sang, E. F., and De Meulder, F. "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition." CoNLL, 2003.