Introduction

Named Entity Recognition (NER) identifies and classifies named entities (proper nouns referring to specific entities). Examples: persons (John, Obama), organizations (Google, WHO), locations (Paris, Nigeria), dates, times, monetary values, percentages. Foundational for information extraction: automatically structure unstructured text.

Applications: news article processing ("President Biden visited Paris"), clinical records ("Patient John Smith, diagnosed with diabetes"), biomedical research (gene/protein name extraction), resume parsing (identify candidate skills, experience), question answering (identify entities mentioned in documents to answer "Who is...?" queries).

NER challenging: ambiguity (John could be first name or last name), nested entities (organizations within locations), domain variation (medical entities differ from news entities), multilingual (different writing systems, morphology).

"Named entity recognition is the gateway to information extraction. Identify entities precisely, and you unlock structured knowledge from unstructured text." -- Ellen Riloff, NLP researcher

Task Definition and Entity Types

Entity Categories

Category             | Examples                      | Challenge
Person (PER)         | John, Maria, President Obama  | Titles (President), nicknames (Bob for Robert)
Organization (ORG)   | Google, WHO, Apple Inc.       | Acronyms (WHO), affixes (Inc., Ltd.)
Location (LOC)       | Paris, Nigeria, Mount Everest | Nested (Paris is in France), common words (Mount)
Miscellaneous (MISC) | Monday, 5%, $1000, COVID-19   | Domain-dependent (what counts as an entity)

Fine-Grained vs. Coarse NER

Coarse: person, organization, location, miscellaneous (4 categories). Fine-grained: product, event, artifact, disease, drug (10+ categories). Fine-grained more useful but harder (few training examples per class).

Nested Entities

"New York University" could be a single flat entity (ORG: "New York University") or nested: the ORG "New York University" containing the LOC "New York". Standard BIO tagging assigns one tag per token, so it cannot represent nesting. Requires special schemes or multi-layer annotations.

Entity Boundaries

Where does entity start/end? "President Obama" includes title? "Apple Inc." includes "Inc."? Annotation guidelines necessary, but still subjective.

Tagging Schemes (BIO, BIOES, IOBES)

BIO Tagging

Each word tagged with entity type and position: B-TYPE (beginning), I-TYPE (inside), O (outside). Enables boundary detection and entity classification simultaneously.

Sentence: "John works at Google in Paris"
Words: John works at Google in Paris
Tags: B-PER O O B-ORG O B-LOC

Entities identified: John (PER), Google (ORG), Paris (LOC)
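The mapping from BIO tags back to entity spans can be sketched as a small decoder (a minimal illustration, not from any particular library):

```python
def bio_decode(words, tags):
    """Extract (entity_text, type) spans from BIO tags.

    B-TYPE starts a new entity; I-TYPE of the same type continues
    it; O (or a type change) closes the open entity."""
    entities, current, current_type = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(word)
        else:  # O, or an I- tag inconsistent with the open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:  # flush an entity that ends at the sentence boundary
        entities.append((" ".join(current), current_type))
    return entities

words = "John works at Google in Paris".split()
tags = ["B-PER", "O", "O", "B-ORG", "O", "B-LOC"]
print(bio_decode(words, tags))
# → [('John', 'PER'), ('Google', 'ORG'), ('Paris', 'LOC')]
```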

BIOES Tagging (IOBES)

Extended scheme: B (beginning), I (inside), O (outside), E (end), S (singleton). More explicit. E and S reduce ambiguity. "San Francisco" tokenized as "San" (B-LOC), "Francisco" (E-LOC). Two-token entity clear.

"John Smith works at Google"
B-PER (John)
E-PER (Smith) [two-token person]
O (works), O (at)
S-ORG (Google) [single-token entity]

IOBES for Disambiguation

BIO ambiguous: "I-PER" could be a valid continuation or a labeling error. IOBES explicit: E marks entity end, S marks singletons. Decoder knows entity boundaries exactly. Often yields a small F1 gain (commonly reported around a point).
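Converting BIO to BIOES is deterministic: a B- or I- tag whose successor does not continue the same entity becomes S- or E- respectively. A minimal sketch:

```python
def bio_to_bioes(tags):
    """Convert BIO tags to BIOES: a single-token entity becomes S-,
    and the last token of a multi-token entity becomes E-."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # entity continues iff the next tag is I- of the same type
            out.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        elif tag.startswith("I-"):
            out.append(tag if nxt == "I-" + tag[2:] else "E-" + tag[2:])
        else:
            out.append("O")
    return out

# "John Smith works at Google"
print(bio_to_bioes(["B-PER", "I-PER", "O", "O", "B-ORG"]))
# → ['B-PER', 'E-PER', 'O', 'O', 'S-ORG']
```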

Pros and Cons

BIO: simpler, fewer tags, standard. BIOES: clearer boundaries, better error detection, slightly more tags. Most systems use BIO (standard, benchmarks use it). BIOES gaining adoption.

Sequence Labeling as Structured Prediction

Structured Prediction

Unlike classification (independent labels), NER predicts a label sequence with constraints. Valid: B-PER followed by I-PER (entity continues) or O (entity ends). Invalid: I-PER without a preceding B-PER or I-PER (an entity cannot begin with an inside tag), or B-LOC followed by I-PER (type mismatch mid-entity).

Sequence Models

Models capture label dependencies. Hidden Markov Model: generative, P(x, y) = product over t of transition P(y_t | y_{t-1}) and emission P(x_t | y_t). CRF: discriminative sequence model, models P(y | x) directly. Neural: RNN, transformer.

Why Sequence Matters

Local prediction: "Google" could be company (ORG) or verb (O). Context helps: "works at Google" → ORG likely. Sequence models exploit dependencies: previous labels inform current decision.

Local (word-level): P(label_t | word_t, features)
Sequence: P(label_t | label_{t-1}, label_{t-2}, ..., word_t, context)

Feature Engineering for NER

Lexical Features

Word properties: capitalization (indicates named entity), word shape (number, letters, mixed), length, presence in gazetteers (lists of known entities).

Features for "Google":
- Capitalized: yes
- All caps: no
- Title case: yes
- In company list: yes
- In person list: no
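The feature list above can be sketched as a small extraction function. The gazetteer sets here are illustrative stand-ins, not real resources:

```python
def lexical_features(word, company_gazetteer, person_gazetteer):
    """Surface features of the kind commonly fed to a CRF for NER."""
    return {
        "capitalized": word[:1].isupper(),
        "all_caps": word.isupper(),
        "title_case": word.istitle(),
        # word shape: X = uppercase, x = lowercase, d = digit
        "shape": "".join(
            "X" if c.isupper() else "x" if c.islower() else
            "d" if c.isdigit() else c for c in word),
        "length": len(word),
        "in_company_list": word in company_gazetteer,
        "in_person_list": word in person_gazetteer,
    }

feats = lexical_features("Google", {"Google", "Apple"}, {"John", "Maria"})
print(feats["capitalized"], feats["shape"], feats["in_company_list"])
# → True Xxxxxx True
```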

Gazetteers

Lists of known entities: person names, company names, cities. Feature: word in gazetteer → likely entity. Gazetteer membership signals an entity but is no guarantee (ambiguity: "Bank" is both a company name and a common noun).

Contextual Features

Surrounding words: POS tags, word embeddings, syntactic context. "Works at X" pattern suggests X is location/organization. "Is a Y" suggests Y is profession/entity type.

Character-Level Features

Affixes: surnames ending in "-son" (Johnson) or "-man" (Freeman) follow person-name patterns. Morphology: "Googling" contains "Google".

External Knowledge

Wikipedia links: "John" article links to "John (name)". WordNet: synsets. Word2vec: semantic similarity (John near Mary, Paris near London).

Conditional Random Fields (CRF)

CRF Formulation

Probabilistic undirected graphical model. Scores label sequence y given observation x. Linear-chain CRF:

P(y | x) = (1/Z(x)) * exp(sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x, t))

f_k: feature functions (e.g., "word is Google and label is ORG")
lambda_k: learned weights
Z(x): normalization constant
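The formula can be made concrete at toy scale: with the weighted feature sums lambda_k * f_k collapsed into emission and transition score tables (all numbers invented for illustration), Z(x) can be computed by brute-force enumeration over label sequences, which is feasible only for tiny examples:

```python
from itertools import product
import math

# Toy linear-chain CRF over two positions ("John works").
labels = ["O", "B-PER"]
emit = [{"O": 0.1, "B-PER": 2.0},   # scores at position 0 ("John")
        {"O": 1.5, "B-PER": 0.2}]   # scores at position 1 ("works")
trans = {("O", "O"): 0.5, ("O", "B-PER"): 0.0,
         ("B-PER", "O"): 0.3, ("B-PER", "B-PER"): -1.0}

def score(y):
    """Unnormalized log-score: sum of emission and transition scores."""
    s = sum(emit[t][y[t]] for t in range(len(y)))
    return s + sum(trans[(y[t - 1], y[t])] for t in range(1, len(y)))

# Z(x): sum of exp(score) over every possible label sequence.
Z = sum(math.exp(score(y)) for y in product(labels, repeat=2))
p = math.exp(score(("B-PER", "O"))) / Z
print(round(p, 3))  # probability of tagging "John" B-PER, "works" O
```

In practice Z(x) is computed with the forward algorithm rather than enumeration, which is what makes linear-chain CRFs tractable.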

Advantages Over HMM

CRF discriminative: models P(y|x) directly (vs. HMM generative P(x,y)). Integrates multiple features naturally. No independence assumptions. Feature weights learned jointly.

Feature Functions

Transition features: y_{t-1}=B-PER, y_t=I-LOC (invalid transition, learns a low weight). Emission features: word="Google", y_t=B-ORG (high weight).

Training and Inference

Training: maximize likelihood with L2 regularization (prevent overfitting). Inference: Viterbi algorithm finds highest-scoring label sequence. O(T * |S|^2) where T = sequence length, |S| = state space.
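The Viterbi recursion above can be sketched in a few lines; the scores below are invented toy values, not learned weights:

```python
def viterbi(emissions, transitions, labels):
    """Exact decoding for a linear-chain model: find the
    highest-scoring label sequence in O(T * |S|^2).
    emissions: per-position {label: score} dicts;
    transitions: {(prev_label, label): score}."""
    # best[t][y] = (score of best path ending in y at step t, backpointer)
    best = [{y: (emissions[0][y], None) for y in labels}]
    for t in range(1, len(emissions)):
        row = {}
        for y in labels:
            prev, s = max(
                ((p, best[t - 1][p][0] + transitions[(p, y)]) for p in labels),
                key=lambda pair: pair[1])
            row[y] = (s + emissions[t][y], prev)
        best.append(row)
    # follow backpointers from the best final label
    y = max(labels, key=lambda l: best[-1][l][0])
    path = [y]
    for t in range(len(emissions) - 1, 0, -1):
        y = best[t][y][1]
        path.append(y)
    return path[::-1]

# Toy scores for "visit Paris" (all numbers invented)
labels = ["O", "B-LOC"]
emissions = [{"O": 2.0, "B-LOC": 0.0},   # "visit"
             {"O": 0.5, "B-LOC": 3.0}]   # "Paris"
trans = {("O", "O"): 0.2, ("O", "B-LOC"): 0.5,
         ("B-LOC", "O"): 0.1, ("B-LOC", "B-LOC"): -0.5}
print(viterbi(emissions, trans, labels))
# → ['O', 'B-LOC']
```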

Performance

CRF standard approach (pre-neural): ~90% F1 on CoNLL 2003 English NER. Still competitive with neural models when good features provided.

Neural NER Models

Feedforward Neural Network

Window-based: context words → embedding → concatenate → hidden layer → softmax. Simple: "John", "works" → predict label for "works". Window typically 3-5 words. Ignores long-range dependencies.

RNN for NER

Process word sequence sequentially. At each position, hidden state carries context. BiRNN (bidirectional): forward pass captures left context, backward captures right. Better than unidirectional.

BiLSTM:
Forward LSTM: word1 → h1_fwd, word2 → h2_fwd, ...
Backward LSTM: word_n → h_n_bwd, ..., word1 → h1_bwd
Combine: h_t = [h_t_fwd; h_t_bwd]
Classification: linear layer on h_t

Character-Level Embeddings

Words represented as character sequences. CNN or LSTM over characters produces a word embedding. Captures morphology: "running" decomposes into "run" + "-ing". Handles OOV: rare words still get an embedding (composed from characters).

Word Embeddings

Pre-trained (Word2vec, GloVe) or learned. Pre-trained transfer: embeddings from large corpus help. Learned: task-specific, context-aware. Typical: combine pre-trained + character.

BiLSTM-CRF Architecture

Architecture

Input: word sequence [w1, w2, ..., wn]
Word embeddings: [e1, e2, ..., en]
BiLSTM: forward + backward → [h1, h2, ..., hn]
CRF layer: scores label sequences considering transitions
Output: y = argmax P(y | h)

BiLSTM Component

LSTM processes words, maintains hidden state. Bidirectional: captures context from both directions. At position t, h_t = [LSTM_forward(t); LSTM_backward(t)]. Combines past and future context.

CRF Component

Replaces simple softmax. Instead of independent label probabilities, scores the entire sequence. Enforces consistency: B-PER → I-PER valid; O → I-PER invalid (an inside tag with no beginning).

Decoding

Viterbi algorithm finds best label sequence. Unlike softmax (greedy label-by-label), Viterbi considers full sequence optimality.

Performance

BiLSTM-CRF state-of-the-art pre-transformer: ~90-91% F1 CoNLL 2003. Combines neural expressiveness (LSTM) with structured prediction (CRF). Standard baseline.

Transformer-Based NER

BERT for NER

Fine-tune pre-trained BERT. Feed the tokenized input to BERT → contextual representations. Linear classifier on top predicts a label per token.

Input: [CLS] John works at Google [SEP]
BERT: contextual embeddings for each token
NER head: linear layer on each token embedding → label

Advantages Over BiLSTM

  • Pre-training: BERT learned on massive corpus. Transfer learning powerful.
  • Long-Range Context: Self-attention captures dependencies across entire sequence. BiLSTM effective range ~100-200 tokens; transformers ~500+.
  • Parallelization: Transformers parallel (all tokens attend simultaneously). RNNs sequential. Faster training.
  • Performance: 92-94% F1 (BERT-based) vs. 90-91% (BiLSTM-CRF). Significant improvement.

Challenges

Subword tokenization: WordPiece may split a rare name like "Johansson" into "Johan", "##sson". The entity label applies to the whole word, but the model predicts per subword token, so predictions must be aligned. Standard approach: the first subword token carries the word's label; continuation tokens are masked out of the loss.
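The first-subword convention can be sketched as a label-alignment step. The WordPiece split shown is hypothetical, and "X" stands in for whatever padding label is masked out of the loss:

```python
def first_subword_labels(word_labels, subword_groups):
    """Expand word-level labels to subword tokens: the first piece
    of each word gets the word's label, continuation pieces get a
    padding label ('X') that is ignored during training."""
    out = []
    for label, pieces in zip(word_labels, subword_groups):
        out.append(label)                    # first subword keeps the label
        out.extend("X" for _ in pieces[1:])  # continuations are masked
    return out

# Hypothetical WordPiece split of "Johansson works"
groups = [["Johan", "##sson"], ["works"]]
print(first_subword_labels(["B-PER", "O"], groups))
# → ['B-PER', 'X', 'O']
```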

Multilingual NER

mBERT, XLM-R pre-trained on 100+ languages. Single model handles multiple languages. Zero-shot: train on English, test on Korean (transfers). Cross-lingual transfer works!

Entity Linking and Disambiguation

NER vs. Entity Linking

NER: identify entity mentions in text. Entity Linking: link mentions to knowledge base entities (Wikipedia articles). "Paris" NER identifies mention. Entity Linking: which Paris (city, person, mythology)?

Entity Linking Process

1. NER identifies "Paris" as location
2. Candidate generation: find Wikipedia articles about Paris
 - Paris, France (city)
 - Paris, Texas (city)
 - Paris (mythology, Trojan prince)
3. Ranking: context "visited Paris last summer" → Paris, France likely
4. Link: Wikipedia page ID
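The ranking step can be caricatured as word overlap between the mention's context and each candidate's description. The candidate descriptions below are invented; real linkers use far richer signals (entity priors, embeddings, coherence with other mentions):

```python
def rank_candidates(context, candidates):
    """Toy disambiguation: pick the candidate whose description
    shares the most words with the mention context."""
    ctx = set(context.lower().split())
    scored = sorted(
        candidates.items(),
        key=lambda kv: len(ctx & set(kv[1].lower().split())),
        reverse=True)
    return scored[0][0]  # best-scoring candidate's name

candidates = {
    "Paris, France": "capital city of France, visited by tourists each summer",
    "Paris, Texas": "small city in Texas United States",
    "Paris (mythology)": "Trojan prince in Greek mythology",
}
print(rank_candidates("visited Paris last summer", candidates))
# → Paris, France
```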

Disambiguation

Multiple candidate entities. Resolve via context: surrounding entities, topic, common sense. "John visited Paris" → person John, location Paris less ambiguous than standalone "Paris".

Knowledge Bases

Wikipedia: comprehensive, open, freely available. Wikidata: structured knowledge (RDF). Google Knowledge Graph: proprietary, comprehensive. Choice affects link coverage and quality.

Applications

Question answering: "Who wrote Harry Potter?" → extract NER "J.K. Rowling" → link to Wikipedia → answer. Information extraction: extract (subject, relation, object) triples for knowledge base.

Challenges and Limitations

Domain Adaptation

Model trained on news entities (CoNLL 2003) fails on biomedical, social media. Different entity types, distribution. Few-shot transfer emerging but still challenging.

Nested Entities

Standard BIO can't represent nesting: the ORG "New York University" contains the LOC "New York", but each token gets only one tag. Solutions: multiple annotation layers, span-based or boundary-based approaches (identify entity spans separately, then classify them).

Rare Entities

New entities, misspellings, novel compounds. Embedding-based approaches help but still struggle. Data augmentation, pre-training on larger corpora help.

Cross-Lingual Transfer

mBERT enables zero-shot, but performance gaps exist. High-resource English models don't transfer perfectly to low-resource languages (different morphology, scripts).

Entity Ambiguity

"Apple" company or fruit? Context helps but sometimes insufficient. Requires external knowledge (gazetteer, Wikipedia links).

Annotation Quality

NER requires manual annotation. Inter-annotator agreement often 85-95% (disagreement even among humans). Subjective boundaries (should title be included?).

Applications

Information Extraction

Extract structured information: events (who, when, where), relations (person works at organization). Enables knowledge base construction.

Question Answering

Identify relevant entities answering question. "When was Einstein born?" → extract person entity, find birth date.

Biomedical NER

Extract drug names, diseases, genes from medical literature. BioNER more fine-grained (10+ entity types). Enables drug discovery, disease tracking.

Social Media Monitoring

Extract mentions of entities (brands, competitors, products). Track sentiment toward specific entities.

Resume Parsing

Extract names, companies, skills, dates from resumes. Populate structured candidate profiles.

Content Recommendation

Extract user interests (entities mentioned in browsing history). Recommend content on related entities.

References

  • Nadeau, D., and Sekine, S. "A Survey of Named Entity Recognition and Classification." Lingvisticae Investigationes, vol. 30, 2007.
  • Lafferty, J., McCallum, A., and Pereira, F. C. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." ICML, 2001.
  • Huang, Z., Xu, W., and Yu, K. "Bidirectional LSTM-CRF Models for Sequence Tagging." arXiv:1508.01991, 2015.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers." NAACL, 2019.
  • Tjong Kim Sang, E. F., and De Meulder, F. "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition." CoNLL, 2003.