Introduction
Tokenization is the process of breaking text into discrete tokens (words, subwords, or characters). Foundational step in the NLP pipeline: all downstream processing operates on token sequences. Tokenization quality affects model performance, vocabulary size, inference speed, and multilingual capability.
Problem: what constitutes a token? Is "New York" one token or two? "don't"? Contractions, punctuation, compound words, and hyphenation create ambiguity. Language differences complicate matters: English has spaces; Chinese doesn't.
Historical approaches: split on whitespace (naive), linguistic tokenization (complex, language-specific), subword tokenization (modern, vocabulary-efficient). Modern transformers (BERT, GPT) use subword tokenization enabling shared vocabulary across languages, handling rare words, reducing vocabulary size.
"Tokenization is often overlooked but critical: poor tokenization breaks downstream tasks. Modern subword approaches handle morphology and rare words elegantly." -- Sebastian Ruder, AI researcher
Word Tokenization
Whitespace Splitting
Simplest approach: split on spaces. "Hello world" → ["Hello", "world"]. Works for English with spaces. Fails on punctuation: "Hello, world!" → ["Hello,", "world!"] (comma attached to word).
Punctuation-Aware Tokenization
Separate punctuation from words. Rules: punctuation treated as separate tokens. "Hello, world!" → ["Hello", ",", "world", "!"]. Better for parsing but creates many tokens from few characters.
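A minimal sketch of this rule with Python's `re` module (the pattern here is illustrative, not a standard library tokenizer):

```python
import re

def tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # non-word, non-space character, so each punctuation mark becomes its own token
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Note the trade-off mentioned above: two words and two punctuation marks yield four tokens.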
Linguistic Tokenization
Use linguistic knowledge: part-of-speech, morphology. Penn Treebank tokenizer: follows linguistic rules (contractions handled: "don't" → ["do", "n't"]). Better linguistically but language and domain-specific.
Examples (Penn Treebank):
- "I'm happy" → ["I", "'m", "happy"]
- "It's a day's work" → ["It", "'s", "a", "day", "'s", "work"]
- "Let's go" → ["Let", "'s", "go"]
Regex-Based Tokenizers
Define patterns matching tokens. NLTK WordTokenizer combines rules: split on spaces, separate punctuation, handle contractions. Flexible but requires careful pattern design.
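A sketch of a regex-based tokenizer (the pattern is illustrative, not NLTK's actual rules); unlike the Penn Treebank convention, this design keeps contractions whole:

```python
import re

# Illustrative pattern: words may contain one internal apostrophe (don't, it's);
# everything else that is not whitespace becomes a single-character token
PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    return PATTERN.findall(text)

print(regex_tokenize("Don't panic, it's fine!"))
```

Changing the pattern changes the tokenization policy, which is exactly the "careful pattern design" burden noted above.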
Tokenization Challenges
Punctuation and Symbols
Attached vs. separate? "Hello," is punctuation part of word or separate? Convention: separate. But email addresses, URLs, decimal numbers complicate.
Contractions and Apostrophes
English contractions: "don't", "I'm", "can't", "it's". Linguistic split: separate verb from negation ("do" + "n't"). But loses morphological meaning (word form encodes information).
Compound Words
Hyphenated: "state-of-the-art", "mother-in-law". Split or keep? Context-dependent. "ill-prepared" semantically different from "ill prepared".
Contractions Across Languages
French "du" (de + le). Spanish "al" (a + el). Italian "del" (di + il). Linguistic clues help but require language-specific rules.
Language-Specific Issues
- Chinese, Japanese, Korean: No spaces. Require segmentation. "我喜欢自然语言处理" (I like NLP) could be segmented multiple ways.
- Arabic, Hebrew: Right-to-left scripts. Clitics attach to words. "والمكتبة" (wa + al + maktaba = and + the + library).
- Agglutinative languages (Finnish, Turkish, Korean): Single word = many morphemes. "Türkiye'nin" (Turkey's) is one word, many morphs.
Out-of-Vocabulary (OOV) Words
Tokenizer vocabulary finite. Rare words, misspellings, new words not in vocabulary. Traditional solution: OOV token. Modern: subword tokens capture rare words as sequence of common subwords.
Subword Tokenization Methods
Motivation
Word-level vocabularies large (50,000+ for language models). But rare words appear infrequently, wasting embedding parameters. Subword: morphemes, roots, affixes. "running" = "run" + "ning". Captures morphology without full vocabulary.
Character-Level Tokenization
Smallest tokens: individual characters. Tiny vocabulary (26 letters + digits + punctuation ≈ 100 tokens). Downsides: sequences become very long (each word = multiple character tokens), word-level structure is lost, harder for models to learn.
Character tokens: "hello" → ['h', 'e', 'l', 'l', 'o']
Word tokens: "hello" → ['hello']
Subword tokens: "hello" → ['hello'] or ['hel', '##lo'] depending on vocabulary
Morphological Segmentation
Use linguistic knowledge: identify morphemes. "unbelievable" → ["un", "believe", "able"]. Requires morphological analyzer (language-specific, complex). Works well for morphologically rich languages.
Statistical Subword Methods
Data-driven: learn subword units from corpus frequency. More common subwords (frequent bigrams) become units. Byte Pair Encoding, WordPiece, SentencePiece. Language-agnostic, learn from data.
Byte Pair Encoding (BPE)
Algorithm
Iteratively merge most common adjacent bytes/characters. Start: characters are tokens. Repeatedly: find most frequent pair, merge into new token. Continue until vocabulary size reached or no improvements.
Initial: v = {e, d, h, l, o, r, s, t, ...} [all characters]
Iteration 1:
  Count pairs: es=10, ed=5, he=3, ar=2, ...
  Most common: es, frequency=10
  Merge: create token 'es'; v = {e, d, h, l, o, r, s, t, es, ...}
Iteration 2:
  Count pairs: h+es=5, ar=2, ed=5, ... [new pairs after merge]
  Most common: h+es
  Merge: create 'hes'
...continues
Vocabulary Building
Typically 30,000-50,000 merge operations. Final vocabulary = original characters + learned merge operations. Example: final vocabulary might have ['un', 'believe', 'able', 'running', 'ly', ...]
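The training loop and the encoding step can be sketched together in plain Python (a toy implementation on a tiny corpus, in the spirit of Sennrich et al.; real implementations add end-of-word markers and many optimizations):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge operations from a {word: frequency} corpus."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def encode(word, merges):
    """Encode a word by applying the merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = list(next(iter(merge_pair(pair, {tuple(symbols): 1}))))
    return symbols

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, 10)
print(encode("lowest", merges))  # ['low', 'est']
```

Note how "lowest", which never appears in the training corpus, still encodes cleanly as a sequence of learned subwords.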
Encoding Process
Given a word, start from its characters and apply the learned merge operations in the order they were learned, until no more merges apply. (WordPiece-style tokenizers instead greedily match the longest subword in the vocabulary.)
Word: "undoing"
Apply merges in order learned:
"undoing" → 'un' in vocab → ['un', 'doing']
'doing' → 'do' in vocab → ['un', 'do', 'ing']
'ing' → already in vocab → ['un', 'do', 'ing']
Advantages
Language-agnostic (learns from data). Handles rare words gracefully (rare word = sequence of common subwords). Vocabulary more compact than word-level. Learned subwords interpretable (meaningful morphemes emerge).
Disadvantages
Not unique: multiple ways to encode word depending on merge order. Sensitive to training data distribution. Greedy encoding not optimal (locally greedy decisions not globally optimal).
WordPiece and SentencePiece
WordPiece (BERT, Google)
Similar to BPE but differs: selects merge based on likelihood increase, not frequency. Greedier: selects pairs maximizing likelihood of vocabulary given training data.
Instead of: highest frequency(pair)
Use: score(pair) = log freq(pair) - log freq(left) - log freq(right)
Selects pairs that most increase likelihood. Produces fewer, more meaningful subwords.
BERT tokenization: 30,000 token vocabulary. WordPiece tokens are marked with a ## prefix when not word-initial: "running" → ["run", "##ning"].
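The likelihood-based score can be made concrete (the frequencies below are toy numbers, assumed purely for illustration):

```python
import math

def wordpiece_score(pair_freq, left_freq, right_freq):
    # log freq(pair) - log freq(left) - log freq(right):
    # high when the parts co-occur far more often than chance
    return math.log(pair_freq) - math.log(left_freq) - math.log(right_freq)

# Same pair frequency, very different scores:
tight = wordpiece_score(9, 10, 11)    # parts almost always occur together
loose = wordpiece_score(9, 500, 400)  # parts are common everywhere
print(tight, loose)
```

Pure frequency would rank both pairs equally; the likelihood criterion prefers the first, which is why WordPiece tends to produce more meaningful subwords.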
SentencePiece (Google)
Language-agnostic: treats input as a raw stream of Unicode characters, with no pre-tokenization on spaces (handles CJK naturally). Learns both the segmentation model (BPE merges or a unigram LM) and the vocabulary directly from raw text.
Key difference: space is a regular character. "hello world" is treated as a single character sequence. SentencePiece handles all languages uniformly.
Input: "hello world"
After tokenization: ["hello", "▁world"] [▁ = space]
For CJK: "我喜欢" treated as a raw character stream; no space assumption needed
Comparison Table
| Method | Merge Criterion | Space Handling | Multilingual |
|---|---|---|---|
| BPE | Highest frequency | Explicit spaces | Good (language-agnostic) |
| WordPiece | Likelihood increase | Explicit spaces + ## prefix | Good |
| SentencePiece | BPE or unigram LM | Space as token (universal) | Excellent (no assumptions) |
Language-Specific Tokenization
Chinese and Japanese
No spaces. Require word segmentation: identifying word boundaries. Chinese: ambiguous (multiple valid segmentations). Jieba, ICTCLAS are popular segmenters. They leverage a dictionary + statistical models.
Japanese: three scripts (hiragana, katakana, kanji). MeCab tokenizer handles morphological analysis simultaneously.
Arabic
Right-to-left, complex morphology. Clitics attach to words. "والمكتبة" = wa (and) + al (the) + maktaba (library). Proper tokenization requires stemming (removing clitics/affixes).
German and Dutch
Productive compounding: "Donaudampfschifffahrtsgesellschaftskapitän" (single word, complex meaning). Split into meaningful components or keep whole? Context-dependent.
Multilingual Systems
Modern approaches: single tokenizer for multiple languages. SentencePiece unified vocabulary handles 100+ languages. Shared subword units enable zero-shot cross-lingual transfer.
Morphological Tokenization
Stemming
Remove affixes, reduce to root: "running", "runner", "runs" → "run". Heuristic rules (remove -ing, -er, -s). Fast but lossy (conflates different forms, occasional errors).
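The suffix-stripping idea can be sketched as a toy Python function (far simpler than the real Porter stemmer, and only meant to illustrate the heuristic, lossy nature of stemming):

```python
def naive_stem(word):
    # Toy suffix stripping: try a few common English suffixes, longest first
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # undouble a final consonant left behind by stripping: "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeious":
                stem = stem[:-1]
            return stem
    return word

for w in ("running", "runner", "runs"):
    print(naive_stem(w))  # all three print: run
```

The errors such rules make (e.g. over-stripping unrelated words) are exactly the "lossy" behavior described above.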
Porter stemmer (English):
- "running" → "run" (remove -ing, undouble -nn)
- "cries" → "cri" (-ies → -i)
- "confusion" → "confus" (remove -ion)
Lemmatization
Reduce to canonical form (lemma) using morphological analysis. "running" → "run" (verb, present participle). Accurate but requires morphological lexicon (language-specific, expensive).
Morphological Segmentation
Identify morpheme boundaries: "running" → "run" + "-ing", "unmistakable" → "un-" + "mistake" + "-able". Captures morphological structure. Valuable for morphologically rich languages (Turkish, Finnish, Arabic).
Special Tokens and Token Types
Special Tokens in Transformers
BERT uses special tokens: [CLS] (classification token, beginning), [SEP] (separator between sentences), [PAD] (padding to a fixed length), [UNK] (unknown word), [MASK] (masked for pre-training).
Example BERT tokenization:
Input: "Hello world"
Tokens: [CLS] Hello world [SEP]
IDs: [101] 7592 2088 [102] [token IDs]
For masked language modeling:
Input: "I [MASK] a dog"
Predict: a word such as "have" at the [MASK] position
Token Type IDs
Distinguish sentence pairs. Segment IDs: first sentence=0, second=1. Enables model to differentiate which token belongs to which sentence.
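One way to build segment IDs for a sentence pair (a sketch of the [CLS] A [SEP] B [SEP] convention, not any particular library's API):

```python
def build_bert_inputs(tokens_a, tokens_b):
    """Pack a sentence pair: [CLS] A [SEP] gets segment 0, B [SEP] gets segment 1."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

toks, segs = build_bert_inputs(["Hello", "world"], ["How", "are", "you"])
print(list(zip(toks, segs)))
```

The two lists stay aligned token-for-token, which is what lets the model tell which sentence each token belongs to.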
Custom Tokens
Add domain-specific tokens: medical domain adds [DIAG], [DRUG], [SYMPTOM]. Named entity tags become single tokens. Improves efficiency, enables fine-grained semantics.
Evaluating Tokenization
Vocabulary Size
Smaller is efficient (smaller embedding matrices, faster inference). Larger preserves words intact (fewer subword tokens per word). Trade-off: 30,000-50,000 typical balance.
Average Tokens per Word
Metric: (total tokens) / (total words). Ideal: close to 1 (words preserved). Reality: 1.3-1.5 typical (some words split). High ratio indicates too-aggressive subword splitting.
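The metric is easy to compute; here is a sketch using a hypothetical subword tokenizer (`toy_tokenize` is invented for illustration: it splits any word longer than five characters):

```python
def avg_tokens_per_word(sentences, tokenize):
    """(total subword tokens) / (total whitespace-separated words)."""
    words = [w for s in sentences for w in s.split()]
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)

def toy_tokenize(word):
    # hypothetical tokenizer: words over 5 chars split into two pieces
    return [word] if len(word) <= 5 else [word[:5], "##" + word[5:]]

print(avg_tokens_per_word(["the tokenizer splits longer words"], toy_tokenize))  # 1.6
```

A real evaluation would swap in the tokenizer under test and a representative corpus.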
Downstream Task Performance
Ultimately: does tokenization scheme help or hurt downstream tasks? Evaluate on classification, NER, translation. Different tasks may prefer different tokenization.
Linguistic Quality
Do learned subwords correspond to meaningful morphemes? Visual inspection, linguistic evaluation. SentencePiece typically learns interpretable units. Random subwords suggest poor tokenization.
Rare Word Handling
Can rare words be represented? Out-of-vocabulary rate for test set. Subword methods should have near-zero OOV (rare words = sequence of common subwords).
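OOV rate is a simple fraction (sketch with toy data):

```python
def oov_rate(test_words, vocab):
    """Fraction of test words that have no entry in the vocabulary."""
    missing = sum(1 for w in test_words if w not in vocab)
    return missing / len(test_words)

vocab = {"the", "cat", "sat"}
print(oov_rate(["the", "cat", "sat", "purred"], vocab))  # 0.25
```

For a subword tokenizer with character (or byte) fallback, every word decomposes into known units, so the same measurement should come out near zero.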
Applications and Implementation
Tokenization in Production
Modern NLP pipelines: pre-trained tokenizers (BERT, GPT, etc.) available. Use existing tokenizer matching pre-trained model. Tokenization must be consistent: training and inference use identical tokenizer.
Custom Tokenizers
Training custom tokenizer on domain-specific corpus. Medical NLP: BioBERT uses domain vocabulary. Code understanding: tokenizers handle special characters, operators, syntax.
Tokenizer Libraries
- NLTK: Classical tokenizers (word, sent, regex). Lightweight, useful for English.
- spaCy: Fast, language-aware. Handles multiple languages; tokenization runs alongside POS tagging.
- Hugging Face Transformers: Pre-trained tokenizers matching models. Simple API: tokenizer.encode(text)
- SentencePiece: Learn custom BPE/unigram tokenizer. Language-agnostic.
Practical Considerations
Tokenizer versioning: tokenizer updates break compatibility. Production systems must pin the tokenizer version. A mismatch between training and inference tokenizers causes silent failures (different tokens → different predictions).
References
- Sennrich, R., Haddow, B., and Birch, A. "Neural Machine Translation of Rare Words with Subword Units." ACL, 2016.
- Wu, Y., et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv:1609.08144, 2016.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019.
- Kudo, T., and Richardson, J. "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing." EMNLP, 2018.
- Bird, S., Klein, E., and Loper, E. "Natural Language Processing with Python." O'Reilly, 2009.