Introduction
Tokenization is the process of breaking text into discrete tokens (words, subwords, or characters). Foundational step in the NLP pipeline: all downstream processing operates on token sequences. Tokenization quality affects model performance, vocabulary size, inference speed, and multilingual capability.
Problem: what constitutes a token? Is "New York" one token or two? "don't"? Contractions, punctuation, compound words, and hyphenation create ambiguity. Language differences complicate matters: English has spaces; Chinese doesn't.
Historical approaches: split on whitespace (naive), linguistic tokenization (complex, language-specific), subword tokenization (modern, vocabulary-efficient). Modern transformers (BERT, GPT) use subword tokenization enabling shared vocabulary across languages, handling rare words, reducing vocabulary size.
"Tokenization is often overlooked but critical: poor tokenization breaks downstream tasks. Modern subword approaches handle morphology and rare words elegantly." -- Sebastian Ruder, AI researcher
Word Tokenization
Whitespace Splitting
Simplest approach: split on spaces. "Hello world" → ["Hello", "world"]. Works for English with spaces. Fails on punctuation: "Hello, world!" → ["Hello,", "world!"] (comma attached to word).
Punctuation-Aware Tokenization
Separate punctuation from words. Rules: punctuation treated as separate tokens. "Hello, world!" → ["Hello", ",", "world", "!"]. Better for parsing but creates many tokens from few characters.
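A minimal sketch of this rule with Python's `re` module (the pattern here is illustrative, not a standard library tokenizer):

```python
import re

def tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # non-word, non-space character, so each punctuation mark becomes its own token
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Note the trade-off mentioned above: two words and two punctuation marks yield four tokens.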
Linguistic Tokenization
Use linguistic knowledge: part-of-speech, morphology. Penn Treebank tokenizer: follows linguistic rules (contractions handled: "don't" → ["do", "n't"]). Better linguistically but language and domain-specific.
Examples (Penn Treebank):
- "I'm happy" → ["I", "'m", "happy"]
- "It's a day's work" → ["It", "'s", "a", "day", "'s", "work"]
- "Let's go" → ["Let", "'s", "go"]
Regex-Based Tokenizers
Define patterns matching tokens. NLTK WordTokenizer combines rules: split on spaces, separate punctuation, handle contractions. Flexible but requires careful pattern design.
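A sketch of a regex-based tokenizer (the pattern is illustrative, not NLTK's actual rules); unlike the Penn Treebank convention, this design keeps contractions whole:

```python
import re

# Illustrative pattern: words may contain one internal apostrophe (don't, it's);
# everything else that is not whitespace becomes a single-character token
PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    return PATTERN.findall(text)

print(regex_tokenize("Don't panic, it's fine!"))
```

Changing the pattern changes the tokenization policy, which is exactly the "careful pattern design" burden noted above.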
Tokenization Challenges
Punctuation and Symbols
Attached vs. separate? "Hello," is punctuation part of word or separate? Convention: separate. But email addresses, URLs, decimal numbers complicate.
Contractions and Apostrophes
English contractions: "don't", "I'm", "can't", "it's". Linguistic split: separate verb from negation ("do" + "n't"). But loses morphological meaning (word form encodes information).
Compound Words
Hyphenated: "state-of-the-art", "mother-in-law". Split or keep? Context-dependent. "ill-prepared" semantically different from "ill prepared".
Contractions Across Languages
French "du" (de + le). Spanish "al" (a + el). Italian "del" (di + il). Linguistic clues help but require language-specific rules.
Language-Specific Issues
- Chinese, Japanese, Korean: No spaces. Require segmentation. "我喜欢自然语言处理" (I like NLP) could be segmented multiple ways.
- Arabic, Hebrew: Right-to-left scripts. Clitics attach to words. "والمكتبة" (wa + al + maktaba = and + the + library).
- Agglutinative languages (Finnish, Turkish, Korean): Single word = many morphemes. "Türkiye'nin" (Turkey's) is one word, many morphs.
Out-of-Vocabulary (OOV) Words
Tokenizer vocabulary finite. Rare words, misspellings, new words not in vocabulary. Traditional solution: OOV token. Modern: subword tokens capture rare words as sequence of common subwords.
Subword Tokenization Methods
Motivation
Word-level vocabularies large (50,000+ for language models). But rare words appear infrequently, wasting embedding parameters. Subword: morphemes, roots, affixes. "running" = "run" + "ning". Captures morphology without full vocabulary.
Character-Level Tokenization
Smallest tokens: individual characters. Tiny vocabulary (26 letters + digits + punctuation ≈ 100 tokens). Downsides: sequences become very long (each word = multiple character tokens), word-level structure is lost, harder for models to learn.
Character tokens: "hello" → ['h', 'e', 'l', 'l', 'o']
Word tokens: "hello" → ['hello']
Subword tokens: "hello" → ['hello'] or ['hel', '##lo'] depending on vocabulary
Morphological Segmentation
Use linguistic knowledge: identify morphemes. "unbelievable" → ["un", "believe", "able"]. Requires morphological analyzer (language-specific, complex). Works well for morphologically rich languages.
Statistical Subword Methods
Data-driven: learn subword units from corpus frequency. More common subwords (frequent bigrams) become units. Byte Pair Encoding, WordPiece, SentencePiece. Language-agnostic, learn from data.
Byte Pair Encoding (BPE)
Algorithm
Iteratively merge most common adjacent bytes/characters. Start: characters are tokens. Repeatedly: find most frequent pair, merge into new token. Continue until vocabulary size reached or no improvements.
Initial: v = {e, d, h, l, o, r, s, t, ...} [all characters]
Iteration 1:
  Count pairs: es=10, ed=5, he=3, ar=2, ...
  Most common: es, frequency=10
  Merge: create token 'es'; v = {e, d, h, l, o, r, s, t, es, ...}
Iteration 2:
  Count pairs: h+es=5, ar=2, ed=5, ... [new pairs after merge]
  Most common: h+es
  Merge: create 'hes'
...continues
Vocabulary Building
Typically 30,000-50,000 merge operations. Final vocabulary = original characters + learned merge operations. Example: final vocabulary might have ['un', 'believe', 'able', 'running', 'ly', ...]
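The training loop and the encoding step can be sketched together in plain Python (a toy implementation on a tiny corpus, in the spirit of Sennrich et al.; real implementations add end-of-word markers and many optimizations):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge operations from a {word: frequency} corpus."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def encode(word, merges):
    """Encode a word by applying the merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = list(next(iter(merge_pair(pair, {tuple(symbols): 1}))))
    return symbols

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, 10)
print(encode("lowest", merges))  # ['low', 'est']
```

Note how "lowest", which never appears in the training corpus, still encodes cleanly as a sequence of learned subwords.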
Encoding Process
Given a word, start from its characters and apply the learned merge operations in the order they were learned, until no more merges apply. (WordPiece-style tokenizers instead greedily match the longest subword in the vocabulary.)
Word: "undoing"
Apply merges in order learned:
"undoing" → 'un' in vocab → ['un', 'doing']
'doing' → 'do' in vocab → ['un', 'do', 'ing']
'ing' → already in vocab → ['un', 'do', 'ing']
Advantages
Language-agnostic (learns from data). Handles rare words gracefully (rare word = sequence of common subwords). Vocabulary more compact than word-level. Learned subwords interpretable (meaningful morphemes emerge).
Disadvantages
Not unique: multiple ways to encode word depending on merge order. Sensitive to training data distribution. Greedy encoding not optimal (locally greedy decisions not globally optimal).
WordPiece and SentencePiece
WordPiece (BERT, Google)
Similar to BPE but differs: selects merge based on likelihood increase, not frequency. Greedier: selects pairs maximizing likelihood of vocabulary given training data.
Instead of: highest frequency(pair)
Use: score(pair) = log freq(pair) - log freq(left) - log freq(right)
Selects pairs that most increase likelihood. Produces fewer, more meaningful subwords.
BERT tokenization: 30,000 token vocabulary. WordPiece tokens are marked with a ## prefix when not word-initial: "running" → ["run", "##ning"].
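The likelihood-based score can be made concrete (the frequencies below are toy numbers, assumed purely for illustration):

```python
import math

def wordpiece_score(pair_freq, left_freq, right_freq):
    # log freq(pair) - log freq(left) - log freq(right):
    # high when the parts co-occur far more often than chance
    return math.log(pair_freq) - math.log(left_freq) - math.log(right_freq)

# Same pair frequency, very different scores:
tight = wordpiece_score(9, 10, 11)    # parts almost always occur together
loose = wordpiece_score(9, 500, 400)  # parts are common everywhere
print(tight, loose)
```

Pure frequency would rank both pairs equally; the likelihood criterion prefers the first, which is why WordPiece tends to produce more meaningful subwords.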
SentencePiece (Google)
Language-agnostic: treats input as a raw stream of Unicode characters, with no pre-tokenization on spaces (handles CJK naturally). Learns both the segmentation model (BPE merges or a unigram LM) and the vocabulary directly from raw text.
Key difference: space is a regular character. "hello world" is treated as a single character sequence. SentencePiece handles all languages uniformly.
Input: "hello world"
After tokenization: ["hello", "▁world"] [▁ = space]
For CJK: "我喜欢" treated as a raw character stream; no space assumption needed
Comparison Table
| Method | Merge Criterion | Space Handling | Multilingual |
|---|---|---|---|
| BPE | Highest frequency | Explicit spaces | Good (language-agnostic) |
| WordPiece | Likelihood increase | Explicit spaces + ## prefix | Good |
| SentencePiece | BPE or unigram LM | Space as token (universal) | Excellent (no assumptions) |
Language-Specific Tokenization
Chinese and Japanese
No spaces. Require word segmentation: identifying word boundaries. Chinese: ambiguous (multiple valid segmentations). Jieba, ICTCLAS are popular segmenters. They leverage a dictionary + statistical models.
Japanese: three scripts (hiragana, katakana, kanji). MeCab tokenizer handles morphological analysis simultaneously.
Arabic
Right-to-left, complex morphology. Clitics attach to words. "والمكتبة" = wa (and) + al (the) + maktaba (library). Proper tokenization requires stemming (removing clitics/affixes).
German and Dutch
Productive compounding: "Donaudampfschifffahrtsgesellschaftskapitän" (single word, complex meaning). Split into meaningful components or keep whole? Context-dependent.
Multilingual Systems
Modern approaches: single tokenizer for multiple languages. SentencePiece unified vocabulary handles 100+ languages. Shared subword units enable zero-shot cross-lingual transfer.
Morphological Tokenization
Stemming
Remove affixes, reduce to root: "running", "runner", "runs" → "run". Heuristic rules (remove -ing, -er, -s). Fast but lossy (conflates different forms, occasional errors).
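The suffix-stripping idea can be sketched as a toy Python function (far simpler than the real Porter stemmer, and only meant to illustrate the heuristic, lossy nature of stemming):

```python
def naive_stem(word):
    # Toy suffix stripping: try a few common English suffixes, longest first
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # undouble a final consonant left behind by stripping: "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeious":
                stem = stem[:-1]
            return stem
    return word

for w in ("running", "runner", "runs"):
    print(naive_stem(w))  # all three print: run
```

The errors such rules make (e.g. over-stripping unrelated words) are exactly the "lossy" behavior described above.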
Porter stemmer (English):
- "running" → "run" (remove -ing, undouble -nn)
- "cries" → "cri" (-ies → -i)
- "confusion" → "confus" (remove -ion)
Lemmatization
Reduce to canonical form (lemma) using morphological analysis. "running" → "run" (verb, present participle). Accurate but requires morphological lexicon (language-specific, expensive).
Morphological Segmentation
Identify morpheme boundaries: "running" → "run" + "-ing", "unmistakable" → "un-" + "mistake" + "-able". Captures morphological structure. Valuable for morphologically rich languages (Turkish, Finnish, Arabic).
Special Tokens and Token Types
Special Tokens in Transformers
BERT uses special tokens: [CLS] (classification token, beginning), [SEP] (separator between sentences), [PAD] (padding to a fixed length), [UNK] (unknown word), [MASK] (masked for pre-training).
Example BERT tokenization:
Input: "Hello world"
Tokens: [CLS] Hello world [SEP]
IDs: [101] 7592 2088 [102] [token IDs]
For masked language modeling:
Input: "I [MASK] a dog"
Predict: a word such as "have" at the [MASK] position
Token Type IDs
Distinguish sentence pairs. Segment IDs: first sentence=0, second=1. Enables model to differentiate which token belongs to which sentence.
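One way to build segment IDs for a sentence pair (a sketch of the [CLS] A [SEP] B [SEP] convention, not any particular library's API):

```python
def build_bert_inputs(tokens_a, tokens_b):
    """Pack a sentence pair: [CLS] A [SEP] gets segment 0, B [SEP] gets segment 1."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

toks, segs = build_bert_inputs(["Hello", "world"], ["How", "are", "you"])
print(list(zip(toks, segs)))
```

The two lists stay aligned token-for-token, which is what lets the model tell which sentence each token belongs to.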
Custom Tokens
Add domain-specific tokens: medical domain adds [DIAG], [DRUG], [SYMPTOM]. Named entity tags become single tokens. Improves efficiency, enables fine-grained semantics.
Evaluating Tokenization
Vocabulary Size
Smaller is efficient (smaller embedding matrices, faster inference). Larger preserves words intact (fewer subword tokens per word). Trade-off: 30,000-50,000 typical balance.
Average Tokens per Word
Metric: (total tokens) / (total words). Ideal: close to 1 (words preserved). Reality: 1.3-1.5 typical (some words split). High ratio indicates too-aggressive subword splitting.
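The metric is easy to compute; here is a sketch using a hypothetical subword tokenizer (`toy_tokenize` is invented for illustration: it splits any word longer than five characters):

```python
def avg_tokens_per_word(sentences, tokenize):
    """(total subword tokens) / (total whitespace-separated words)."""
    words = [w for s in sentences for w in s.split()]
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)

def toy_tokenize(word):
    # hypothetical tokenizer: words over 5 chars split into two pieces
    return [word] if len(word) <= 5 else [word[:5], "##" + word[5:]]

print(avg_tokens_per_word(["the tokenizer splits longer words"], toy_tokenize))  # 1.6
```

A real evaluation would swap in the tokenizer under test and a representative corpus.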
Downstream Task Performance
Ultimately: does tokenization scheme help or hurt downstream tasks? Evaluate on classification, NER, translation. Different tasks may prefer different tokenization.
Linguistic Quality
Do learned subwords correspond to meaningful morphemes? Visual inspection, linguistic evaluation. SentencePiece typically learns interpretable units. Random subwords suggest poor tokenization.
Rare Word Handling
Can rare words be represented? Out-of-vocabulary rate for test set. Subword methods should have near-zero OOV (rare words = sequence of common subwords).
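OOV rate is a simple fraction (sketch with toy data):

```python
def oov_rate(test_words, vocab):
    """Fraction of test words that have no entry in the vocabulary."""
    missing = sum(1 for w in test_words if w not in vocab)
    return missing / len(test_words)

vocab = {"the", "cat", "sat"}
print(oov_rate(["the", "cat", "sat", "purred"], vocab))  # 0.25
```

For a subword tokenizer with character (or byte) fallback, every word decomposes into known units, so the same measurement should come out near zero.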
Applications and Implementation
Tokenization in Production
Modern NLP pipelines: pre-trained tokenizers (BERT, GPT, etc.) available. Use existing tokenizer matching pre-trained model. Tokenization must be consistent: training and inference use identical tokenizer.
Custom Tokenizers
Training custom tokenizer on domain-specific corpus. Medical NLP: BioBERT uses domain vocabulary. Code understanding: tokenizers handle special characters, operators, syntax.
Tokenizer Libraries
- NLTK: Classical tokenizers (word, sent, regex). Lightweight, useful for English.
- spaCy: Fast, language-aware. Handles multiple languages; tokenization runs alongside POS tagging.
- Hugging Face Transformers: Pre-trained tokenizers matching models. Simple API: tokenizer.encode(text)
- SentencePiece: Learn custom BPE/unigram tokenizer. Language-agnostic.
Practical Considerations
Tokenizer versioning: tokenizer updates break compatibility. Production systems must pin the tokenizer version. A mismatch between training and inference tokenizers causes silent failures (different tokens → different predictions).
References
- Sennrich, R., Haddow, B., and Birch, A. "Neural Machine Translation of Rare Words with Subword Units." ACL, 2016.
- Wu, Y., et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv:1609.08144, 2016.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019.
- Kudo, T., and Richardson, J. "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing." EMNLP, 2018.
- Bird, S., Klein, E., and Loper, E. "Natural Language Processing with Python." O'Reilly, 2009.