Introduction
B-tree: self-balancing multi-way tree (Bayer and McCreight, 1972). Property: every node has multiple keys and children (not binary). Disk-optimized: minimizes disk I/O. Use: database indexes, file systems. Advantage: N-way branching reduces tree height. Cost: node complexity.
"B-trees revolutionized database design by recognizing that disk I/O dominates computation. By storing multiple keys per node, they minimize tree height and disk accesses, enabling efficient indexing of massive datasets." -- Database systems
B-Tree Definition
Formal Definition
B-tree of order m: tree where every node has at most m children. Keys: at most m-1 keys per node. Balanced: all leaves at same depth. Sorted: keys ordered (left < parent < right).
Key Distinction from Binary Trees
Binary: 2 children maximum. B-tree: m children maximum (hundreds possible). Keys per node: 1 key (binary), m-1 keys (B-tree). Branching: exponentially reduces height.
Example: B-Tree Order 3
Order m=3: max 2 keys, max 3 children.

      [15 | 40]
     /    |    \
 [5 10] [20 25] [...]

Internal node: 2 keys (15, 40) and 3 pointers. Children: keys < 15, keys between 15 and 40, keys > 40.
Comparison to Binary Tree
Binary: height for 1 million nodes = ~20 (log2(10^6)). B-tree (order 512): height = ~3 (log512(10^6)). Disk reads: binary ~20, B-tree ~3 (order-of-magnitude difference).
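The height gap can be checked with a quick calculation. This is a sketch; `min_height` is a hypothetical helper that ignores occupancy details and counts only ideal fan-out:

```python
import math

def min_height(n, branching):
    # Levels needed to index n keys when every node fans out `branching` ways.
    return math.ceil(math.log(n + 1, branching))

print(min_height(1_000_000, 2))    # binary tree: 20 levels
print(min_height(1_000_000, 512))  # B-tree of order 512: 3 levels
```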
B-Tree Properties
Core Properties
1. Every node has at most m children. 2. Every internal node (except root) has at least ceil(m/2) children. 3. Root has at least 2 children (if not leaf). 4. All leaves at same depth. 5. Data in leaves or distributed (variant-dependent).
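The five properties can be expressed as an invariant checker. A minimal sketch, assuming a hypothetical `Node` class with sorted `keys` and a `children` list (empty for leaves); it is illustrative, not exhaustive:

```python
import math

class Node:
    # Hypothetical node: sorted `keys`, `children` empty for leaves.
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def check(node, m, is_root=True, depth=0, leaf_depths=None):
    # Recursively verify the core order-m invariants listed above.
    if leaf_depths is None:
        leaf_depths = set()
    assert len(node.keys) <= m - 1              # at most m children / m-1 keys
    assert node.keys == sorted(node.keys)       # keys kept in order
    if node.children:
        assert len(node.children) == len(node.keys) + 1
        minimum = 2 if is_root else math.ceil(m / 2)
        assert len(node.children) >= minimum    # minimum occupancy
        for child in node.children:
            check(child, m, False, depth + 1, leaf_depths)
    else:
        leaf_depths.add(depth)
    if is_root:
        assert len(leaf_depths) == 1            # all leaves at same depth
    return True

root = Node([15, 40], [Node([5, 10]), Node([20, 25]), Node([50])])
print(check(root, 3))  # True
```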
Property Implications
Balanced: height always O(log n). Minimum occupancy: ceil(m/2) (prevents sparse trees). Dense packing: efficient space utilization. Order selection: balances node size vs tree depth.
Leaf vs Internal Nodes
Different variants: data in every node (B-trees) or only in leaves (B+ trees). B-tree: data anywhere. B+ tree: data in leaves, internal nodes for indexing only.
Tree Structure and Nodes
Internal Node Structure
Node with k keys:
[ptr0, key1, ptr1, key2, ptr2, ..., keyk, ptrk]
Example (k=2):
[ptr0, 15, ptr1, 40, ptr2]
Pointers: ptr0 (keys < 15), ptr1 (keys between 15 and 40), ptr2 (keys > 40). Keys: ordered (key_i < key_{i+1}).
Node Splitting Threshold
Full node: m children, m-1 keys. Inserting into a full node triggers a split. Minimum occupancy: at least ceil(m/2) children per node (root exempt). Underflow (below minimum): merge or rebalance.
Depth Analysis
Minimum keys per non-root node: ceil(m/2) - 1, so fan-out below the root is at least t = ceil(m/2). A tree of height h therefore holds at least 2t^h - 1 keys, and for n keys: height <= log_t((n+1)/2). Low height: few disk accesses.
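The worst-case bound can be evaluated directly. A sketch assuming the standard minimum-degree analysis (every non-root node at least half full):

```python
import math

def max_height(n, m):
    # Worst-case height (in edges) for n keys with minimum degree t = ceil(m/2):
    # every non-root node is at least half full, so h <= log_t((n + 1) / 2).
    t = math.ceil(m / 2)
    return math.log((n + 1) / 2, t)

print(max_height(10**9, 512))  # about 3.6: at most 3 levels below the root
```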
Example Levels
Order m=512 (node sized to a typical disk block):
Min children per node: 256. Max children: 512. Max keys per node: 511.
Level 0 (root): up to 511 keys
Level 1: up to 512 * 511 ≈ 262K keys
Level 2: up to 512^2 * 511 ≈ 134M keys
Level 3: up to 512^3 * 511 ≈ 68B keys
Typical database: needs only 3-4 levels (disk accesses).
Search Operations
Search Algorithm
search(node, key):
  find position i where keys[i-1] < key <= keys[i]
  if key == keys[i]: found
  if node is leaf: key not found
  else: search(children[i], key)  -- recursively search child
Complexity
Comparisons per node: O(log m) (binary search on keys). Nodes visited: O(log n) (tree height). Total: O(log m * log n). In practice: dominated by disk I/O (few disk accesses).
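The pseudocode above can be sketched in Python. `Node` is a hypothetical in-memory layout (sorted `keys`, `children` empty for leaves); the binary search within a node uses `bisect`:

```python
import bisect

class Node:
    # Hypothetical node layout: sorted `keys`, `children` empty for leaves.
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def search(node, key):
    # Binary search inside the node, then descend into the matching child.
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return True            # found in this node
    if not node.children:
        return False           # leaf reached: key absent
    return search(node.children[i], key)

root = Node([15, 40], [Node([5, 10]), Node([20, 25]), Node([50, 60])])
print(search(root, 25), search(root, 7))  # True False
```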
Range Search
Find all keys in range [a, b]. Start: find the first key >= a. Iterate: continue the in-order traversal (or follow leaf sibling pointers in a B+ tree). Cost: O(log n + k) where k = number of results (optimal).
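For a plain B-tree (no leaf sibling pointers), the range scan is a pruned in-order traversal. A sketch with a hypothetical `Node` class; subtrees whose key interval cannot intersect [a, b] are skipped:

```python
class Node:
    # Hypothetical node layout: sorted `keys`, `children` empty for leaves.
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def range_query(node, a, b, out=None):
    # Visit child i only if the separators around it allow keys in [a, b].
    if out is None:
        out = []
    n = len(node.keys)
    for i in range(n + 1):
        lo = node.keys[i - 1] if i > 0 else None
        hi = node.keys[i] if i < n else None
        if node.children and (lo is None or lo <= b) and (hi is None or hi >= a):
            range_query(node.children[i], a, b, out)
        if i < n and a <= node.keys[i] <= b:
            out.append(node.keys[i])
    return out

root = Node([15, 40], [Node([5, 10]), Node([20, 25]), Node([50, 60])])
print(range_query(root, 10, 25))  # [10, 15, 20, 25]
```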
Key Lookup Optimization
Cache nodes: frequently accessed nodes in memory. Prefetch: load sibling nodes. Index caching: root typically cached (frequent access).
Insertion Algorithm
Basic Algorithm
1. Search for leaf where key belongs
2. Insert key in correct position (if not duplicate)
3. If leaf now exceeds m-1 keys (overfull):
   - Split leaf into two
   - Move middle key up to parent
4. If parent becomes overfull: recursively split parent
5. If root splits: create new root
Example Insertion
Insert 17 (order 4, max 3 keys per node) into:
    [10 | 20]
   /    |    \
[5 8] [12 15] [25 30]
Find leaf for 17: goes to [12 15]. Insert 17: [12 15 17].
After insertion:
    [10 | 20]
   /    |    \
[5 8] [12 15 17] [25 30]
Split Operation
Insert 18: leaf becomes [12 15 17 18], which is overfull (4 keys, max 3): SPLIT.
Before: [12 15 17 18]
Middle: 15
After: [12] and [17 18]
Move 15 up to parent.
Parent [10 20] becomes [10 15 20] with 4 children.
Cascading Splits
If parent becomes full: split parent. If grandparent becomes full: split grandparent. Cascades up to root. Creates new root if needed (tree grows by 1 level).
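The five insertion steps can be sketched in Python (hypothetical `Node` class, order m = 4 to match the examples, bottom-up splitting; a sketch, not a production implementation):

```python
import bisect

MAX_KEYS = 3  # order m = 4: at most m - 1 = 3 keys per node (assumption)

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty for leaves

def insert(root, key):
    # Insert key; split overfull nodes bottom-up. Returns the (possibly new) root.
    path = []
    node = root
    while node.children:                        # step 1: descend to the leaf
        i = bisect.bisect_left(node.keys, key)
        path.append((node, i))
        node = node.children[i]
    bisect.insort(node.keys, key)               # step 2: insert into the leaf
    while len(node.keys) > MAX_KEYS:            # steps 3-4: split, push median up
        mid = len(node.keys) // 2
        median = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        if path:
            parent, i = path.pop()
            parent.keys.insert(i, median)
            parent.children.insert(i + 1, right)
            node = parent
        else:                                   # step 5: root split grows the tree
            return Node([median], [node, right])
    return root

root = Node()
for k in [10, 20, 5, 15, 17, 25]:
    root = insert(root, k)
print(root.keys, [c.keys for c in root.children])  # [15] [[5, 10], [17, 20, 25]]
```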
Deletion Algorithm
Deletion Cases
Case 1: Key in leaf - delete directly (if no underflow). Case 2: Key in internal node - replace with predecessor/successor, delete from leaf. Case 3: After deletion, node underflows - rebalance (borrow or merge).
Underflow Handling
Node has fewer than ceil(m/2)-1 keys: underflow.
Solution 1: Borrow from sibling - if a sibling has a spare key, rotate it through the parent.
Solution 2: Merge with sibling - combine the two nodes, pulling the separator key down from the parent.
Example Deletion
Delete 15 from leaf [10 15 20] (order 4, max 3 keys).
After deletion: [10 20] (2 keys).
Check: min keys = ceil(4/2) - 1 = 1.
Status: [10 20] still valid (at least 1 key).
Cascading Merges
If parent underflows after pulling a key down: rebalance or merge the parent. Cascades up to root. Root may have as few as 1 key (special case). If the root becomes empty: its single child becomes the new root, and the tree shrinks by 1 level.
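The borrow-or-merge decision can be sketched in Python. This is a leaf-level sketch with hypothetical names (`Node`, `fix_underflow`); moving child pointers for internal nodes is omitted:

```python
import math

class Node:
    # Hypothetical node: sorted `keys`; the parent holds the child list.
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def fix_underflow(parent, i, m):
    # Repair parent.children[i] after it drops below ceil(m/2) - 1 keys:
    # borrow through the parent if a sibling has a spare key, else merge.
    # (Leaf-only sketch: child pointers of internal nodes are not moved.)
    min_keys = math.ceil(m / 2) - 1
    node = parent.children[i]
    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None
    if left and len(left.keys) > min_keys:      # rotate from left sibling
        node.keys.insert(0, parent.keys[i - 1])
        parent.keys[i - 1] = left.keys.pop()
    elif right and len(right.keys) > min_keys:  # rotate from right sibling
        node.keys.append(parent.keys[i])
        parent.keys[i] = right.keys.pop(0)
    elif left:                                  # merge into left sibling
        left.keys += [parent.keys.pop(i - 1)] + node.keys
        parent.children.pop(i)
    else:                                       # merge with right sibling
        node.keys += [parent.keys.pop(i)] + right.keys
        parent.children.pop(i + 1)

parent = Node([20], [Node([]), Node([30, 40])])
fix_underflow(parent, 0, 4)
print(parent.keys, [c.keys for c in parent.children])  # [30] [[20], [40]]
```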
Complexity
Find: O(log n). Remove key from node: O(m) work within the node. Rebalancing: at most O(log n) borrows/merges along the path. Total: O(log n) node accesses.
Node Splitting
Split Procedure
Split an overfull node:
1. Find the middle key (lower or upper median)
2. Create a new node with the upper half of the keys (and children)
3. Move the middle key up to the parent
4. Parent insertion may trigger another split
Example: Order 5
Full node (5 children, 4 keys):
[10 | 20 | 30 | 40]
with 5 child pointers.
Split (middle = 30):
Left: [10 20]  Right: [40]
Move 30 to parent.
Parent gains a new child:
   ... 30 ...
      /  \
 [10 20] [40]
Balancing Property
After split: both nodes have roughly m/2 children. Maintains balance property. Min/max constraints respected.
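The key partitioning itself reduces to picking a median, as in this sketch (upper median, matching the order-5 example; `split` is a hypothetical helper for the key list only):

```python
def split(keys):
    # Partition a full key list around its (upper) median.
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

print(split([10, 20, 30, 40]))  # ([10, 20], 30, [40]): 30 moves up to the parent
```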
Node Merging
Merge Procedure
Handle an underfull node:
1. Check if a sibling has a spare key (can borrow)
   If yes: move a key through the parent (rotation)
   If no: merge with the sibling:
2. Pull the separator key down from the parent
3. Combine keys and children from both nodes
4. Delete the emptied node
Example Rotation
Node [10] (underflow) with right sibling [30 40]; parent separator: [20].
Borrow from right sibling:
Left: [10 20] (20 moved down from parent)
Right: [40] (30 moved up to parent)
Parent: [30] (updated separator)
Example Merge
Node [10] and sibling [30] (both minimal); parent separator: [20].
Merge:
Combined: [10 20 30] (20 pulled down from parent)
Parent loses 20: may become empty/underflow.
Cascading Effect
Parent underflow: triggers a borrow or merge at the parent level. Cascades up to the root. Root special case: if the root is left with no keys, its single remaining child becomes the new root.
Order Parameter (m)
Order Selection
Order m determines: node capacity, tree shape, performance. Typical: m = (disk block size / entry size). Example: block 4KB, entry 8 bytes -> m = 512.
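The order estimate can be made slightly more precise by accounting for both keys and child pointers. A rough sketch; the 4 KB block and 4-byte key/pointer sizes are assumptions for illustration:

```python
def choose_order(block_size, key_size, ptr_size):
    # Rough estimate: fit m child pointers and m - 1 keys into one block,
    # i.e. m * ptr_size + (m - 1) * key_size <= block_size.
    return (block_size + key_size) // (key_size + ptr_size)

print(choose_order(4096, 4, 4))  # 512: one node per 4 KB block
```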
Effects of Order
| Aspect | Small Order (m=3) | Large Order (m=512) |
|---|---|---|
| Tree height | Large (many levels) | Small (few levels) |
| Node size | Small (2 keys) | Large (511 keys) |
| Comparisons/node | Few (linear) | More (binary search) |
| Disk I/O | High (many accesses) | Low (few accesses) |
| Best for | Memory-based (RAM) | Disk-based (databases) |
Optimal Order
Minimize cost = tree height + node search cost. For disk: high order (minimize height, I/O dominates). For memory: lower order (minimize comparisons, CPU matters).
Time and Space Complexity
Time Complexity
| Operation | Complexity | Notes |
|---|---|---|
| Search | O(log n) | Tree height |
| Insert | O(log n) | Find + split |
| Delete | O(log n) | Find + merge |
| Range query | O(log n + k) | k = results |
Space Complexity
Storage: O(n) for n keys. Occupancy: 50% minimum (ceil(m/2) children). Effective storage: between 50%-100% full.
Practical Disk Analysis
Disk access dominates. For 1 billion keys with order 512: height ~4 levels = 4 disk reads. Memory reads: negligible compared to disk. B-trees designed specifically to minimize disk I/O.
Real-World Applications
Database Indexes
Primary use: relational databases (MySQL, PostgreSQL). Index: quickly locate rows. B+ tree variant: maintains sorted order, efficient range scans. Billions of records: practical with 3-4 tree levels.
File Systems
NTFS, ext4, HFS+: use B-trees for metadata. Fast directory lookup: few disk accesses. Efficient tree operations: insertion/deletion balanced.
Key-Value Stores
LSM trees (Bigtable, LevelDB): a write-optimized alternative to B-trees. Sorted runs (SSTables): efficient range queries. Compaction: periodically merges runs into larger sorted files.
Search Engines
Inverted indexes: map search terms to documents. Term dictionaries: often stored in B-tree variants for sorted lookup and prefix queries. Index size: petabytes managed with B-tree structures.
NoSQL Databases
MongoDB, CouchDB: B-tree indexes for document lookup. Multi-key indexes: compound keys (multiple fields). Efficient: optimized for disk-based storage.
References
- Bayer, R., and McCreight, E. "Organization and Maintenance of Large Ordered Indexes." Acta Informatica, vol. 1, 1972, pp. 173-189.
- Knuth, D. E. "The Art of Computer Programming: Sorting and Searching." Vol. 3, Addison-Wesley, 2nd edition, 1998.
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. "Introduction to Algorithms." MIT Press, 3rd edition, 2009.
- Comer, D. "The Ubiquitous B-Tree." ACM Computing Surveys, vol. 11, no. 2, 1979, pp. 121-137.
- Ramakrishnan, R., and Gehrke, J. "Database Management Systems." McGraw-Hill, 3rd edition, 2003.