Introduction
Document databases: store, retrieve, query documents (JSON/BSON objects). Flexible schema: documents can vary structure. Rich querying: more powerful than key-value, simpler than relational.
Examples: MongoDB, CouchDB, Firebase (Cloud Firestore), DynamoDB (key-value store that also supports a document model). Popular because: developer-friendly (JSON maps onto native objects in most languages), flexible (schema evolution is easy), powerful queries (filter, aggregate, limited joins).
Trade-offs: less structure than relational (schema flexibility has a cost), slower joins (denormalization is often used instead), often eventual consistency in distributed deployments. In return, development velocity is high.
"Document databases enable rapid development through flexible schemas and rich querying. Bridge key-value simplicity and relational power, suitable for modern application development." -- NoSQL databases
Document Data Model
Document Concept
Document: self-contained data unit (JSON object). Fields: key-value pairs. Values: strings, numbers, arrays, nested objects, dates, binary. Hierarchical: documents contain other documents.
Collections
Similar documents grouped in collections. Like relational tables, but schema-less. Documents in collection vary structure. Flexible: add/remove fields per document.
Example Structure
Collection: users
{
  "_id": 1001,
  "name": "Alice",
  "age": 30,
  "email": "alice@example.com",
  "address": {
    "street": "123 Main",
    "city": "NYC"
  },
  "hobbies": ["reading", "hiking"]
}
Same collection, different structure possible (no email, address, or age):
{
  "_id": 1002,
  "name": "Bob",
  "phone": "555-1234"
}
Primary Key (_id)
Unique identifier, system-generated (ObjectId) or provided by the application. Every document must have one. Indexed automatically.
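As a minimal sketch of the `_id` rule, the in-memory `Collection` class below generates an identifier when none is supplied and rejects duplicates. The class and its counter-based ids are hypothetical stand-ins for illustration, not a MongoDB API or a real ObjectId.

```javascript
// Hypothetical in-memory collection; illustrates the _id rule only.
class Collection {
  constructor() {
    this.docs = new Map(); // _id -> document, plays the role of the automatic _id index
    this.nextId = 1;       // stand-in for ObjectId generation
  }

  insert(doc) {
    // Use the caller's _id if provided, otherwise generate one.
    const _id = doc._id !== undefined ? doc._id : `auto-${this.nextId++}`;
    if (this.docs.has(_id)) {
      throw new Error(`duplicate _id: ${_id}`);
    }
    this.docs.set(_id, { ...doc, _id });
    return _id;
  }

  findById(_id) {
    return this.docs.has(_id) ? this.docs.get(_id) : null;
  }
}

const users = new Collection();
const aliceId = users.insert({ _id: 1001, name: "Alice", age: 30 });
const bobId = users.insert({ name: "Bob", phone: "555-1234" }); // no _id given
```

Inserting a second document with `_id: 1001` throws, which mirrors the uniqueness guarantee described above.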
Advantages
Schema-less: evolve freely. Nested: fewer joins (denormalize). Natural: matches programming objects. Flexible: documents differ.
JSON and BSON Formats
JSON (JavaScript Object Notation)
Human-readable format: key-value pairs, arrays, nested objects. Text-based: easy to read/debug. Standard: widely supported.
BSON (Binary JSON)
Binary format: more efficient storage/processing. MongoDB uses: faster than text JSON. Includes types: dates, binary data, ObjectId. Conversion automatic (JSON <-> BSON).
Data Types Supported
String, number (int, float), boolean, null, array, object (nested), date, binary, ObjectId, timestamp. Richer than plain JSON, which has no date or binary type.
Size Overhead
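BSON is larger than a minimal binary encoding, and sometimes larger than the equivalent JSON text, because each field stores its name, a type tag, and length information. Trade-off: fast traversal and flexibility for size. Compression reduces overhead.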
Interoperability
JSON standard: many languages support. Seamless: retrieve document from DB, use in application without conversion.
Flexible Schemas
Schema-Less Advantage
No schema enforcement: add fields without migration. Documents vary structure. Supports rapid prototyping, evolving requirements.
Schema Validation (Optional)
Databases support optional validation: enforce structure if desired. Best of both: flexible when needed, strict when desired.
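Optional validation can be sketched as a function that checks only the fields it has rules for and lets everything else through. The `makeValidator` helper and its rule format are hypothetical and much simpler than a real database validator.

```javascript
// Hypothetical validator: rules map a field name to { type, required }.
function makeValidator(rules) {
  return function validate(doc) {
    const errors = [];
    for (const [field, rule] of Object.entries(rules)) {
      const value = doc[field];
      if (value === undefined) {
        if (rule.required) errors.push(`${field} is required`);
        continue; // optional and absent: nothing to check
      }
      if (typeof value !== rule.type) {
        errors.push(`${field} must be a ${rule.type}`);
      }
    }
    return errors; // an empty array means the document is accepted
  };
}

const validateUser = makeValidator({
  name: { type: "string", required: true },
  age: { type: "number", required: false },
});

// Fields without a rule (phone) are still allowed: flexible where no rule exists.
const okErrors = validateUser({ name: "Bob", phone: "555-1234" });
// Strict where a rule exists: name is missing and age has the wrong type.
const badErrors = validateUser({ age: "thirty" });
```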
Migration Patterns
Add new field: existing documents simply lack it (a missing field, which application code should treat like null). Gradual migration: new inserts have the field, old documents are updated when read. No downtime.
Renaming Fields
Add new field, copy data, delete old (or keep both). Flexible: handle gradually, no bulk migration required.
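The gradual approach can be sketched as a migrate-on-read step. The `schemaVersion` marker and the rename from `phone` to `phoneNumber` are assumptions made for this illustration.

```javascript
// Hypothetical lazy migration: upgrade a document the first time it is read.
const CURRENT_VERSION = 2;

function migrateOnRead(doc) {
  if ((doc.schemaVersion || 1) >= CURRENT_VERSION) {
    return { doc, changed: false }; // already up to date
  }
  const { phone, ...rest } = doc; // leave the old field name behind
  const upgraded = { ...rest, schemaVersion: CURRENT_VERSION };
  if (phone !== undefined) {
    upgraded.phoneNumber = phone; // copy the data under the new name
  }
  return { doc: upgraded, changed: true }; // caller writes it back if changed
}

const oldDoc = { _id: 1002, name: "Bob", phone: "555-1234" };
const first = migrateOnRead(oldDoc);
const second = migrateOnRead(first.doc); // a second read is a no-op
```

Because the function returns a new object, the stored document only changes when the caller writes the upgraded copy back, so old and new shapes can coexist during the rollout.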
Risks
Too flexible: inconsistent data, confusing. Governance: document schema (informally), validate, evolve carefully.
Querying and Filtering
Query Syntax
MongoDB: `db.users.find({age: {$gt: 30}})` (find users with age > 30). CouchDB: MapReduce views or Mango queries (JSON selector syntax). Rich: filter by any field.
Comparison Operators
$eq, $ne, $gt, $gte, $lt, $lte, $in, $nin. Logical: $and, $or, $not. Flexible: complex conditions.
Array Queries
Query array elements: `{hobbies: "reading"}` matches documents with reading in hobbies. $all, $elemMatch: advanced array queries.
Nested Document Queries
Query nested fields: `{"address.city": "NYC"}`. Dot notation: access nested values directly.
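The matching rules described in this section (comparison operators, array membership, dot notation) can be sketched as a small in-memory matcher. It is a simplified illustration, not MongoDB's query engine: the `getPath` and `matches` helpers are hypothetical, only a handful of operators are covered, and the third sample user is invented for the example.

```javascript
// Resolve a dotted path such as "address.city" against a document.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

// Supported operators, each taking (fieldValue, operand).
const OPERATORS = {
  $eq: (v, x) => v === x,
  $ne: (v, x) => v !== x,
  $gt: (v, x) => v > x,
  $gte: (v, x) => v >= x,
  $lt: (v, x) => v < x,
  $lte: (v, x) => v <= x,
  $in: (v, xs) => xs.includes(v),
  $nin: (v, xs) => !xs.includes(v),
};

function matches(doc, query) {
  return Object.entries(query).every(([path, condition]) => {
    const value = getPath(doc, path);
    const isOperatorObject =
      condition !== null && typeof condition === "object" && !Array.isArray(condition);
    if (isOperatorObject) {
      // Every operator in the condition must hold, e.g. {$gte: 18, $lt: 65}.
      return Object.entries(condition).every(([op, operand]) =>
        OPERATORS[op](value, operand)
      );
    }
    // Plain value: an array field matches if it contains the value.
    return Array.isArray(value) ? value.includes(condition) : value === condition;
  });
}

const users = [
  { _id: 1001, name: "Alice", age: 30, address: { city: "NYC" }, hobbies: ["reading", "hiking"] },
  { _id: 1002, name: "Bob", phone: "555-1234" },
  { _id: 1003, name: "Carol", age: 41, address: { city: "Boston" }, hobbies: ["chess"] },
];

const olderThan30 = users.filter((u) => matches(u, { age: { $gt: 30 } }));
const inNyc = users.filter((u) => matches(u, { "address.city": "NYC" }));
const readers = users.filter((u) => matches(u, { hobbies: "reading" }));
```

Documents that lack a queried field, such as Bob's missing `age`, simply fail the comparison instead of raising an error.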
Projection
Select fields: `find({...}, {name: 1, age: 1})` returns only name, age. Reduce data transferred.
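An inclusion projection can be sketched as copying only the requested fields. The `project` helper is hypothetical; it keeps `_id` because MongoDB returns `_id` by default unless it is explicitly excluded.

```javascript
// Hypothetical inclusion-only projection: keep _id plus the fields marked 1.
function project(doc, fields) {
  const result = { _id: doc._id };
  for (const [name, include] of Object.entries(fields)) {
    if (include === 1 && name in doc) {
      result[name] = doc[name];
    }
  }
  return result;
}

const alice = {
  _id: 1001,
  name: "Alice",
  age: 30,
  email: "alice@example.com",
  hobbies: ["reading", "hiking"],
};

const slim = project(alice, { name: 1, age: 1 });
```

Here `slim` carries only `_id`, `name`, and `age`, so the email and hobbies never leave the database layer.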
Indexing Strategies
Index Types
Single field: fast equality/range queries. Compound: multiple fields. Text: full-text search. Geospatial: location-based. TTL: auto-delete expired documents.
Performance Impact
Without index: scan all documents (slow). With index: direct lookup (fast). Trade-off: faster queries, slower inserts (update index), extra storage.
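The difference between a scan and an index lookup can be made concrete with a toy example. The `buildIndex` and `scan` helpers are hypothetical, and a `Map` stands in for the B-tree a real database would use.

```javascript
// Hypothetical single-field index: field value -> array of matching documents.
function buildIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  return index;
}

// Full scan: examines every document and counts how many it touched.
function scan(docs, field, value) {
  let examined = 0;
  const found = docs.filter((doc) => {
    examined += 1;
    return doc[field] === value;
  });
  return { found, examined };
}

// 10,000 documents, of which every hundredth is "open".
const docs = Array.from({ length: 10000 }, (_, i) => ({
  _id: i,
  status: i % 100 === 0 ? "open" : "closed",
}));

const viaScan = scan(docs, "status", "open"); // touches all 10,000 documents
const statusIndex = buildIndex(docs, "status"); // extra storage, built once
const viaIndex = statusIndex.get("open") || []; // direct lookup of the 100 matches
```

The index costs memory and has to be updated on every insert, which is the write-side half of the trade-off.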
Query Optimization
Analyze: is index used? Explain plan: shows execution. Create indexes: frequently queried fields. Monitor: performance degradation indicates missing indexes.
Index Strategy
Equality first (where field = value), then sort, then range (the ESR guideline). Index: {status: 1, date: 1} handles equality on status plus a range and a sort on date, because the range and the sort use the same field.
Compound Indexes
Multiple fields: {status: 1, date: 1} speeds queries filtering by status and date. Order matters: status first, date second.
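Why field order matters can be sketched with a sorted array standing in for the index. The helpers and sample orders below are hypothetical, and the linear `findIndex` call takes the place of the tree search a real B-tree index would perform.

```javascript
// Hypothetical compound index on {status: 1, date: 1}:
// entries sorted by status first, then by date within each status.
function buildCompoundIndex(docs) {
  return docs
    .map((doc) => ({ status: doc.status, date: doc.date, _id: doc._id }))
    .sort((a, b) => {
      if (a.status !== b.status) return a.status < b.status ? -1 : 1;
      if (a.date !== b.date) return a.date < b.date ? -1 : 1;
      return 0;
    });
}

// Equality on status plus a date range maps to one contiguous slice.
function rangeLookup(index, status, fromDate, toDate) {
  const start = index.findIndex((e) => e.status === status && e.date >= fromDate);
  if (start === -1) return [];
  const ids = [];
  for (let i = start; i < index.length; i++) {
    const entry = index[i];
    if (entry.status !== status || entry.date > toDate) break; // left the slice
    ids.push(entry._id);
  }
  return ids; // already in date order, so no separate sort step
}

// Dates are ISO strings here so that string comparison matches date order.
const orders = [
  { _id: 1, status: "shipped", date: "2024-03-01" },
  { _id: 2, status: "pending", date: "2024-01-15" },
  { _id: 3, status: "shipped", date: "2024-01-10" },
  { _id: 4, status: "shipped", date: "2024-02-20" },
  { _id: 5, status: "pending", date: "2024-02-01" },
];

const index = buildCompoundIndex(orders);
const shippedSinceFeb = rangeLookup(index, "shipped", "2024-02-01", "2024-12-31");
```

With the fields reversed (date first), entries for one status would be scattered across the index and the same query could not be answered from a single slice.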
Aggregation Framework
Purpose
Complex data processing: grouping, counting, filtering, transforming. MapReduce-like: process large datasets efficiently.
Pipeline
Stages: $match (filter), $group (aggregate), $sort (order), $limit (top N). Process data through stages sequentially.
Example
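db.sales.aggregate([
  // Keep sales from 2024 onward (assumes `date` is stored as a BSON date)
  {$match: {date: {$gte: ISODate("2024-01-01")}}},
  // Sum the amount per product
  {$group: {_id: "$product", total: {$sum: "$amount"}}},
  // Highest totals first
  {$sort: {total: -1}},
  // Keep the top 10
  {$limit: 10}
])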
Returns top 10 products by sales.
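To show what each stage does to the data, the same pipeline can be sketched in plain JavaScript over an in-memory array. The sample rows and the `topProducts` helper are hypothetical, and dates are ISO strings here so they compare lexicographically.

```javascript
// Hypothetical sample data.
const sales = [
  { product: "widget", amount: 20, date: "2024-02-01" },
  { product: "gadget", amount: 50, date: "2024-03-10" },
  { product: "widget", amount: 35, date: "2024-04-05" },
  { product: "gizmo", amount: 70, date: "2023-12-31" }, // dropped by the date filter
];

function topProducts(rows, since, limit) {
  const totals = new Map();
  for (const row of rows) {
    if (row.date < since) continue; // $match stage
    // $group stage with a $sum accumulator
    totals.set(row.product, (totals.get(row.product) || 0) + row.amount);
  }
  return [...totals.entries()]
    .map(([product, total]) => ({ _id: product, total }))
    .sort((a, b) => b.total - a.total) // $sort: {total: -1}
    .slice(0, limit); // $limit
}

const top = topProducts(sales, "2024-01-01", 10);
```

Filtering before grouping keeps the intermediate `totals` map small, which is the same reason `$match` belongs at the start of a pipeline.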
Operators
$sum, $avg, $min, $max: aggregations. $push: collect values. $cond: conditional. Powerful: complex transformations.
Performance
In-database: usually faster than application code because less data crosses the network. Optimization: put $match (and $sort) in early stages so indexes can be used. Avoid: large intermediate results.
Transactions and ACID
ACID Guarantees
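Single document: writes are atomic. Multi-document: supported by modern databases (MongoDB 4.0+ on replica sets, 4.2+ on sharded clusters). Atomicity: all or nothing. Isolation: snapshot isolation in MongoDB transactions. Consistency, Durability: depend on write concern and journaling settings.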
Multi-Document Transactions
Update multiple documents atomically. Example: transfer money (debit one, credit other). Both or neither succeed. Strong consistency.
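The all-or-nothing behavior can be sketched with an in-memory store: validate first, stage both changes, then commit them together. The `transfer` function and the account documents are hypothetical and ignore concurrency, which a real transaction also has to handle.

```javascript
// Hypothetical in-memory accounts keyed by _id.
const accounts = new Map([
  ["alice", { _id: "alice", balance: 100 }],
  ["bob", { _id: "bob", balance: 20 }],
]);

// Apply both updates or neither: work on copies, then commit together.
function transfer(store, fromId, toId, amount) {
  const from = store.get(fromId);
  const to = store.get(toId);
  if (!from || !to) throw new Error("unknown account");
  if (from.balance < amount) throw new Error("insufficient funds"); // abort, nothing written

  const staged = [
    { ...from, balance: from.balance - amount }, // debit
    { ...to, balance: to.balance + amount }, // credit
  ];
  for (const doc of staged) store.set(doc._id, doc); // commit step
}

transfer(accounts, "alice", "bob", 30);

let failed = false;
try {
  transfer(accounts, "bob", "alice", 500); // more than bob holds
} catch (err) {
  failed = true; // neither balance was touched
}
```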
Trade-offs
Transactions are slower (locking, coordination). The latency and availability benefits of eventual consistency are given up. Use when necessary (critical data), avoid otherwise.
Implementation
Pessimistic locking: lock before write (slow but safe). Optimistic: detect conflicts, retry. Choice affects performance.
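The optimistic approach can be sketched with a version counter: a write succeeds only if the version is unchanged since the read, otherwise the operation retries. All names here are hypothetical, and the `beforeWrite` hook exists only to simulate a concurrent writer in a single-threaded example.

```javascript
// Hypothetical store where every document carries a version counter.
const store = new Map([["doc1", { _id: "doc1", stock: 5, version: 1 }]]);

// The write succeeds only if nobody changed the document since it was read.
function compareAndSet(id, expectedVersion, changes) {
  const current = store.get(id);
  if (current.version !== expectedVersion) return false; // conflict detected
  store.set(id, { ...current, ...changes, version: current.version + 1 });
  return true;
}

// Read, compute, attempt the write; retry from a fresh read on conflict.
function decrementStock(id, maxRetries = 3, beforeWrite = () => {}) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const snapshot = store.get(id);
    beforeWrite(attempt); // test hook: lets another "writer" interleave
    if (compareAndSet(id, snapshot.version, { stock: snapshot.stock - 1 })) {
      return attempt; // number of attempts that were needed
    }
  }
  throw new Error("too many conflicts");
}

// Another writer sneaks in during the first attempt only.
const attempts = decrementStock("doc1", 3, (attempt) => {
  if (attempt === 1) compareAndSet("doc1", 1, { stock: 4 });
});
```

No lock is held between the read and the write, so uncontended updates stay cheap and only conflicting ones pay for a retry.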
Cost
Transactions expensive: slower throughput, higher latency. Use sparingly: design reducing transaction need (denormalization).
MongoDB Architecture
Storage Engine
WiredTiger: B-tree based, compression, caching. Embedded, efficient storage. Indexing: B-tree. Performance: optimized for SSDs.
Replication
Replica set: primary + secondaries. Writes: primary. Reads: configurable (primary, secondary). Failover automatic: new primary elected.
Sharding
Horizontal scaling: partition data across servers. Shard key: determines partition. Balancer: distributes data. Transparent: application unaware.
Query Routing
mongos (router): forwards queries to the appropriate shards. Parallel execution: multiple shards queried simultaneously. Merge results: combine for complete answer.
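Shard-key routing and scatter-gather can be sketched with a toy router. The `Router` class and the string hash are hypothetical and far simpler than MongoDB's chunk-based sharding; the shards are queried in sequence here rather than in parallel.

```javascript
// Hypothetical hashed shard key: map a key to one of N shards.
function shardFor(key, shardCount) {
  let hash = 0;
  for (const ch of String(key)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit string hash
  }
  return hash % shardCount;
}

class Router {
  constructor(shardCount) {
    this.shards = Array.from({ length: shardCount }, () => []);
  }

  insert(doc, shardKeyField) {
    const target = shardFor(doc[shardKeyField], this.shards.length);
    this.shards[target].push(doc);
  }

  // Targeted query: the shard key value identifies a single shard.
  findByShardKey(field, value) {
    const shard = this.shards[shardFor(value, this.shards.length)];
    return shard.filter((doc) => doc[field] === value);
  }

  // Scatter-gather: ask every shard, then merge the partial results.
  findAll(predicate) {
    return this.shards.flatMap((shard) => shard.filter(predicate));
  }
}

const router = new Router(3);
for (const id of ["u1", "u2", "u3", "u4", "u5", "u6"]) {
  router.insert({ userId: id, active: id !== "u3" }, "userId");
}

const one = router.findByShardKey("userId", "u4"); // touches one shard
const active = router.findAll((doc) => doc.active); // touches all shards
```

Queries that include the shard key stay on one shard, while queries on other fields have to fan out, which is why the choice of shard key shapes query cost.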
Consistency
Reads from the primary are strongly consistent. Reads from secondaries may be stale (eventual). Write concern: configurable (e.g. acknowledge after a majority of replicas). Read preference: primary, secondary, etc.
CouchDB and Replication
Multi-Master Replication
Any replica accepts writes. Distributed: all nodes equal. Replication: asynchronous, eventual consistency. Conflict resolution: application-defined.
Conflict Handling
Concurrent edits: divergent versions. CouchDB: keeps both, application resolves. Merge logic: business-defined.
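An application-defined merge might look like the sketch below. The document shape, the `updatedAt` field, and the merge rule (union the tags, keep the later title) are all assumptions chosen for illustration; they are not part of CouchDB.

```javascript
// Hypothetical merge for two conflicting revisions of the same document.
function resolveConflict(revA, revB) {
  const newer = revA.updatedAt >= revB.updatedAt ? revA : revB;
  return {
    _id: revA._id,
    title: newer.title, // business rule: the later edit wins for scalar fields
    tags: [...new Set([...revA.tags, ...revB.tags])].sort(), // union for sets
    updatedAt: newer.updatedAt,
  };
}

// The same note was edited on two devices while both were offline.
const onPhone = { _id: "note1", title: "Groceries", tags: ["home"], updatedAt: 100 };
const onLaptop = { _id: "note1", title: "Grocery list", tags: ["shopping"], updatedAt: 120 };

const merged = resolveConflict(onPhone, onLaptop);
```

Because both revisions are still available at merge time, the rule can combine them field by field instead of discarding one edit wholesale.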
Attachments
Store binary data (images, PDFs) with documents. Efficient: separate storage, reference from document.
Views (MapReduce)
Pre-calculated queries (indexes). Map: extract data. Reduce: aggregate. Efficient: avoid scanning full database.
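A view can be sketched as a map step that emits key/value rows once, followed by a reduce over those rows. The `buildView` and `reduceByKey` helpers are hypothetical; in CouchDB itself `emit` is provided to the map function by the server rather than passed as an argument.

```javascript
// Hypothetical view builder: run the map function over every document once
// and store the emitted rows sorted by key.
function buildView(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ key, value, id: doc._id }));
  }
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return rows;
}

// Reduce the rows for each key to a single value (here: a sum).
function reduceByKey(rows) {
  const totals = {};
  for (const { key, value } of rows) {
    totals[key] = (totals[key] || 0) + value;
  }
  return totals;
}

const docs = [
  { _id: "a", type: "order", city: "NYC", amount: 10 },
  { _id: "b", type: "order", city: "Boston", amount: 25 },
  { _id: "c", type: "user", name: "Alice" },
  { _id: "d", type: "order", city: "NYC", amount: 5 },
];

// Map: emit (city, amount) for orders only; other documents emit nothing.
const rows = buildView(docs, (doc, emit) => {
  if (doc.type === "order") emit(doc.city, doc.amount);
});
const totals = reduceByKey(rows);
```

Queries then read the stored rows instead of rescanning every document, which is where the efficiency comes from.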
Peer-to-Peer
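Mobile/offline: sync when online. Desktop app: local CouchDB, sync to server. Conflicts are detected during replication: CouchDB picks a deterministic winning revision and keeps the others so the application can resolve them.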
Document Databases Comparison
MongoDB vs. CouchDB
| Feature | MongoDB | CouchDB |
|---|---|---|
| Replication | Primary-Secondary | Multi-Master |
| Consistency | Strong (primary reads) | Eventual |
| Transactions | Multi-document ACID | Single-document atomic |
| Mobile Sync | No (not built in) | Yes (PouchDB) |
Use Cases
MongoDB: web apps, consistent data, strong transactions. CouchDB: mobile apps, offline-first, peer-to-peer sync.
Design Patterns
Embedding vs. Referencing
Embedding: nested document (denormalize, fewer joins). Referencing: foreign key (normalize, more joins). Trade-off: speed vs. consistency.
Example
Embedded (denormalize):
{user: "Alice", address: {street: "123 Main", city: "NYC"}}
One document, no joins.
Referenced (normalize):
{user: "Alice", addressId: 1}
+ {_id: 1, street: "123 Main", city: "NYC"}
Two lookups, joins required.
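The extra lookup that referencing requires can be sketched as an application-side join. The `addresses` map and the `withAddress` helper are hypothetical stand-ins for a second collection and the code that queries it.

```javascript
// Embedded: one read returns everything.
const embeddedUser = { user: "Alice", address: { street: "123 Main", city: "NYC" } };

// Referenced: the address lives in its own (hypothetical) collection.
const addresses = new Map([[1, { _id: 1, street: "123 Main", city: "NYC" }]]);
const referencedUser = { user: "Alice", addressId: 1 };

// Application-side join: a second lookup resolves the reference.
function withAddress(userDoc, addressCollection) {
  const { addressId, ...rest } = userDoc;
  return { ...rest, address: addressCollection.get(addressId) || null };
}

const resolved = withAddress(referencedUser, addresses);
```

After the join both shapes expose `address.city`, but the referenced form paid for a second read, and in exchange the address can be updated in one place.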
When to Embed
Embedded data: read together, small (arrays < 1000 items), update together. Example: user + preferences.
When to Reference
Shared data (other documents reference), large (large arrays), updated independently. Example: orders + products (products referenced by many).
Polymorphic Documents
Same collection: different types. Discriminator field: type indicator. Flexible: handle variations without multiple collections.
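Reading a polymorphic collection usually means dispatching on the discriminator field. The `events` collection, its `type` values, and the handlers below are hypothetical.

```javascript
// Hypothetical "events" collection holding several document shapes.
const events = [
  { _id: 1, type: "click", x: 10, y: 20 },
  { _id: 2, type: "purchase", amount: 49.5 },
  { _id: 3, type: "click", x: 3, y: 4 },
];

// One handler per discriminator value.
const describe = {
  click: (e) => `click at (${e.x}, ${e.y})`,
  purchase: (e) => `purchase of ${e.amount}`,
};

const summaries = events.map((e) => {
  const handler = describe[e.type];
  return handler ? handler(e) : `unknown type: ${e.type}`;
});
```

Keeping a fallback for unknown types matters because new document variants can appear in the collection before every reader has been updated.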