Document Databases (NoSQL)

Introduction

Document databases: store, retrieve, query documents (JSON/BSON objects). Flexible schema: documents can vary structure. Rich querying: more powerful than key-value, simpler than relational.

Examples: MongoDB, CouchDB, Firebase (Cloud Firestore), DynamoDB (primarily key-value, with document support). Popular because: developer-friendly (JSON maps directly onto programming-language objects), flexible (schema evolution is easy), powerful queries (filter, aggregate, limited joins).

Trade-offs: less structure than relational (schema flexibility shifts validation to the application), slower joins (denormalization often used instead), eventual consistency in some distributed configurations. In exchange, development velocity is high.

"Document databases enable rapid development through flexible schemas and rich querying. Bridge key-value simplicity and relational power, suitable for modern application development." -- NoSQL databases

Document Data Model

Document Concept

Document: self-contained data unit (JSON object). Fields: key-value pairs. Values: strings, numbers, arrays, nested objects, dates, binary. Hierarchical: documents contain other documents.

Collections

Similar documents grouped in collections. Like relational tables, but schema-less. Documents in collection vary structure. Flexible: add/remove fields per document.

Example Structure

Collection: users

    {
      "_id": 1001,
      "name": "Alice",
      "age": 30,
      "email": "alice@example.com",
      "address": { "street": "123 Main", "city": "NYC" },
      "hobbies": ["reading", "hiking"]
    }

Same collection, different structure possible (no email, address, age):

    {
      "_id": 1002,
      "name": "Bob",
      "phone": "555-1234"
    }
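Application code has to tolerate the fields that some documents omit. A minimal sketch in plain JavaScript, assuming the two example documents have already been fetched into an array; `contactLine` and its fallback strings are hypothetical names chosen for illustration.

```javascript
// Two documents from the same "users" collection, with different shapes.
const users = [
  {
    _id: 1001, name: "Alice", age: 30, email: "alice@example.com",
    address: { street: "123 Main", city: "NYC" }, hobbies: ["reading", "hiking"],
  },
  { _id: 1002, name: "Bob", phone: "555-1234" },
];

// Optional chaining and defaults absorb the missing fields.
function contactLine(user) {
  const city = user.address?.city ?? "unknown city";
  const contact = user.email ?? user.phone ?? "no contact";
  return `${user.name} (${city}): ${contact}`;
}

const contactLines = users.map(contactLine);
```

Every reader of the collection needs this kind of defensive access, which is the price of letting documents differ.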

Primary Key (_id)

Unique identifier within a collection: system-generated (ObjectId in MongoDB) or supplied by the application. Every document must have one. Indexed automatically.

Advantages

Schema-less: evolve freely. Nested: fewer joins (denormalize). Natural: matches programming objects. Flexible: documents differ.

JSON and BSON Formats

JSON (JavaScript Object Notation)

Human-readable format: key-value pairs, arrays, nested objects. Text-based: easy to read/debug. Standard: widely supported.

BSON (Binary JSON)

Binary format: more efficient to store and traverse. Used by MongoDB for storage and wire transfer: faster to process than text JSON. Adds types JSON lacks: dates, binary data, ObjectId. Conversion is automatic: drivers translate between BSON and native objects.

Data Types Supported

String, number (int, float), boolean, null, array, object (nested), date, binary, ObjectId, timestamp. Richer than plain JSON, which has no date or binary type.

Size Overhead

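BSON is larger than a minimal binary encoding, and sometimes larger than the equivalent JSON: every document repeats its field names and carries type and length information. Trade-off: flexibility and fast traversal for size. Storage compression reduces the overhead.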
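The overhead can be made concrete with a toy encoder. This is a minimal sketch, not a replacement for a real BSON library: it assumes Node.js (for `Buffer`) and handles only flat documents whose values are strings or 32-bit integers. `encodeBson` is a hypothetical helper name.

```javascript
// Minimal BSON encoder for flat documents with string and int32 values only.
function encodeBson(doc) {
  const parts = [];
  for (const [key, value] of Object.entries(doc)) {
    const name = Buffer.from(key + "\0", "utf8"); // field name as a C string
    if (typeof value === "string") {
      const text = Buffer.from(value + "\0", "utf8");
      const len = Buffer.alloc(4);
      len.writeInt32LE(text.length); // length includes the trailing NUL
      parts.push(Buffer.from([0x02]), name, len, text); // 0x02 = string type tag
    } else if (Number.isInteger(value)) {
      const num = Buffer.alloc(4);
      num.writeInt32LE(value);
      parts.push(Buffer.from([0x10]), name, num); // 0x10 = int32 type tag
    } else {
      throw new TypeError(`unsupported value for field ${key}`);
    }
  }
  const body = Buffer.concat([...parts, Buffer.from([0x00])]); // document terminator
  const size = Buffer.alloc(4);
  size.writeInt32LE(body.length + 4); // total size counts the size field itself
  return Buffer.concat([size, body]);
}

const tinyDoc = { a: 1 };
const bsonBytes = encodeBson(tinyDoc).length;        // size prefix + type tag + name + value + terminator
const jsonBytes = JSON.stringify(tinyDoc).length;    // compact text form
```

For this tiny document the BSON form is longer than the JSON text, because the size prefix, type tag, and fixed-width integer outweigh the punctuation JSON needs.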

Interoperability

JSON standard: many languages support. Seamless: retrieve document from DB, use in application without conversion.

Flexible Schemas

Schema-Less Advantage

No schema enforcement: add fields without migration. Documents vary structure. Supports rapid prototyping, evolving requirements.

Schema Validation (Optional)

Many document databases support optional validation (e.g., MongoDB's $jsonSchema validators): enforce structure if desired. Best of both: flexible when needed, strict when desired.
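The idea behind optional validation can be sketched in a few lines of plain JavaScript. This is an illustration of the concept only, not any database's validator API: `userRules` and `validateDoc` are hypothetical, and the rule format (required fields plus expected `typeof` results) is an assumption.

```javascript
// Hypothetical rule set: which fields are required, and the expected type of known fields.
const userRules = {
  required: ["name"],
  types: { name: "string", age: "number" },
};

// Returns a list of problems; an empty list means the document passes.
function validateDoc(doc, rules) {
  const errors = [];
  for (const field of rules.required) {
    if (!(field in doc)) errors.push(`missing required field: ${field}`);
  }
  for (const [field, expected] of Object.entries(rules.types)) {
    if (field in doc && typeof doc[field] !== expected) {
      errors.push(`${field} should be of type ${expected}`);
    }
  }
  return errors;
}

const okErrors = validateDoc({ name: "Bob", phone: "555-1234" }, userRules);
const badErrors = validateDoc({ age: "30" }, userRules);
```

Fields the rules do not mention (such as `phone`) pass untouched, so the collection stays flexible where strictness is not needed.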

Migration Patterns

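Add new field: existing documents simply lack it (the field is missing, not stored as null), so readers must handle its absence. Gradual migration: new inserts include the field; old documents are updated when next read or written. No downtime.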
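Gradual migration on read might look like the sketch below. It assumes a new `status` field with a default of `"active"`; the in-memory `userStore` map stands in for a collection, and all names are hypothetical.

```javascript
// Hypothetical in-memory stand-in for a collection, keyed by _id.
const userStore = new Map([
  [1, { _id: 1, name: "Alice" }],                    // written before "status" existed
  [2, { _id: 2, name: "Bob", status: "suspended" }], // written after
]);

// Read a document; if it predates the new field, fill in a default and save it back.
function readUser(id) {
  const doc = userStore.get(id);
  if (doc === undefined || "status" in doc) return doc;
  const upgraded = { ...doc, status: "active" }; // "active" is an assumed default
  userStore.set(id, upgraded);                   // migrate lazily, one document at a time
  return upgraded;
}
```

Documents that are never read stay in the old shape, so a background sweep is still needed if the old shape must eventually disappear.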

Renaming Fields

Add new field, copy data, delete old (or keep both). Flexible: handle gradually, no bulk migration required.

Risks

Too flexible: inconsistent data, confusing. Governance: document schema (informally), validate, evolve carefully.

Querying and Filtering

Query Syntax

MongoDB: `db.users.find({age: {$gt: 30}})` (find users with age > 30). CouchDB: MapReduce views or Mango queries (JSON selectors). Rich: filter by any field.

Comparison Operators

$eq, $ne, $gt, $gte, $lt, $lte, $in, $nin. Logical: $and, $or, $not. Flexible: complex conditions.
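How these operators combine can be traced with a small in-memory matcher. This is a simplified sketch in plain JavaScript, not MongoDB's query engine: it covers top-level fields only, omits `$not`, and ignores MongoDB's type-ordering rules. `matches` and `comparators` are hypothetical names.

```javascript
const comparators = {
  $eq: (a, b) => a === b,
  $ne: (a, b) => a !== b,
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
  $in: (a, list) => list.includes(a),
  $nin: (a, list) => !list.includes(a),
};

// A filter matches when every one of its conditions holds (implicit AND).
function matches(doc, filter) {
  return Object.entries(filter).every(([key, cond]) => {
    if (key === "$and") return cond.every((f) => matches(doc, f));
    if (key === "$or") return cond.some((f) => matches(doc, f));
    const value = doc[key];
    // A bare value is shorthand for equality.
    if (cond === null || typeof cond !== "object") return value === cond;
    return Object.entries(cond).every(([op, arg]) => comparators[op](value, arg));
  });
}

const people = [
  { name: "Alice", age: 30 },
  { name: "Bob", age: 45 },
  { name: "Cy", age: 25 },
];
const over30 = people.filter((p) => matches(p, { age: { $gt: 30 } })).map((p) => p.name);
```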

Array Queries

Query array elements: `{hobbies: "reading"}` matches documents with reading in hobbies. $all, $elemMatch: advanced array queries.

Nested Document Queries

Query nested fields: `{"address.city": "NYC"}`. Dot notation: access nested values directly.
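The array and dot-notation semantics can be imitated in plain JavaScript. A simplified sketch with hypothetical helper names (`getPath`, `fieldEquals`, `fieldHasAll`); it does not reproduce every edge case of the real operators, such as dot paths that pass through arrays.

```javascript
// Resolve a dotted path such as "address.city" against a nested document.
function getPath(doc, path) {
  return path.split(".").reduce((node, key) => (node == null ? undefined : node[key]), doc);
}

// Equality with array semantics: an array field matches if any element equals the target.
function fieldEquals(doc, path, target) {
  const value = getPath(doc, path);
  return Array.isArray(value) ? value.includes(target) : value === target;
}

// $all-style check: the array field must contain every listed value.
function fieldHasAll(doc, path, targets) {
  const value = getPath(doc, path);
  return Array.isArray(value) && targets.every((t) => value.includes(t));
}

const alice = {
  name: "Alice",
  address: { street: "123 Main", city: "NYC" },
  hobbies: ["reading", "hiking"],
};
const readsBooks = fieldEquals(alice, "hobbies", "reading");
const livesInNyc = fieldEquals(alice, "address.city", "NYC");
```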

Projection

Select fields: `find({...}, {name: 1, age: 1})` returns only name and age (plus _id, which is included unless excluded with `_id: 0`). Reduces data transferred.
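An inclusion projection amounts to copying the listed fields into a new object. A minimal sketch with a hypothetical `project` helper; it handles inclusion specs only and keeps `_id` unless the spec sets `_id: 0`.

```javascript
// Copy only the fields marked 1; keep _id unless it is explicitly excluded.
function project(doc, spec) {
  const out = "_id" in doc && spec._id !== 0 ? { _id: doc._id } : {};
  for (const [field, include] of Object.entries(spec)) {
    if (include === 1 && field in doc) out[field] = doc[field];
  }
  return out;
}

const fullUser = { _id: 1001, name: "Alice", age: 30, email: "alice@example.com" };
const nameAndAge = project(fullUser, { name: 1, age: 1 });
const nameOnly = project(fullUser, { name: 1, _id: 0 });
```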

Indexing Strategies

Index Types

Single field: fast equality/range queries. Compound: multiple fields. Text: full-text search. Geospatial: location-based. TTL: auto-delete expired documents.

Performance Impact

Without index: scan all documents (slow). With index: direct lookup (fast). Trade-off: faster queries, slower inserts (update index), extra storage.
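The trade-off can be illustrated with an in-memory stand-in. This sketch uses a JavaScript `Map` as the index purely for illustration; real document databases typically use B-tree indexes, and the collection, field, and function names here are hypothetical.

```javascript
const indexedDocs = Array.from({ length: 1000 }, (_, i) => ({
  _id: i,
  email: `user${i}@example.com`,
}));

// Without an index: examine documents one by one until a match is found.
function scanByEmail(email) {
  let examined = 0;
  for (const doc of indexedDocs) {
    examined += 1;
    if (doc.email === email) return { doc, examined };
  }
  return { doc: null, examined };
}

// With an index: a lookup structure from field value to document, built once.
const emailIndex = new Map(indexedDocs.map((doc) => [doc.email, doc]));

// Every insert must also update the index: the write and storage cost of faster reads.
function insertUser(doc) {
  indexedDocs.push(doc);
  emailIndex.set(doc.email, doc);
}

const scanResult = scanByEmail("user999@example.com");
const indexedResult = emailIndex.get("user999@example.com");
```

The scan touches every document before finding the last one, while the indexed lookup goes straight to it.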

Query Optimization

Analyze: is index used? Explain plan: shows execution. Create indexes: frequently queried fields. Monitor: performance degradation indicates missing indexes.

Index Strategy

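Order compound index fields: equality first (field = value), then sort, then range (MongoDB's ESR guideline). Index: {status: 1, date: 1} handles a query that filters on status, ranges over date, and sorts by date, because the sort and the range fall on the same field.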

Compound Indexes

Multiple fields: {status: 1, date: 1} speeds queries filtering by status and date. Order matters: status first, date second. The index also serves queries on status alone (a prefix), but not queries on date alone.
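Why field order matters becomes visible when the index is modeled as a sorted list of keys. A minimal sketch in plain JavaScript with hypothetical data; a real B-tree would seek to the start of the matching run instead of filtering the whole list.

```javascript
const orderDocs = [
  { _id: 1, status: "open", date: "2024-03-01" },
  { _id: 2, status: "closed", date: "2024-01-15" },
  { _id: 3, status: "open", date: "2024-01-10" },
  { _id: 4, status: "open", date: "2024-02-20" },
];

const cmp = (x, y) => (x < y ? -1 : x > y ? 1 : 0);

// Index entries sorted by status first, then date: the order of {status: 1, date: 1}.
const statusDateIndex = orderDocs
  .map(({ _id, status, date }) => ({ status, date, _id }))
  .sort((a, b) => cmp(a.status, b.status) || cmp(a.date, b.date));

// Equality on status plus a lower bound on date selects one contiguous run,
// and that run is already in date order, so no separate sort is needed.
function idsByStatusSince(status, fromDate) {
  return statusDateIndex
    .filter((entry) => entry.status === status && entry.date >= fromDate)
    .map((entry) => entry._id);
}

const recentOpen = idsByStatusSince("open", "2024-02-01");
```

A query on date alone gets no such run: matching dates are scattered across every status group.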

Aggregation Framework

Purpose

Complex data processing: grouping, counting, filtering, transforming. MapReduce-like: process large datasets efficiently.

Pipeline

Stages: $match (filter), $group (aggregate), $sort (order), $limit (top N). Process data through stages sequentially.

Example

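    db.sales.aggregate([
      // Keep sales dated 2024-01-01 or later. A string comparison like this only
      // works if dates are stored as ISO-8601 strings; use a date value otherwise.
      {$match: {date: {$gte: "2024-01-01"}}},
      // Sum the amounts per product.
      {$group: {_id: "$product", total: {$sum: "$amount"}}},
      // Highest totals first.
      {$sort: {total: -1}},
      // Keep the top 10.
      {$limit: 10}
    ])

Returns top 10 products by sales.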
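The same four stages can be traced by hand in plain JavaScript. A sketch over hypothetical in-memory data, with dates held as `Date` objects; it mirrors what the pipeline computes, not how the database executes it.

```javascript
const salesDocs = [
  { product: "pen", amount: 5, date: new Date("2024-02-01") },
  { product: "ink", amount: 20, date: new Date("2024-03-10") },
  { product: "pen", amount: 7, date: new Date("2024-04-02") },
  { product: "pad", amount: 9, date: new Date("2023-12-30") },
];

function topProducts(docs, since, limit) {
  const totals = new Map();
  for (const { product, amount, date } of docs) {
    if (date < since) continue;                               // $match
    totals.set(product, (totals.get(product) ?? 0) + amount); // $group with $sum
  }
  return [...totals]
    .map(([_id, total]) => ({ _id, total }))
    .sort((a, b) => b.total - a.total)                        // $sort: {total: -1}
    .slice(0, limit);                                         // $limit
}

const top = topProducts(salesDocs, new Date("2024-01-01"), 10);
```

Filtering before grouping keeps the intermediate map small, which is the same reason `$match` belongs at the front of a pipeline.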

Operators

$sum, $avg, $min, $max: aggregations. $push: collect values. $cond: conditional. Powerful: complex transformations.

Performance

In-database: usually faster than application code (less data transferred). Optimization: put $match and $sort in the earliest stages so they can use indexes. Avoid: large intermediate results.

Transactions and ACID

ACID Guarantees

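Single document: updates are atomic by design. Multi-document: supported by modern databases (MongoDB 4.0+ on replica sets, 4.2+ on sharded clusters). Atomicity: all or nothing. Isolation: snapshot isolation in MongoDB, not full serializability. Consistency, durability: depend on the configured read and write concerns.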

Multi-Document Transactions

Update multiple documents atomically. Example: transfer money (debit one, credit other). Both or neither succeed. Strong consistency.
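The all-or-nothing behavior of the money transfer can be sketched in memory. This illustrates the guarantee only, not a database transaction API: the `accounts` map stands in for a collection, and `transfer` is a hypothetical helper.

```javascript
const accounts = new Map([
  ["alice", { balance: 100 }],
  ["bob", { balance: 20 }],
]);

function transfer(from, to, amount) {
  if (!accounts.has(from) || !accounts.has(to)) throw new Error("unknown account");
  // Work on copies so nothing is visible until every step has succeeded.
  const source = { ...accounts.get(from) };
  const target = { ...accounts.get(to) };
  if (source.balance < amount) throw new Error("insufficient funds"); // abort: nothing applied
  source.balance -= amount;
  target.balance += amount;
  // "Commit": publish both updated documents together.
  accounts.set(from, source);
  accounts.set(to, target);
}

transfer("alice", "bob", 30);
```

A failed transfer throws before either document is replaced, so readers never see a debit without its matching credit.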

Trade-offs

Transactions are slower (locking, coordination) and give up the throughput and availability benefits of looser, eventually consistent operation. Use when necessary (critical data), avoid otherwise (faster).

Implementation

Pessimistic locking: lock before write (slow but safe). Optimistic: detect conflicts, retry. Choice affects performance.
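The optimistic approach can be sketched with a version number on each document. A minimal in-memory illustration; the `version` field, `versionedStore`, and both helper functions are assumptions made for this example.

```javascript
const versionedStore = new Map([["cart:1", { version: 1, items: ["pen"] }]]);

// Compare-and-set: the write succeeds only if the version is unchanged since the read.
function replaceIfUnchanged(key, expectedVersion, next) {
  if (versionedStore.get(key).version !== expectedVersion) return false; // conflict detected
  versionedStore.set(key, { ...next, version: expectedVersion + 1 });
  return true;
}

// Read, modify, attempt the write; on conflict, re-read and try again.
function addItem(key, item, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt += 1) {
    const current = versionedStore.get(key);
    const next = { ...current, items: [...current.items, item] };
    if (replaceIfUnchanged(key, current.version, next)) return true;
  }
  return false; // gave up after repeated conflicts
}

const added = addItem("cart:1", "ink");
```

No lock is held between the read and the write; the cost appears only when two writers actually collide and one of them has to retry.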

Cost

Transactions are expensive: lower throughput, higher latency. Use sparingly: design schemas that reduce the need for them (e.g., embed related data so a single-document update suffices).

MongoDB Architecture

Storage Engine

WiredTiger: B-tree based, compression, caching. Embedded, efficient storage. Indexing: B-tree. Performance: optimized for SSDs.

Replication

Replica set: primary + secondaries. Writes: primary. Reads: configurable (primary, secondary). Failover automatic: new primary elected.

Sharding

Horizontal scaling: partition data across servers. Shard key: determines partition. Balancer: distributes data. Transparent: application unaware.

Query Routing

mongos (router): forwards queries to the appropriate shards. Queries that include the shard key target a single shard; others are broadcast to all shards (scatter-gather) and run in parallel. Merge results: combine for the complete answer.
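Shard-key routing and scatter-gather can be sketched with arrays standing in for shards. Everything here is illustrative: `userId` as the shard key, the shard count, and the string hash are assumptions, and the hash is not the function MongoDB uses for hashed sharding.

```javascript
const SHARD_COUNT = 3;
const shards = Array.from({ length: SHARD_COUNT }, () => []);

// Simple deterministic string hash (illustrative only).
function shardFor(key) {
  let hash = 0;
  for (const ch of String(key)) hash = (hash * 31 + ch.codePointAt(0)) >>> 0;
  return hash % SHARD_COUNT;
}

function insertDoc(doc) {
  shards[shardFor(doc.userId)].push(doc); // userId is the hypothetical shard key
}

// Targeted query: the filter includes the shard key, so one shard is consulted.
function findByUser(userId) {
  return shards[shardFor(userId)].filter((doc) => doc.userId === userId);
}

// Scatter-gather: no shard key in the filter, so every shard is asked and results are merged.
function findByCity(city) {
  return shards.flatMap((shard) => shard.filter((doc) => doc.city === city));
}

[
  { userId: "u1", city: "NYC" },
  { userId: "u2", city: "Boston" },
  { userId: "u3", city: "NYC" },
  { userId: "u1", city: "Boston" },
].forEach(insertDoc);
```

All documents for one `userId` land on the same shard, which is why a well-chosen shard key keeps the common queries targeted.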

Consistency

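Reads from the primary are strongly consistent; reads from secondaries may be stale (eventual). Write concern and read concern: configurable per operation. Read preference: primary, primaryPreferred, secondary, secondaryPreferred, nearest.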

CouchDB and Replication

Multi-Master Replication

Any replica accepts writes. Distributed: all nodes equal. Replication: asynchronous, eventual consistency. Conflict resolution: application-defined.

Conflict Handling

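Concurrent edits: divergent versions. CouchDB keeps all conflicting revisions, deterministically picks one as the winner on every replica, and leaves final resolution to the application. Merge logic: business-defined.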
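Application-side merge logic might look like the sketch below. The business rule is an assumption chosen for illustration: scalar fields take the value from the most recently updated revision, while the `tags` array keeps additions from both sides. `mergeRevisions` and the `updatedAt` field are hypothetical.

```javascript
// Merge two divergent revisions of the same document.
function mergeRevisions(a, b) {
  const newer = a.updatedAt >= b.updatedAt ? a : b;
  return {
    ...newer,                                          // scalar fields: last writer wins
    tags: [...new Set([...a.tags, ...b.tags])].sort(), // set-like field: keep both sides' additions
  };
}

const phoneEdit = { _id: "note-1", title: "Draft", tags: ["todo", "home"], updatedAt: 1 };
const laptopEdit = { _id: "note-1", title: "Final", tags: ["todo", "urgent"], updatedAt: 2 };
const mergedNote = mergeRevisions(phoneEdit, laptopEdit);
```

After merging, the application would write the result as a new revision and delete the losing ones; those steps are omitted here.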

Attachments

Store binary data (images, PDFs) with documents. Efficient: separate storage, reference from document.

Views (MapReduce)

Pre-calculated queries (indexes). Map: extract data. Reduce: aggregate. Efficient: avoid scanning full database.
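A view can be imitated in memory to show the two phases. A sketch only: in CouchDB the map function calls a global `emit`, whereas here `emit` is passed in as an argument to keep the example self-contained, and `buildView` and `sumByKey` are hypothetical names.

```javascript
// Map phase: run the map function over every document and collect emitted rows, sorted by key.
function buildView(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ key, value, id: doc._id }));
  }
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Reduce phase, grouped by key: here a sum per key.
function sumByKey(rows) {
  const totals = {};
  for (const { key, value } of rows) totals[key] = (totals[key] ?? 0) + value;
  return totals;
}

const ledger = [
  { _id: "a", type: "order", amount: 10 },
  { _id: "b", type: "refund", amount: 3 },
  { _id: "c", type: "order", amount: 5 },
];
const amountsByType = buildView(ledger, (doc, emit) => emit(doc.type, doc.amount));
const totalsByType = sumByKey(amountsByType);
```

Because the rows are stored sorted by key, a query for one key or a key range reads a slice of the view instead of scanning the documents.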

Peer-to-Peer

Mobile/offline: sync when online. Desktop app: local CouchDB, sync to server. Sync is automatic; conflicts still surface to the application for resolution.

Document Databases Comparison

MongoDB vs. CouchDB

Feature	MongoDB	CouchDB
Replication	Primary-Secondary	Multi-Master
Consistency	Strong (primary reads)	Eventual
Transactions	Multi-document ACID	Single-document atomicity
Mobile Sync	No	Yes (PouchDB)

Use Cases

MongoDB: web apps, consistent data, strong transactions. CouchDB: mobile apps, offline-first, peer-to-peer sync.

Design Patterns

Embedding vs. Referencing

Embedding: nested document (denormalize, fewer joins). Referencing: foreign key (normalize, more joins). Trade-off: read speed vs. update consistency (embedded copies of shared data can drift apart).

Example

Embedded (denormalize):

    {user: "Alice", address: {street: "123 Main", city: "NYC"}}

One document, no joins.

Referenced (normalize):

    {user: "Alice", addressId: 1}
    {_id: 1, street: "123 Main", city: "NYC"}

Two lookups, joins required.
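The difference in access pattern can be traced in plain JavaScript. A minimal sketch using the example documents; the `addressBook` map stands in for a separate addresses collection, and the helper names are hypothetical.

```javascript
const addressBook = new Map([[1, { _id: 1, street: "123 Main", city: "NYC" }]]);

const embeddedUser = { user: "Alice", address: { street: "123 Main", city: "NYC" } };
const referencedUser = { user: "Alice", addressId: 1 };
const housemate = { user: "Bob", addressId: 1 }; // shares the same address document

// Embedded: the data is already inside the document.
const cityEmbedded = (doc) => doc.address.city;

// Referenced: a second lookup resolves the id (an application-side join).
const cityReferenced = (doc) => addressBook.get(doc.addressId).city;

// Updating the shared address once is visible through every reference.
addressBook.get(1).city = "Boston";
```

After the update both referencing users resolve to the new city, while the embedded copy still holds the old value until it is rewritten separately.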

When to Embed

Embedded data: read together, bounded in size (rule of thumb: arrays under about 1,000 items; MongoDB caps a document at 16 MB), updated together. Example: user + preferences.

When to Reference

Shared data (other documents reference), large (large arrays), updated independently. Example: orders + products (products referenced by many).

Polymorphic Documents

Same collection: different types. Discriminator field: type indicator. Flexible: handle variations without multiple collections.
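Reading a polymorphic collection usually means dispatching on the discriminator. A minimal sketch, assuming a discriminator field named `type` and two hypothetical event shapes.

```javascript
// Documents of different shapes in one hypothetical "events" collection.
const events = [
  { type: "click", x: 10, y: 20 },
  { type: "purchase", amount: 42 },
];

// One handler per discriminator value.
const describers = {
  click: (event) => `click at (${event.x}, ${event.y})`,
  purchase: (event) => `purchase of ${event.amount}`,
};

function describeEvent(doc) {
  const handler = describers[doc.type];
  if (!handler) throw new Error(`unknown type: ${doc.type}`);
  return handler(doc);
}

const descriptions = events.map(describeEvent);
```

Failing loudly on an unknown `type` catches documents written by newer code before they are silently misread.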
