Introduction
Graph database: specialized database optimizing connected data representation. Native: stores relationships explicitly (not joins). Queries: traverse connections directly (not computed). Examples: Neo4j, ArangoDB, Amazon Neptune. Advantage: fast traversal, expressive queries. Use: social networks, knowledge graphs, recommendations.
"Graph databases realize a fundamental truth: the world is connected. Every entity has relationships to others. Graph databases make these connections first-class citizens, not afterthoughts, enabling queries impossible in traditional databases." -- Graph theory
Graph Database Definition
Core Concept
Graph: mathematical structure (nodes, edges). Database: persistent storage with query. Native: traversal native operations (not joins). Property: attributes on nodes/edges. Directed/undirected: edges have direction or not.
Difference from Relational
RDBMS: relationships via foreign keys (computed, not stored). Joins: expensive (multiple table scans). Graph: relationships explicit (stored, traversed). Paths: natural, efficient.
Graph Theory Terminology
Vertex/Node: entity (person, place, thing). Edge/Relationship: connection (knows, likes, owns). Property: attribute (name, weight, type). Degree: number of edges connected to node. Path: sequence of edges.
Examples
Social Network: Nodes: people (Alice, Bob) Edges: friendships (Alice--knows-->Bob) Properties: name, age, timestampKnowledge Graph: Nodes: concepts, entities Edges: relationships (hypernym, meronym) Properties: definitions, descriptionsGraph Data Model
Property Graph Model
Most common: nodes and edges have properties. Key-value pairs: define attributes. Type: labeled (nodes, edges). Flexible: schema-optional.
Triple/RDF Model
Subject-predicate-object: semantic web standard. (Alice, knows, Bob). Reasoning: inference rules. Standardized: W3C SPARQL query language.
Hypergraph Model
Edges connect multiple nodes (not just pairs). More expressive: model complex relationships. Less common: fewer implementations.
Comparison
| Model | Structure | Use |
|---|---|---|
| Property Graph | Nodes + Edges + Properties | Social, Recommendations |
| RDF | Triples (SPO) | Semantic Web, Reasoning |
| Hypergraph | Multi-node edges | Complex domains |
Nodes and Edges
Nodes (Vertices)
Entities: person, place, product. Labels: classification (Person, City, Movie). Properties: attributes (name, location, year). Relationships: connected to other nodes via edges.
Edges (Relationships)
Connections: between nodes. Types: KNOWS, LIKES, OWNS (labeled). Direction: source -> target (directed). Weight: strength or cost (optional).
Example Structure
Node Alice (Person): Properties: name="Alice", age=30, city="NYC" Edges: KNOWS, WORKS_FOR, LIKESNode Bob (Person): Properties: name="Bob", age=25, city="SF" Edges: KNOWS, WORKS_FOR, LIKESEdge KNOWS (Alice -> Bob): Properties: since=2020, strength=0.8 Direction: Alice knows Bob (may or may not be mutual)Cardinality
One-to-many: one node to many edges. Many-to-many: natural (just edges). No joins: relationships are direct.
Graph Properties and Attributes
Node Properties
Intrinsic: describe entity. Example: Person node (name, age, email). Indexed: enable fast lookup. Optional: schema-flexible.
Edge Properties
Relationship attributes: since date, confidence, weight. Metadata: when created, who created. Support: traversal filtering.
Labels and Types
Node labels: Person, Organization, LocationEdge types: KNOWS, WORKS_FOR, LOCATED_INQuery: MATCH (p:Person)-[:KNOWS]->(friend:Person)-- Find all people and their friends (labeled nodes, typed edges)Indexing Properties
Primary: on frequently queried columns. Secondary: support filtering. Compound: multiple properties together. Performance: critical for large graphs.
Constraints
Unique: property must be unique (ID). Required: property must exist. Type: enforce data type. Referential: edge endpoints must exist.
Graph Query Languages
Cypher (Neo4j)
-- Match pattern: Alice knows BobMATCH (a:Person {name:"Alice"})-[:KNOWS]->(b:Person)RETURN b.name-- More complex: friends of friendsMATCH (a:Person {name:"Alice"})-[:KNOWS]->()-[:KNOWS]->(fof)RETURN DISTINCT fof.nameGremlin (Traversal Language)
g.V().has('name', 'Alice') .out('KNOWS') .out('KNOWS') .values('name')-- More imperative: step-by-step traversalSPARQL (RDF Databases)
PREFIX foaf: SELECT ?name WHERE { ?alice foaf:name "Alice" . ?alice foaf:knows ?friend . ?friend foaf:name ?name .}Comparison
Cypher: intuitive, pattern matching. Gremlin: flexible, procedural. SPARQL: standardized, semantic. Choice: depends on system and preference.
Graph Traversal Algorithms
Depth-First Search (DFS)
Explore graph: go deep first (follow single path far)Use: reachability, cycles, topological sortImplementation: recursive or stack-basedBreadth-First Search (BFS)
Explore graph: go wide first (visit neighbors level by level)Use: shortest path, level-order traversalImplementation: queue-basedExample: social network (distance from person)Shortest Path (Dijkstra)
Find minimum-cost path between nodesUse: navigation, routing, recommendationsRequirement: non-negative weightsPageRank and Centrality
Importance: how central is a node? Used: recommendation, influence analysis. Iterative: based on incoming connections.
Community Detection
Cluster nodes: find groups. Algorithms: Louvain, label propagation. Application: social network analysis.
Indexing and Optimization
Node Indexing
By label: Person nodes indexed separately. By property: name, ID indexed. Compound: (label, property) combinations.
Edge Indexing
By type: KNOWS edges indexed. By direction: outgoing/incoming. Local: from specific node.
Query Optimization
Selectivity: filter early (most selective predicates first). Cardinality: start with smallest result set. Join order: similar to RDBMS, optimized traversal.
Adjacency Lists
Implicit: graphs naturally support adjacency. Fast: one-hop lookups O(1) or O(log n). Compared to joins: far superior.
Caching
Frequently traversed: keep in cache. Path caching: remember common paths. Prefetching: load likely next nodes.
Graph Database Implementations
Neo4j
Most popular: property graph model. Cypher: native query language. ACID: transactions supported. Community and Enterprise editions.
Amazon Neptune
AWS managed: serverless, scalable. Multi-model: property graphs and RDF. Compatible: Gremlin and SPARQL APIs.
ArangoDB
Multi-model: graph, document, key-value. ArangoDB Query Language (AQL). Distributed: sharding, replication.
TigerGraph
Enterprise: large-scale graphs. Native: billions of edges. GSQL: graph query language. Real-time: sub-second traversals.
Comparison
| System | Model | Query | Scale |
|---|---|---|---|
| Neo4j | Property Graph | Cypher | Billions |
| Neptune | Property/RDF | Gremlin/SPARQL | Billions+ |
| TigerGraph | Property Graph | GSQL | Trillions+ |
Design Patterns
Relationship Modeling
One-to-many: edges represent relationships (natural)Example: Person -[WORKS_FOR]-> CompanyDenormalization: properties on edgesExample: -[WORKS_FOR {since: 2020, role: "Engineer"}]->Intermediate nodes: reify relationshipsExample: Person -[ASSIGNED]-> Task -[ASSIGNED_TO]-> ProjectType Hierarchies
Labels: inheritance via labels. Multiple: node can have multiple types. Subtypes: specialization (Person -> Employee -> Manager).
Time-Traveling
Temporal nodes: historical snapshots. Temporal edges: when relationships existed. Queries: at specific timestamps.
Recommendation Patterns
Collaborative filtering: similar users -> similar itemsUser A -[LIKES]-> Item XUser B -[LIKES]-> Item XUser A -[?]-> Item Y (recommend Y to A if B likes Y)Content-based: similar itemsItem X -[SIMILAR]-> Item YReal-World Applications
Social Networks
Facebook, LinkedIn: billions of nodes/edges. Friend recommendations: relationship traversal. News feeds: connected users.
Knowledge Graphs
Google, Wikipedia: concept networks. Entity resolution: linking real-world entities. Inference: semantic reasoning. Search enhancement.
Recommendation Systems
Netflix, Amazon: collaborative filtering. Personalization: user-product graphs. Real-time: fast traversal essential.
Fraud Detection
Financial networks: detect rings, unusual patterns. Relationships: money flows, accounts. Pattern matching: graph queries.
Identity and Access Management
Permissions: resource ownership, group membership. Roles: hierarchical relationships. Policy: graph-based access control.
Master Data Management
Integration: consolidate data sources. Relationships: link duplicate records. Data quality: ensure consistency.
Performance Characteristics
Traversal Speed
One-hop lookup: O(1) or O(log n) with index. Multi-hop path: O(branching factor ^ depth). Compared to joins: dramatically faster (no Cartesian product).
Memory Usage
Adjacency lists: efficient storage. Compared to normalization: less redundancy. Cache: frequently traversed nodes in memory.
Query Latency
| Query Type | Latency | Notes |
|---|---|---|
| Single node lookup | < 1 ms | Indexed |
| One-hop traversal | < 5 ms | Direct access |
| Multi-hop path | < 100 ms | Limited depth |
| Full graph scan | Minutes-hours | All nodes/edges |
Scalability
Horizontal: sharding by node/partition. Vertical: in-memory caching. Billions of nodes: proven. Petabyte-scale: specialized systems (TigerGraph).
References
- Robinson, I., Webber, J., and Eifrem, E. "Graph Databases: New Opportunities for Connected Data." O'Reilly Media, 2nd edition, 2015.
- Angles, R., and Gutierrez, C. "Survey of Graph Database Models." ACM Computing Surveys, vol. 40, no. 1, 2008, pp. 1-39.
- Kleppmann, M. "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems." O'Reilly Media, 2017.
- Harman, K. "Getting Started with Graph Databases." O'Reilly Media, 2014.
- Bondy, J. A., and Murty, U. S. R. "Graph Theory." Springer, Graduate Texts in Mathematics, 2008.