Graph Databases | What's Your IQ

Introduction

Graph database: specialized database optimizing connected data representation. Native: stores relationships explicitly (not joins). Queries: traverse connections directly (not computed). Examples: Neo4j, ArangoDB, Amazon Neptune. Advantage: fast traversal, expressive queries. Use: social networks, knowledge graphs, recommendations.

"Graph databases realize a fundamental truth: the world is connected. Every entity has relationships to others. Graph databases make these connections first-class citizens, not afterthoughts, enabling queries impossible in traditional databases." -- Graph theory

Graph Database Definition

Core Concept

Graph: mathematical structure (nodes, edges). Database: persistent storage with query. Native: traversal native operations (not joins). Property: attributes on nodes/edges. Directed/undirected: edges have direction or not.

Difference from Relational

RDBMS: relationships via foreign keys (computed, not stored). Joins: expensive (multiple table scans). Graph: relationships explicit (stored, traversed). Paths: natural, efficient.

Graph Theory Terminology

Vertex/Node: entity (person, place, thing). Edge/Relationship: connection (knows, likes, owns). Property: attribute (name, weight, type). Degree: number of edges connected to node. Path: sequence of edges.

Examples

Social Network: Nodes: people (Alice, Bob) Edges: friendships (Alice--knows-->Bob) Properties: name, age, timestampKnowledge Graph: Nodes: concepts, entities Edges: relationships (hypernym, meronym) Properties: definitions, descriptions

Graph Data Model

Property Graph Model

Most common: nodes and edges have properties. Key-value pairs: define attributes. Type: labeled (nodes, edges). Flexible: schema-optional.

Triple/RDF Model

Subject-predicate-object: semantic web standard. (Alice, knows, Bob). Reasoning: inference rules. Standardized: W3C SPARQL query language.

Hypergraph Model

Edges connect multiple nodes (not just pairs). More expressive: model complex relationships. Less common: fewer implementations.

Comparison

Model	Structure	Use
Property Graph	Nodes + Edges + Properties	Social, Recommendations
RDF	Triples (SPO)	Semantic Web, Reasoning
Hypergraph	Multi-node edges	Complex domains

Nodes and Edges

Nodes (Vertices)

Entities: person, place, product. Labels: classification (Person, City, Movie). Properties: attributes (name, location, year). Relationships: connected to other nodes via edges.

Edges (Relationships)

Connections: between nodes. Types: KNOWS, LIKES, OWNS (labeled). Direction: source -> target (directed). Weight: strength or cost (optional).

Example Structure

Node Alice (Person): Properties: name="Alice", age=30, city="NYC" Edges: KNOWS, WORKS_FOR, LIKESNode Bob (Person): Properties: name="Bob", age=25, city="SF" Edges: KNOWS, WORKS_FOR, LIKESEdge KNOWS (Alice -> Bob): Properties: since=2020, strength=0.8 Direction: Alice knows Bob (may or may not be mutual)

Cardinality

One-to-many: one node to many edges. Many-to-many: natural (just edges). No joins: relationships are direct.

Graph Properties and Attributes

Node Properties

Intrinsic: describe entity. Example: Person node (name, age, email). Indexed: enable fast lookup. Optional: schema-flexible.

Edge Properties

Relationship attributes: since date, confidence, weight. Metadata: when created, who created. Support: traversal filtering.

Labels and Types

Node labels: Person, Organization, LocationEdge types: KNOWS, WORKS_FOR, LOCATED_INQuery: MATCH (p:Person)-[:KNOWS]->(friend:Person)-- Find all people and their friends (labeled nodes, typed edges)

Indexing Properties

Primary: on frequently queried columns. Secondary: support filtering. Compound: multiple properties together. Performance: critical for large graphs.

Constraints

Unique: property must be unique (ID). Required: property must exist. Type: enforce data type. Referential: edge endpoints must exist.

Graph Query Languages

Cypher (Neo4j)

-- Match pattern: Alice knows BobMATCH (a:Person {name:"Alice"})-[:KNOWS]->(b:Person)RETURN b.name-- More complex: friends of friendsMATCH (a:Person {name:"Alice"})-[:KNOWS]->()-[:KNOWS]->(fof)RETURN DISTINCT fof.name

Gremlin (Traversal Language)

g.V().has('name', 'Alice') .out('KNOWS') .out('KNOWS') .values('name')-- More imperative: step-by-step traversal

SPARQL (RDF Databases)

PREFIX foaf: SELECT ?name WHERE { ?alice foaf:name "Alice" . ?alice foaf:knows ?friend . ?friend foaf:name ?name .}

Comparison

Cypher: intuitive, pattern matching. Gremlin: flexible, procedural. SPARQL: standardized, semantic. Choice: depends on system and preference.

Graph Traversal Algorithms

Depth-First Search (DFS)

Explore graph: go deep first (follow single path far)Use: reachability, cycles, topological sortImplementation: recursive or stack-based

Breadth-First Search (BFS)

Explore graph: go wide first (visit neighbors level by level)Use: shortest path, level-order traversalImplementation: queue-basedExample: social network (distance from person)

Shortest Path (Dijkstra)

Find minimum-cost path between nodesUse: navigation, routing, recommendationsRequirement: non-negative weights

PageRank and Centrality

Importance: how central is a node? Used: recommendation, influence analysis. Iterative: based on incoming connections.

Community Detection

Cluster nodes: find groups. Algorithms: Louvain, label propagation. Application: social network analysis.

Indexing and Optimization

Node Indexing

By label: Person nodes indexed separately. By property: name, ID indexed. Compound: (label, property) combinations.

Edge Indexing

By type: KNOWS edges indexed. By direction: outgoing/incoming. Local: from specific node.

Query Optimization

Selectivity: filter early (most selective predicates first). Cardinality: start with smallest result set. Join order: similar to RDBMS, optimized traversal.

Adjacency Lists

Implicit: graphs naturally support adjacency. Fast: one-hop lookups O(1) or O(log n). Compared to joins: far superior.

Caching

Frequently traversed: keep in cache. Path caching: remember common paths. Prefetching: load likely next nodes.

Graph Database Implementations

Neo4j

Most popular: property graph model. Cypher: native query language. ACID: transactions supported. Community and Enterprise editions.

Amazon Neptune

AWS managed: serverless, scalable. Multi-model: property graphs and RDF. Compatible: Gremlin and SPARQL APIs.

ArangoDB

Multi-model: graph, document, key-value. ArangoDB Query Language (AQL). Distributed: sharding, replication.

TigerGraph

Enterprise: large-scale graphs. Native: billions of edges. GSQL: graph query language. Real-time: sub-second traversals.

Comparison

System	Model	Query	Scale
Neo4j	Property Graph	Cypher	Billions
Neptune	Property/RDF	Gremlin/SPARQL	Billions+
TigerGraph	Property Graph	GSQL	Trillions+

Design Patterns

Relationship Modeling

One-to-many: edges represent relationships (natural)Example: Person -[WORKS_FOR]-> CompanyDenormalization: properties on edgesExample: -[WORKS_FOR {since: 2020, role: "Engineer"}]->Intermediate nodes: reify relationshipsExample: Person -[ASSIGNED]-> Task -[ASSIGNED_TO]-> Project

Type Hierarchies

Labels: inheritance via labels. Multiple: node can have multiple types. Subtypes: specialization (Person -> Employee -> Manager).

Time-Traveling

Temporal nodes: historical snapshots. Temporal edges: when relationships existed. Queries: at specific timestamps.

Recommendation Patterns

Collaborative filtering: similar users -> similar itemsUser A -[LIKES]-> Item XUser B -[LIKES]-> Item XUser A -[?]-> Item Y (recommend Y to A if B likes Y)Content-based: similar itemsItem X -[SIMILAR]-> Item Y

Real-World Applications

Social Networks

Facebook, LinkedIn: billions of nodes/edges. Friend recommendations: relationship traversal. News feeds: connected users.

Knowledge Graphs

Google, Wikipedia: concept networks. Entity resolution: linking real-world entities. Inference: semantic reasoning. Search enhancement.

Recommendation Systems

Netflix, Amazon: collaborative filtering. Personalization: user-product graphs. Real-time: fast traversal essential.

Fraud Detection

Financial networks: detect rings, unusual patterns. Relationships: money flows, accounts. Pattern matching: graph queries.

Identity and Access Management

Permissions: resource ownership, group membership. Roles: hierarchical relationships. Policy: graph-based access control.

Master Data Management

Integration: consolidate data sources. Relationships: link duplicate records. Data quality: ensure consistency.

Performance Characteristics

Traversal Speed

One-hop lookup: O(1) or O(log n) with index. Multi-hop path: O(branching factor ^ depth). Compared to joins: dramatically faster (no Cartesian product).

Memory Usage

Adjacency lists: efficient storage. Compared to normalization: less redundancy. Cache: frequently traversed nodes in memory.

Query Latency

Query Type	Latency	Notes
Single node lookup	< 1 ms	Indexed
One-hop traversal	< 5 ms	Direct access
Multi-hop path	< 100 ms	Limited depth
Full graph scan	Minutes-hours	All nodes/edges

Scalability

Horizontal: sharding by node/partition. Vertical: in-memory caching. Billions of nodes: proven. Petabyte-scale: specialized systems (TigerGraph).

References

Robinson, I., Webber, J., and Eifrem, E. "Graph Databases: New Opportunities for Connected Data." O'Reilly Media, 2nd edition, 2015.
Angles, R., and Gutierrez, C. "Survey of Graph Database Models." ACM Computing Surveys, vol. 40, no. 1, 2008, pp. 1-39.
Kleppmann, M. "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems." O'Reilly Media, 2017.
Harman, K. "Getting Started with Graph Databases." O'Reilly Media, 2014.
Bondy, J. A., and Murty, U. S. R. "Graph Theory." Springer, Graduate Texts in Mathematics, 2008.