
Designing Data-Intensive Applications
Martin Kleppmann
14 skills extracted
Data system decisions driven by buzzwords, defaults, or whatever the team used last time. No framework for evaluating trade-offs between replication strategies, consistency models, or processing paradigms. Debugging distributed system failures by guessing.
Systematic analysis of data system requirements producing scored technology recommendations with explicit trade-off documentation. Can evaluate replication topology, partitioning strategy, isolation level, and processing paradigm choices against specific workload characteristics. Produces architecture decision records with failure mode analysis.
Problems These Skills Solve
Choosing between replication strategies (single-leader, multi-leader, leaderless) for a specific workload
Selecting the right consistency and isolation guarantees without over- or under-provisioning
Designing partitioning schemes that avoid hotspots and support required query patterns
Choosing between batch and stream processing for data pipeline requirements
Working Environment
Skills operate on application codebases, infrastructure configurations, database schemas, and system architecture descriptions. Users are designing or evaluating data systems — choosing databases, replication strategies, partitioning schemes, consistency models, and data processing pipelines.
You provide
· Their codebase or system description, database schemas, requirements documents, and architecture diagrams
Install
Minimal
Foundation skills only — data models, storage, encoding, replication, transactions, distributed failures
Core
Foundation + diagnostic and processing skills
Full
All 14 skills including the capstone data integration architect
Extracted Skills

Batch Pipeline Designer
Design batch data processing pipelines for large-scale, bounded datasets processed offline. Use when building ETL workflows, processing logs or clickstream data at scale, generating ML feature pipelines or search indexes, or joining two large datasets that cannot fit in memory. Trigger phrases: "design a batch pipeline", "should I use Spark or MapReduce", "how do I join two large datasets", "build an ETL workflow", "process server logs at scale", "how do I handle skewed data in joins", "implement PageRank on a distributed graph", "design an offline processing job". Covers MapReduce vs dataflow engines (Spark, Flink, Tez), three join strategies (sort-merge, broadcast hash, partitioned hash) with selection criteria, graph processing via the Pregel/BSP model, and fault tolerance via materialization vs recomputation. Does not apply to unbounded input streams (see stream-processing-designer) or low-latency OLTP query serving. Produces a pipeline architecture recommendation with engine choice, join strategy, and fault tolerance approach.
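Of the three join strategies, the broadcast hash join is the easiest to sketch: when one dataset is small enough to fit in memory, build a hash table from it once and probe it from every partition of the large dataset, avoiding a shuffle entirely. A minimal Python sketch, with in-memory lists standing in for distributed partitions:

```python
def broadcast_hash_join(small_rows, large_partitions, small_key, large_key):
    """Broadcast hash join: hash the small dataset once, then probe it
    locally from each partition of the large dataset (no shuffle)."""
    # Build phase: hash table over the small side.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)

    # Probe phase: each large partition scans independently.
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[large_key], []):
                joined.append({**match, **row})
    return joined


users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "lin"}]
clicks = [
    [{"user_id": 1, "url": "/a"}],                                # partition 0
    [{"user_id": 2, "url": "/b"}, {"user_id": 1, "url": "/c"}],   # partition 1
]
result = broadcast_hash_join(users, clicks, "user_id", "user_id")
```

If neither side fits in memory, a sort-merge or partitioned hash join is needed instead; the skill's selection criteria cover that boundary.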

Concurrency Anomaly Detector
Scan application code, SQL queries, or ORM code for exposure to the 6 database concurrency anomalies and produce a findings report with severity, affected locations, and fix recommendations. Use when: debugging a nondeterministic data corruption or race condition bug under concurrent load; auditing transaction code before deployment or after switching databases (isolation defaults differ across engines); a read-modify-write cycle or check-then-act pattern may be exposed to lost updates or write skew; an aggregate query (COUNT, SUM) guards an INSERT or UPDATE (phantom read exposure); or multiple tables are updated in one transaction without serializable isolation. Distinct from transaction-isolation-selector (which chooses the isolation level) — this skill scans code to find which anomalies existing code is already exposed to. Covers Python, Java, Go, JavaScript, Ruby; raw SQL; ORM code (SQLAlchemy, Hibernate, ActiveRecord, GORM); PostgreSQL, MySQL InnoDB, Oracle, SQL Server, and distributed databases. Maps code patterns (read-modify-write, SELECT/INSERT pairs, cross-table boundaries, snapshot boundary reads) to anomaly type, trigger conditions, and minimum fix (isolation upgrade vs. application-level mitigation).
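The most common finding is the read-modify-write lost update. A toy Python simulation of the interleaving (a dict standing in for a database row; the fix in SQL is an atomic `UPDATE ... SET value = value + 1` or `SELECT ... FOR UPDATE`):

```python
# Vulnerable pattern: read-modify-write without atomicity.
counter = {"value": 0}

def unsafe_increment(snapshot):
    # Both "transactions" read the same snapshot, then write back:
    # the second write silently overwrites the first (lost update).
    counter["value"] = snapshot + 1

snap = counter["value"]       # both transactions read value = 0
unsafe_increment(snap)        # tx A writes 1
unsafe_increment(snap)        # tx B also writes 1 -- A's update is lost
lost_update_result = counter["value"]   # 1, not 2

# Fix: make the read and write one atomic step.
counter["value"] = 0

def atomic_increment():
    counter["value"] = counter["value"] + 1

atomic_increment()
atomic_increment()
safe_result = counter["value"]   # 2
```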

Consistency Model Selector
Choose the correct consistency model (linearizability, causal consistency, or eventual consistency) for each operation in a distributed system, and select the matching implementation mechanism. Use when designing a new distributed data system, deciding whether ZooKeeper or etcd is needed for coordination, evaluating whether two-phase commit is appropriate for cross-node transactions, debugging correctness violations (stale reads, split-brain, uniqueness constraint failures), or distinguishing linearizability from serializability. Also use when applying the CAP theorem correctly (beyond the "pick 2 of 3" oversimplification), selecting total order broadcast as a consensus primitive, evaluating 2PC failure modes and lock-holding cost, or assessing whether causal consistency is sufficient in place of linearizability. Produces a per-operation consistency recommendation with replication mechanism, ordering guarantee, and — when consensus is needed — protocol selection (Raft, Zab, Paxos) with documented failure modes. Does not cover replication topology or failure recovery strategy (see replication-strategy-selector, distributed-failure-analyzer).
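Uniqueness constraints are the canonical case that forces linearizability: exactly one of two concurrent claims may win, which a compare-and-set register guarantees. A single-process sketch (a real distributed implementation needs consensus, e.g. Raft; the lock here only stands in for that atomicity):

```python
import threading

class CASRegister:
    """A linearizable register via compare-and-set: each operation takes
    effect atomically at one point in time. Single-process sketch only."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

# Claiming a unique username: both clients attempt the same CAS.
reg = CASRegister(None)
first = reg.compare_and_set(None, "alice")    # wins
second = reg.compare_and_set(None, "bob")     # loses: already claimed
```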

Data Integration Architect
Design the integration architecture for systems with multiple specialized data stores (Postgres, Elasticsearch, Redis, data warehouses) that must stay in sync. Use when deciding how data flows between components, avoiding dual writes, reasoning about correctness across system boundaries (idempotency, end-to-end operation identifiers), choosing between Lambda and Kappa architecture, or applying the "unbundling databases" pattern to compose specialized tools instead of relying on a single monolith. Trigger phrases: "how do I keep Postgres and Elasticsearch in sync?", "should I use CDC or event sourcing to propagate data?", "how do I avoid dual writes across microservices?", "my downstream systems are going out of sync — how do I fix the architecture?", "how do I design derived data pipelines?", "what is the system of record pattern?", "how do I integrate OLTP with a search index and an analytics warehouse?", "how do I design for end-to-end idempotency?". This is the capstone skill for data systems design — it synthesizes batch pipelines, stream integration, consistency, and replication into a single architecture recommendation. Produces a component map (systems of record vs derived views), data flow diagram, and correctness analysis. Does not replace batch-pipeline-designer or stream-processing-designer — delegates to them for pipeline internals.
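The core move for avoiding dual writes can be shown in a few lines: write once to an ordered log, and let every derived store replay the same events in the same order, so they converge deterministically. A toy Python sketch (a list standing in for Kafka or a CDC change stream):

```python
# Dual writes: the app writes to two stores independently; if the writes
# race or one fails, the stores diverge permanently. Log-based
# integration: write once to an ordered log; every consumer applies the
# same events in the same order, so derived stores converge.

log = []   # stands in for Kafka / a CDC change stream

def write(event):
    log.append(event)   # the log is the single source of ordering

def build_view(apply):
    """Replay the log to (re)build a derived store deterministically."""
    view = {}
    for event in log:
        apply(view, event)
    return view

write({"op": "set", "key": "user:1", "value": "ada"})
write({"op": "set", "key": "user:1", "value": "ada lovelace"})

def apply_kv(view, e):
    view[e["key"]] = e["value"]

search_index = build_view(apply_kv)   # e.g. an Elasticsearch index
cache = build_view(apply_kv)          # e.g. a Redis cache
```

Because both views are pure functions of the log, they can also be rebuilt from scratch after a bug or schema change.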

Data Model Selector
Choose between relational, document, and graph data models for an application by analyzing data shape, relationship complexity, and query patterns. Use when asked "should I use MongoDB or PostgreSQL?", "when does a graph database make sense?", "how do I choose between SQL and NoSQL?", or "what data model fits my access patterns?" Also use for: evaluating impedance mismatch between data model and application code; deciding schema-on-read vs. schema-on-write for heterogeneous data; diagnosing whether many-to-many relationships call for relational or graph model; choosing between property graphs and triple-stores; deciding when polyglot persistence is appropriate. Produces a concrete recommendation with trade-off analysis — not "it depends." Covers relational (PostgreSQL, MySQL), document (MongoDB, CouchDB), and graph (Neo4j, Datomic) models including schema enforcement strategies and data locality trade-offs. For storage engine internals (LSM-tree vs B-tree), use storage-engine-selector instead. For OLTP vs. analytics routing, use oltp-olap-workload-classifier instead.
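Variable-depth traversals are the clearest signal that a graph model fits: the path length is not known up front, so SQL needs recursive CTEs while a graph store just follows edges. A small Python sketch of a friend-of-friend query over an adjacency map:

```python
from collections import deque

# A property graph reduced to an adjacency map for illustration.
friends = {
    "ada": {"lin", "bob"},
    "lin": {"ada", "eve"},
    "bob": {"ada"},
    "eve": {"lin", "kay"},
    "kay": {"eve"},
}

def within_hops(graph, start, max_hops):
    """Everyone reachable from `start` in at most `max_hops` edges (BFS)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue            # don't expand past the hop limit
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    seen.discard(start)
    return seen

two_hops = within_hops(friends, "ada", 2)   # lin, bob, eve -- but not kay
```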

Distributed Failure Analyzer
Diagnose distributed system failures caused by network faults, unreliable clocks, or process pauses — and map each to its correct mitigation. Use when: a node is intermittently timing out with no clear network outage; a lock-holder or leader keeps acting after being declared dead (zombie leader / split brain via distributed locking, not replication topology — use replication-failure-analyzer for replica split brain); stale reads persist beyond expected replication lag; wall-clock-based lease checks or last-write-wins conflict resolution is producing data loss under clock skew; or cascading node-death declarations are occurring under load. Also use proactively to audit timing assumptions in new system designs (absence of fencing tokens, NTP drift exposure, GC pause risk). Distinct from replication-failure-analyzer (replication lag anomalies, failover pitfalls, quorum edge cases). Produces a structured failure report: symptom → fault category → mechanism → mitigation. Covers: asynchronous network behavior, timeout tuning and cascade risk, NTP drift and clock jump mechanics, process pause causes (GC, VM migration, paging, SIGSTOP), fencing tokens with ZooKeeper zxid/cversion, Byzantine fault scoping, and system model selection (crash-stop vs. crash-recovery vs. Byzantine; synchronous vs. partially synchronous vs. asynchronous).
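The fencing-token mitigation is simple enough to sketch: the lock service issues monotonically increasing tokens (e.g. ZooKeeper's zxid), and the storage service rejects any write carrying a token older than one it has already seen, so a paused zombie leader cannot corrupt data. An illustrative Python version:

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token.
    Tokens must be issued in increasing order by the lock service;
    the store only has to check monotonicity."""
    def __init__(self):
        self.max_token = -1
        self.data = {}

    def write(self, token, key, value):
        if token <= self.max_token:
            return False            # stale lock-holder (zombie leader)
        self.max_token = token
        self.data[key] = value
        return True

store = FencedStore()
ok_old = store.write(33, "file", "draft by client 1")   # client 1 holds the lock
# Client 1 pauses (GC), its lease expires, client 2 acquires token 34.
ok_new = store.write(34, "file", "draft by client 2")
# Client 1 wakes up, still believing it holds the lock, and writes again.
ok_zombie = store.write(33, "file", "late write by client 1")
```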

Encoding Format Advisor
Select a data encoding format (JSON, Protobuf, Thrift, or Avro) and design a schema evolution strategy that preserves backward and forward compatibility through rolling upgrades. Use when asked "should I use Protobuf or JSON?", "how do I evolve my schema without breaking old clients?", "how does Avro schema evolution work?", "what's the difference between Thrift and Protocol Buffers?", or "how do I add/remove fields without breaking compatibility?" Also use for: choosing text vs. binary encoding for internal services; checking whether a schema change breaks compatibility; diagnosing unknown field loss bugs during rolling upgrades; planning per-dataflow encoding strategy (database storage vs. REST/RPC vs. message broker). Covers five encoding families: language-specific, JSON/XML/CSV, binary JSON, Thrift/Protobuf, and Avro — with writer/reader schema reconciliation and per-dataflow-mode analysis. For data model selection (relational/document/graph), use data-model-selector instead. For message broker or stream pipeline design, use stream-processing-designer instead.
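Avro's writer/reader schema reconciliation is the heart of the compatibility story, and a simplified version fits in a few lines: the reader maps a decoded record onto its own schema, filling defaults for fields the writer didn't know about and dropping fields it doesn't recognise. (Real Avro also checks type promotions and aliases; this sketch skips both.)

```python
def reconcile(record, reader_schema):
    """Avro-style schema resolution, simplified: map a record decoded
    with the writer's schema onto the reader's schema."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]   # forward compatibility
        else:
            raise ValueError(f"no value and no default for {name!r}")
    return out   # writer-only fields are dropped (backward compatibility)

# A record written before the `email` field existed, read by new code:
old_writer_record = {"id": 7, "name": "ada"}
new_reader_schema = {
    "fields": [
        {"name": "id"},
        {"name": "name"},
        {"name": "email", "default": None},   # added later, with a default
    ]
}
decoded = reconcile(old_writer_record, new_reader_schema)
```

This is why the rule "only add or remove fields that have defaults" preserves compatibility in both directions through a rolling upgrade.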

OLTP/OLAP Workload Classifier
Classify a data workload as OLTP, OLAP, or hybrid, then recommend the appropriate database architecture — transactional database, dedicated data warehouse, or both with an ETL pipeline. Use when asked "should I use a data warehouse?", "why are my analytics queries slow on my production database?", "should I use Redshift/BigQuery/Snowflake?", or "can one database handle both transactions and reporting?" Also use for: designing star or snowflake schemas for analytics; deciding when column-oriented storage is appropriate; planning ETL pipeline structure between operational and analytical systems; evaluating whether HTAP (hybrid) databases fit a workload. For choosing between relational/document/graph models, use data-model-selector instead. For storage engine internals (LSM-tree vs B-tree), use storage-engine-selector instead. For batch/stream pipeline design, use batch-pipeline-designer or stream-processing-designer instead.
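The classification intuition can be sketched as a heuristic: OLTP touches few rows at high request rates with a significant write fraction; OLAP scans huge row counts at low rates, mostly reads. The thresholds below are illustrative only, not from any standard:

```python
def classify_workload(rows_per_query, queries_per_sec, write_fraction):
    """Rough OLTP/OLAP heuristic with made-up illustrative thresholds."""
    oltp_signals = sum([
        rows_per_query < 1_000,
        queries_per_sec > 100,
        write_fraction > 0.1,
    ])
    olap_signals = sum([
        rows_per_query >= 1_000_000,
        queries_per_sec <= 10,
        write_fraction <= 0.05,
    ])
    if oltp_signals >= 2 and olap_signals < 2:
        return "OLTP"
    if olap_signals >= 2 and oltp_signals < 2:
        return "OLAP"
    return "hybrid"

checkout = classify_workload(rows_per_query=5,
                             queries_per_sec=2_000, write_fraction=0.4)
reporting = classify_workload(rows_per_query=50_000_000,
                              queries_per_sec=1, write_fraction=0.0)
```

A workload scoring "hybrid" is exactly the case where the skill weighs an HTAP database against a transactional store plus a warehouse with an ETL pipeline.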

Partitioning Strategy Advisor
Select the right partitioning (sharding) strategy — range, hash, or compound key — and configure secondary indexes, rebalancing, and request routing for a distributed database. Use when: designing a partition key for a new system; diagnosing write hotspots on monotonically increasing keys (timestamps, auto-increment IDs); evaluating whether an existing sharding scheme supports required query patterns; choosing between document-partitioned (local) vs. term-partitioned (global) secondary indexes and weighing scatter/gather read costs against global index write amplification; or selecting a rebalancing approach (fixed partitions, dynamic partitions, proportional-to-nodes) and routing topology (gossip, ZooKeeper coordination, partition-aware client). Covers Cassandra compound primary key patterns for range queries within hash-distributed partitions, HBase/SSTables range partitioning, Riak consistent hashing, and MongoDB/Elasticsearch index partitioning. Distinct from replication-strategy-selector (topology and consistency) and data-model-selector (schema design). Produces a concrete recommendation: partition key, partitioning method, secondary index approach, rebalancing configuration, and routing topology. Depends on data-model-selector for schema and access pattern context.
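The hotspot problem and its compound-key fix can be shown concretely: range-partitioning a monotonically increasing key sends every write to the latest partition, while hashing a stable key (like user ID) to pick the partition, and ordering by timestamp within it, keeps writes spread yet still supports per-user time-range queries. A Python sketch:

```python
import hashlib

N_PARTITIONS = 8

def partition_for(key):
    """Hash partitioning: spreads keys evenly, at the cost of
    cross-partition range scans."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % N_PARTITIONS

# Compound key, Cassandra-style: hash(user_id) chooses the partition,
# the timestamp orders rows *within* it. Range-partitioning by the raw
# timestamp instead would funnel every new event to one partition.
events = [(user, ts) for user in range(100) for ts in range(10)]
partitions = {p: 0 for p in range(N_PARTITIONS)}
for user, ts in events:
    partitions[partition_for(user)] += 1

busiest = max(partitions.values())   # close to 1000 / 8 = 125, not 1000
```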

Replication Failure Analyzer
Diagnose active replication failures by mapping symptoms to leader failover pitfalls, replication lag anomalies, or quorum edge cases — and produce a structured remediation plan. Use when: data just written disappears or shows stale on re-read (read-after-write violation); records appear and vanish on refresh (monotonic reads violation); causally related events appear in impossible order (consistent prefix reads violation); a failover produces duplicate primary keys, write rejections, or incorrect routing; two replica nodes are both accepting writes (split brain in replication topology — for split brain via distributed locking, use distributed-failure-analyzer); quorum reads return stale values despite w + r > n; or a sloppy quorum with incomplete hinted handoff is serving old data. Applies to PostgreSQL, MySQL (single-leader), Cassandra, Riak, Voldemort, DynamoDB (leaderless). Use replication-strategy-selector first if the topology has not yet been chosen. Produces: symptom → failure class → mechanism → mitigation report, leader failover checklist, replication lag anomaly guide, and quorum edge case catalog (six ways w + r > n still fails).
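The mechanics of a quorum read are worth seeing in miniature: read from r replicas and keep the value with the highest version, relying on the read set overlapping the write set when w + r > n. A Python sketch showing both the healthy overlap and the staleness that appears as soon as the overlap is lost:

```python
def quorum_read(replicas, r):
    """Read from the first r replicas to answer; return the value
    with the highest version number."""
    responses = replicas[:r]
    return max(responses, key=lambda rec: rec["version"])["value"]

# n = 3, w = 2, r = 2, so w + r > n holds.
replicas = [
    {"version": 2, "value": "new"},    # received the latest write
    {"version": 1, "value": "old"},    # lagging
    {"version": 2, "value": "new"},    # received the latest write
]
value = quorum_read(replicas, r=2)       # overlaps a written replica: fresh
stale = quorum_read(replicas[1:], r=1)   # r = 1 breaks the overlap: stale
```

The skill's edge-case catalog enumerates the situations (sloppy quorums, concurrent writes, partial write failures, and others) where even a correctly configured w + r > n returns stale data.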

Replication Strategy Selector
Choose a replication topology (single-leader, multi-leader, or leaderless) and configure it correctly — including sync vs. async mode, quorum parameters (w + r > n), and consistency guarantees. Use when designing replication for a new system, configuring quorum values for Cassandra/Riak/DynamoDB, deciding how to handle multi-leader write conflicts, or comparing PostgreSQL/MySQL streaming replication vs. CouchDB multi-leader vs. Cassandra leaderless for your architecture. Also use for: selecting a conflict resolution strategy (last-write-wins vs. version vectors); designing multi-datacenter replication; choosing between WAL shipping, logical replication, and statement-based replication log formats. For diagnosing an existing replication failure (failover gone wrong, lag spike, quorum misconfiguration, split brain), use replication-failure-analyzer instead. For consistency model selection (eventual vs. causal vs. linearizable), use consistency-model-selector instead. For partitioning strategy, use partitioning-strategy-advisor instead.
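The last-write-wins vs. version-vector trade-off comes down to conflict detection: LWW silently drops one of two concurrent writes, while version vectors detect the concurrency and hand the conflict to the application. A minimal Python sketch of vector comparison and merge:

```python
def merge(vv_a, vv_b):
    """Merge two version vectors: per-replica maximum."""
    keys = set(vv_a) | set(vv_b)
    return {k: max(vv_a.get(k, 0), vv_b.get(k, 0)) for k in keys}

def compare(vv_a, vv_b):
    """'before', 'after', 'equal', or 'concurrent' (a true conflict).
    Last-write-wins would silently discard one side of a concurrent pair."""
    keys = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(k, 0) <= vv_b.get(k, 0) for k in keys)
    b_le_a = all(vv_b.get(k, 0) <= vv_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

# Two replicas each accepted a write without seeing the other's:
cart_a = {"replica1": 2, "replica2": 0}
cart_b = {"replica1": 1, "replica2": 1}
status = compare(cart_a, cart_b)        # "concurrent" -> app must resolve
merged_clock = merge(cart_a, cart_b)    # clock for the resolved value
```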

Storage Engine Selector
Select the right storage engine architecture (LSM-tree, B-tree, or in-memory) for a database workload using a 7-dimensional scored trade-off analysis. Use when evaluating RocksDB vs InnoDB vs LevelDB, diagnosing write amplification in production, choosing between write-optimized vs read-optimized storage, selecting a compaction strategy (size-tiered vs leveled), or deciding whether to skip disk with an in-memory database. Also use for: comparing Cassandra vs PostgreSQL storage internals; justifying an existing engine choice to a team; assessing whether compaction pauses are causing latency spikes. Covers LSM-tree family (LevelDB, RocksDB, Cassandra, HBase), B-tree family (PostgreSQL, MySQL InnoDB, SQLite), and in-memory stores (Redis, Memcached, VoltDB). For choosing between relational/document/graph models, use data-model-selector instead. For OLTP vs. analytics routing, use oltp-olap-workload-classifier instead. For replication topology, use replication-strategy-selector instead.
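The write-optimized/read-cost trade-off at the heart of the LSM-tree can be sketched in a toy: writes land in an in-memory memtable, full memtables flush as immutable sorted runs (SSTables), and reads check the memtable then runs from newest to oldest. This sketch omits the WAL, bloom filters, and compaction that real engines need:

```python
import bisect

class MiniLSM:
    """Toy LSM-tree: memtable plus immutable sorted runs."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []            # list of sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            run = sorted(self.memtable.items())   # one sequential, sorted write
            self.sstables.append(run)
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):       # newest run shadows older
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = MiniLSM()
db.put("b", 1)
db.put("a", 2)          # memtable full -> flushed as one sorted run
db.put("b", 3)          # newer value shadows the flushed one
```

Reads that must scan many runs are what compaction (size-tiered or leveled) exists to bound, which is where the compaction-strategy part of the skill picks up.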

Stream Processing Designer
Design a stream processing system for unbounded, continuously arriving data. Use when choosing a message broker (Kafka vs RabbitMQ), implementing change data capture (CDC) from PostgreSQL, MySQL, or MongoDB via Debezium or Maxwell, selecting window types for aggregation (tumbling, hopping, sliding, session), joining event streams or enriching events from a table, or configuring exactly-once fault tolerance. Trigger phrases: "should I use Kafka or RabbitMQ?", "how do I sync my database to Elasticsearch in real time?", "how do I implement CDC for Postgres?", "how do I get exactly-once semantics in Flink or Kafka Streams?", "should I use Lambda or Kappa architecture?", "how do I keep derived data systems in sync without dual writes?", "how do I join two event streams?". Covers log-based vs. traditional broker selection, four window types, three join types (stream-stream, stream-table, table-table), CDC bootstrap strategy, and microbatching vs. checkpointing trade-offs. Does not apply to bounded offline datasets (see batch-pipeline-designer) or multi-store integration architecture (see data-integration-architect).
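The simplest of the four window types, the tumbling window, illustrates the key design choice of keying aggregation by event time rather than processing time, so late or replayed events still land in the correct window. A Python sketch:

```python
from collections import defaultdict

WINDOW = 60   # seconds: tumbling windows are fixed-size and non-overlapping

def tumbling_counts(events):
    """Count events per 1-minute tumbling window, keyed by event time."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // WINDOW) * WINDOW   # floor to window boundary
        counts[window_start] += 1
    return dict(counts)

# (event_time_seconds, payload) pairs, possibly arriving out of order:
clicks = [(5, "a"), (59, "b"), (60, "c"), (61, "d"), (130, "e")]
per_minute = tumbling_counts(clicks)
```

Hopping and sliding windows assign each event to multiple overlapping windows, and session windows have data-driven boundaries; the skill's selection criteria cover when each is appropriate.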

Transaction Isolation Selector
Choose the correct transaction isolation level and serializability implementation for an application's concurrency patterns. Use when: selecting an isolation level for a new system; evaluating whether read committed or snapshot isolation is safe for your access patterns; deciding whether to upgrade to serializable and choosing between two-phase locking (2PL) vs. serializable snapshot isolation (SSI); producing an architecture decision record for isolation level choice; or explaining to a team why the database default is insufficient. Distinct from concurrency-anomaly-detector (which scans code for exposed anomalies) — this skill selects the level, not the bugs. Covers PostgreSQL, MySQL InnoDB, Oracle, SQL Server, and distributed databases. Applies a 6-anomaly × 4-isolation-level mapping matrix (dirty read, dirty write, read skew, lost update, write skew, phantom read vs. read uncommitted, read committed, snapshot isolation, serializable) to produce a concrete recommendation with implementation trade-off analysis. Works on any codebase, schema, or workload description.
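Write skew, the anomaly that most often forces the upgrade to serializable, is easy to replay: two transactions each read the same snapshot, see an invariant satisfied, and write different rows, so neither write conflicts, yet together they break the invariant. A Python simulation of the classic on-call-doctors example (only serializable isolation, via 2PL or SSI, would abort one of them):

```python
# Invariant: at least one doctor must remain on call.
on_call = {"alice": True, "bob": True}

def request_leave(doctor, snapshot):
    # Each transaction checks the invariant against its own snapshot...
    if sum(snapshot.values()) >= 2:
        on_call[doctor] = False     # ...then updates only its own row.

snapshot = dict(on_call)            # both transactions start concurrently
request_leave("alice", snapshot)    # sees 2 on call -> leave granted
request_leave("bob", snapshot)      # also sees 2 on call -> leave granted
doctors_on_call = sum(on_call.values())   # 0: invariant violated
```

Snapshot isolation cannot catch this because the two writes touch disjoint rows; that is exactly the 6-anomaly x 4-level matrix cell that drives the serializable recommendation.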
Browse on ClawhHub