Distributed Failure Analyzer
Diagnose distributed-system failures caused by network faults, unreliable clocks, or process pauses, and map each symptom to its correct mitigation.

Use when:
- A node is intermittently timing out with no clear network outage.
- A lock holder or leader keeps acting after being declared dead (zombie leader, or split brain caused by distributed locking rather than replication topology; for replica split brain, use replication-failure-analyzer).
- Stale reads persist beyond expected replication lag.
- Wall-clock-based lease checks or last-write-wins conflict resolution are producing data loss under clock skew.
- Cascading node-death declarations are occurring under load.

Also use proactively to audit timing assumptions in new system designs: absence of fencing tokens, NTP drift exposure, GC pause risk.

Distinct from replication-failure-analyzer, which covers replication lag anomalies, failover pitfalls, and quorum edge cases.

Produces a structured failure report: symptom → fault category → mechanism → mitigation.

Covers: asynchronous network behavior; timeout tuning and cascade risk; NTP drift and clock-jump mechanics; process pause causes (GC, VM migration, paging, SIGSTOP); fencing tokens with ZooKeeper zxid/cversion; Byzantine fault scoping; and system-model selection (crash-stop vs. crash-recovery vs. Byzantine; synchronous vs. partially synchronous vs. asynchronous).
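The fencing-token mitigation mentioned above can be illustrated with a minimal sketch. This is a hypothetical in-memory store, not a real API: the `FencedStore` class and its `write` method are invented names for illustration. The assumption is that token values come from the lock service (for example ZooKeeper's zxid or a znode's cversion) and increase monotonically with each lock acquisition, so a client resuming from a long pause presents a stale token and is rejected rather than silently corrupting data.

```python
class FencedStore:
    """Hypothetical storage endpoint that rejects writes carrying a
    fencing token older than the newest token it has seen."""

    def __init__(self):
        self.highest_token = -1  # no writes accepted yet
        self.data = {}

    def write(self, key, value, token):
        # A client that was declared dead (e.g. stalled in a GC pause)
        # and wakes up later still holds its old token; the comparison
        # below fences it off even though it believes its lease is valid.
        if token < self.highest_token:
            raise PermissionError(
                f"stale fencing token {token} < {self.highest_token}"
            )
        self.highest_token = token
        self.data[key] = value


store = FencedStore()
store.write("k", "v1", token=33)       # client A holds the lock
store.write("k", "v2", token=34)       # lease expired; client B acquired it
try:
    store.write("k", "v1-late", token=33)  # client A resumes after a pause
except PermissionError:
    pass                                # stale write rejected, data intact
```

The essential design point is that the store, not the lock service, performs the final check: the lock service cannot reach a paused client to revoke anything, but every write path can compare tokens locally.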
