Heartbeats: Knowing a Node Is Alive
How heartbeats help systems detect failures without confusing slow with dead.
Articles page 6: more writing on software architecture, data, distributed systems, runtime choices, reliability, and engineering leadership.
A simple explanation of gossip protocols and why they are useful for spreading cluster state.
How fencing tokens protect shared resources when an old leader wakes up after a pause.
A practical explanation of eventual consistency and how to make delayed updates understandable to users.
Why deterministic processing makes replay, recovery, testing, and distributed debugging much easier.
A practical explanation of the CAP theorem through the choice a system makes during a network partition.
A plain explanation of Byzantine faults, where a participant may lie, corrupt data, or send conflicting answers.
A practical explanation of availability, graceful degradation, and what users should still be able to do when part of a system is unhealthy.
A plain-language glossary of distributed systems terms, from availability and CAP to Lamport time, consensus, ordering, serialized transactions, and ZooKeeper.