I built a Distributed Key/Value Database in Golang as part of my learning distributed databases systems. It is a distributed, fault-tolerant key-value store in Go with consistent hashing, gossip-based failure detection, leader elections, and anti-entropy sync. It uses pebble as the data storage engine.
Why Build a Database?
Reading about distributed systems is one thing. building something actually helps you to understand it better. The initial phase of the project was me just reading research papers to understand the concept and technique used by large distributed databases. A few of the papers that i would recommend reading are
CockroachDB: The Resilient Geo-Distributed SQL Database
Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service
In Search of an Understandable Consensus Algorithm
The Google File System
This Project is purely for learning. It is not intended for any production uses. This database only support string data.
Storage Layer
for storage we used pebble data storage engine. Pebble is developed by CockraochDb team, It is a LSM tree based storage engine designed for heavy write workloads because writes go to an in-memory buffer first which gets flushed to disk in sorted runs. First i tried using RocksDB but it was little complex to setup so i migrated to pebble.
Understanding IrisDb Node Roles
Before diving to the architecture, it helps to understand the different roles a node can play in IrisDb.
a) Coordinator Node
This is the first node that bootstraps the cluster. It isn't a permanent node initially it is the first node that starts up and holds the initial metadata. For a node to join it should send a join request to the Coordinator node.
- Owns the full slot range (0–16383) when the cluster starts
- Manages the Two-Phase Join Protocol when new nodes want to join
- Validates incoming join requests and computes new slot allocations
- Broadcasts cluster configuration changes to all peers
b) Slot Range Master
Every slot range (0–16384slots) has exactly one master node and several replica nodes for that specific slot range.
- accepts all write requests (SET, DEL) for keys that fall in its slot range
- reads data locally from its pebble instance for GET requests
- replicates every write to its assigned replica nodes
- sends heartbeat responses to replicas
- runs anti-entropy checks every 60 seconds against its replicas
If you write a key and the receiving node is not the slot master for that key, it immediately redirects your request to the correct master. You never write to the wrong place.
c) Slot Range Replica
Every slot range also has one or more replica nodes assigned to it, depending on your configured replication factor.
What it does:
- Receives and stores replicated writes from the master
- Sends heartbeat pings to the master every 15 seconds
- Monitors master health and initiates failover if the master goes down
- Can be promoted to master if the slot range master fails
- Participates in anti-entropy sync to detect data drift
Sharding: Consistent Hashing with 16,384 Virtual Slots
This database uses a 16,384 Virtual key slot range model which is developed by redis and used in their clusters. Every key is mapped to a slot using CRC16.
slot = CRC16(key) % 16384
Each nodes in the cluster owns one or more range of key slots. When a new node joins the cluster, the node with the lowest resource score splits its range and hands half the slots to the new node. Why 16,384? It's large enough that slot ranges divide evenly across many nodes, but small enough that slot ownership metadata stays compact.
Detailed Request Routing & Replication Flow (Write/Read)
When a client sends commands it passes through a precise pipeline before the data is actually written or read. Below is the 5 phases of the pipeline.
1. Slot calculation
Every key is passed to a CRC16 function which maps the key to a range from 0 to 16384. This maps the key to a specific slot range which will be handled by some node x. same key always hits the same slot.
2. Local Metadata Lookup
The node checks its local slot range metadata to find out which node the slot belongs to.
3. Is This Node the Master?
- No → The request is immediately redirected to the correct master node. The client is transparent to this it just gets a response.
- Yes → The node handles the request directly.
4. Read vs Write
- Read (GET): The node reads the data from its local Pebble instance and returns the value.
- Write (SET/DEL): The node writes to master node first, then checks the replication factor and replicates the data to its replica node.
5. Replication
- Replication Factor = 0 → Write is complete. Return success immediately.
- Replication Factor > 0 → The master node looks up which replica nodes are assigned to this slot from the local metadata then sends a
REPLICATION CMDto each of them and waits for ACKs. - ACKs met → Return success to client.
- ACKs not met → Return error. The write is considered failed.
Detailed Two-Phase Node Join Protocol
The naive approach is to just send every node that node Z exists now but it doesn't work in real world. As it causes inconsistency in the cluster where different nodes disagrees on who owns which slots.
I implemented a Two-Phase Join Protocol:
Phase 1 Prepare: The joining node contacts the master Coordinator responsible for handling all the node joins and exits. It validates the request, computes new slot allocations, and sends a PREPARE message to all live nodes. No changes are committed yet.
Phase 2 Commit: Once every nodes acknowledge the prepare, the coordinator sends a COMMIT message to every node in the cluster. Only now do nodes update their slot ownership and begin routing traffic to the new node.
When a new Node joins the Master slot range and the replica slot ranges it handles are determined by resource score. It splits the slot range of the node which has the lowest resource score. making it redistribute its keys.
After successful join the newly joined server will update its local metadata and also receive the assigned slot ranges and the data.
Heartbeat, Leader Suspect, and Quorum Election Flow
In a distributed database environment we need to detect if a node is alive, dead or just slow. IrisDb solves this issues with 2 Techniques:
Stage 1: Heartbeats
Every node sends a heartbeat to the coordinator Master node every 15sec.
- Master responds in time → The replica resets its missed count to zero and everything is healthy.
- Master doesn't respond → The node increment a
failed_attemptscounter and waits for the next tick. if it crosses a certain threshold it will remove the node from the cluster.
Stage 2: Suspect Messages, Quorum Agreement & Leader Election
This is used by the nodes in the cluster to check if the master node is failed or not. If the master node is not responding the node will raise a SUSPECT_LEADER message to the next node with the highest Resource Score. When the node gets enough suspect messages it will send REQ_VOTE to all the peers in the cluster.
- if quorum is reached it will become the master coordinator.
- else Other nodes can still reach the master. The failover is cancelled this was likely a network partition isolated to few nodes. Heartbeats continue normally.
Anti-Entropy Checksum & Sync Flow
Even with synchronous replication, replicas can drift a network partition during a write, a disk flush issue, an edge case in the replication logic. This is called entropy, and it's insidious because it's silent.
Every 60 seconds, each master node of each slot range runs an anti-entropy check:
- Compute a XOR checksum + key count for all keys in its slot range
- Ask each replica for their checksum and key count
- If there's a discrepancy → trigger a background re-sync of the full range
XOR checksums are cheap to compute and good enough to detect divergence. If they match, we're good. If they don't, we know exactly which range needs to be re-synced.
Gossip & Resource Score Calculation
Most distributed databases treat all nodes as equal. IrisDb does not. A node running on a 2core VM with 4GB RAM shouldn't own the same number of slots as a 16core bare metal server with 128GB RAM. And a node that's been dropping 30% of its connections should not be trusted with replica placement.
IrisDb solves this with a dynamic resource scoring system that runs every 30 seconds on every node and propagates scores across the cluster using a gossip protocol.
Every 30sec the node measures its capabilities. No external agent or centralized metric server. All the nodes are self aware. The node reads three metrics directly from the OS, RAM, DISK and CPU.
CapacityScore = 0.5 × log₂(1 + RAM_GiB)
+ 0.3 × CPU_Cores
+ 0.2 × log₂(1 + Disk_GiB)
without logarithmic scaling a node with 1tb of disk space will dominate the score and monopolizes slot ranges. A weight of 0.5 is given to the RAM as in pebble data is cached in memtable before flushing to the disk. 0.3 is given to CPU as it determines compaction speed, connection handling, and election responsiveness. 0.2 for the disk it is important but less since SSD have a disk performance differences.
Hardware specs isn't enough to determine the overall score of a node. A node with great hardware but a flaky network connection is still a bad choice for slot assignment. Here the node tracks:
- Client success rate: what percentage of client facing requests completed successfully
- Peer success rate: what percentage of inter node bus communications succeeded
- Average latency: measured response time on the peer bus, converted to a 0–1 score (lower latency = higher score)
NetworkFactor = 0.4 × ClientSuccessRate
+ 0.4 × PeerSuccessRate
+ 0.2 × LatencyScore
Cold start handling: New nodes have no network traffic yet, so they start with a NetworkFactor of 1.0. This prevents a healthy new node from being penalized just because it hasn't served any traffic yet.
After computing the new score, the node updates its local stats and increments a version number. The version is critical for gossip. when a peer receives a score update, it checks whether the incoming version is newer than what it already has. Stale updates are silently dropped.
IrisDb doesn't use a centralized metrics server. Instead, it uses gossip. the same informal information-spreading mechanism humans use. The node sends its ResourceScoreEvent to a gossip channel which then randomly selects a few active peer nodes and exchanges scores with them.
Those peers update their internal node metadata maps, and on their next gossip cycle, they will spread the information further. Within a few rounds, every node in the cluster has an up-to-date picture of every other node's health without any single point of coordination.
This is sometimes called epidemic propagation. information spreads through the cluster the way a rumor spreads through a crowd. It's eventual consistency for cluster metadata.
Every 30 seconds this entire process runs on every node:
- Each node measures its own hardware and network health
- Computes a single representative score
- Gossips that score to a random subset of peers
- Peers propagate it further
The result is a cluster where every node has a continuously updated, metadata of the health of every other node with no central coordinator, no monitoring agent, and no single point of failure.
What I Learned
Distributed systems are all about failure, We can access the capability of the distributed environment on how it handles different types of scenarios like network partition, disk failure. Unit tests don't actually test the capability of the distributed environment. best way to test is using choas tests instead of unit tests, scripts that randomly kill nodes, introduce latency, and drop packets rather than relying on unit tests that always run in a clean environment. If I were to do this again, I'd write chaos tests from day one instead of treating them as an afterthought.
Many distributed databases uses consistent hashing plainly with considering the nodes capabilities. while it does have its advantages of maintaining little metadata but with replica nodes this quickly reduces the advantages as failure of node can lead to multiple redistribution of the keys. That why i used a cockroachDB slot based distribution integrated with ResourceScore to better distribute the slot ranges.
Building this project really humbled me. I started this proejct thinking i understood the theory but i realized theory is the easy part and the hard part is actually building the system. IrisDb is not production ready. It's a learning project, and proudly so. But every line of code in it taught me something that no paper or tutorial could.
Let's Connect
Link to the repository: github.com/leoantony72/irisDb
Have any questions about the architecture, design decisions, or anything else about IrisDb? Feel free to reach out
- Twitter/X: https://x.com/serpico_z
- Email: leoantony102@gmail.com