The ABCs of etcd

Bibin Wilson
February 04, 2025

One of the key focus areas for DevOps engineers is System Design.

In this edition, I want to share some essential design concepts related to etcd, a critical component of Kubernetes.

Before etcd became a core part of Kubernetes, I had set up an etcd cluster for service discovery (similar to Consul). Interestingly, etcd was created in 2013, a year before Google launched Kubernetes in 2014.

This edition will help you answer important Kubernetes etcd design questions, such as:

How does Kubernetes store its state in etcd’s key-value storage?
Storage limitations of etcd?
How would you design an etcd cluster to handle the failure of two nodes?
When should you choose an external etcd cluster over a stacked topology?

Lets dive in.

Why etcd?

Kubernetes is a distributed system.

It requires a consistent, reliable, and highly available datastore to store cluster state, configuration, and metadata.

A distributed system like Kubernetes must prevent data inconsistencies, split-brain scenarios etc.

This makes etcd an ideal choice for Kubernetes.

etcd Architecture

etcd is a distributed, strongly consistent key-value store that serves as the single source of truth for Kubernetes. It follows the Raft consensus algorithm to maintain consistency across multiple nodes.

💡 Strong Consistency: In a distributed system, when an update is made to one node, strong consistency ensures that all other nodes in the cluster are updated immediately. This guarantees that all nodes reflect the same data at any given time.

Following are the key components

Leader-Follower Model: The leader node processes all writes, while followers replicate data.
Key-Value Storage: Kubernetes stores its entire state in etcd’s hierarchical key space. The data store is built on top of BboltDB, a fork of BoltDB, known for its high performance and reliability.
Watch Mechanism: Kubernetes continuously watches etcd for changes to apply updates in real-time.
gRPC API: Provides access to store and retrieve cluster state.

etcd Database

etcd Storage Space is Limited. The default storage size limit is 2 GiB

The recommended maximum etcd data store size is 8 GiB.

etcd used Multi-Version Concurrency Control (MVCC).

Meaning, every object update creates a new version instead of modifying the existing one.

This append-only model may cause the database to grow over time because historical versions are retained.

Let's say you have an object that takes up xMB of space. When you update it, MVCC creates a new version instead of overwriting.

The old version's space is marked as deleted but not actually freed. Over time, you end up with many small pockets of unused space (Fragmentation)

To prevent the database from growing indefinitely, etcd supports Compaction. It removes old revisions of data that are no longer needed.

The k8s API server does compaction every 5 mins by default (configurable value)

Also, periodic Defragmentation reclaims disk space from deleted versions.

These are essential concepts for teams managing in-house, self-managed Kubernetes clusters, including etcd clusters.

When it comes to managed k8s services like EKS, there are workflows that does defragmentation when etcd runs out of space. (read from their official blog)

High Availability Architecture

Kubernetes clusters rely on etcd’s high availability (HA) architecture to prevent data loss.

There are two key deployment models for etcd HA.

Stacked etcd: Runs alongside control plane nodes (simpler, but less isolated).
External etcd: Etcd cluster running dedicated nodes. This model has the advantage of well-managed backup and restore options (better resilience and scalability).

Quorum & Fault Tolerance

💡 A quorum is the concept of minimum number of members required in a distributed system to make a decision, ensure consistency, or maintain reliability.

It is widely used in distributed databases, such as MongoDB clusters, Cassandra, and etcd (used in Kubernetes), as well as clustered architectures that require fault tolerance and consistency.

In etcd, quorum is used to ensure consistency and availability in the face of node failures.

The quorum is calculated as:

quorum = (n / 2) + 1

Where n is the total number of nodes in the cluster.

For instance, to tolerate the failure of one node, a minimum of three etcd nodes is required. To withstand two node failures, you would need at least five nodes, and so on.

The number of nodes in an etcd cluster directly affects its fault tolerance. Here's how it breaks down:

3 nodes: Can tolerate 1 node failure (quorum = 2)
5 nodes: Can tolerate 2 node failures (quorum = 3)
7 nodes: Can tolerate 3 node failures (quorum = 4)

And so on. The general formula for the number of node failures a cluster can tolerate is:

fault tolerance = (n - 1) / 2

💡 Adding more etcd nodes does not improve performance. Becuase writes require Raft agreement, so more nodes increase replication overhead.

Kubernetes recommends a five-node etcd cluster for optimal performance

Wrapping Up

In distributed systems, concepts like consistency, quorum, and fault tolerance are fundamental.

Even when setting up MongoDB clusters, you need to design them in a way that ensures node failures don’t compromise data integrity.

As DevOps engineers, these are essential concepts to understand. Usually, you get to apply them when working on real-world projects, especially when setting up systems from scratch.

I hope you found this edition useful!

In tomorrow’s edition, I’ll cover an interesting concept (split brain scenario) that is commonly asked in interviews and one you might encounter when designing clustered systems.

Stay tuned! 🚀

References

Reply

or to participate.