Maglev: A Fast and Reliable Software Network Load Balancer

The Problem

Every Google service receives traffic through load balancers. Hardware load balancers work fine at small scale but become bottlenecks when you need to handle millions of connections per second. They also create single points of failure.

Maglev is a software load balancer deployed on commodity Linux servers. Each Maglev machine announces the same IP address via BGP, and routers use ECMP (Equal Cost Multi-Path) to spray packets across all available Maglev instances. The challenge: how do you ensure that all packets belonging to the same connection reach the same backend, even when Maglev instances come and go?

Connection Consistency

When a client establishes a TCP connection, all packets in that connection must reach the same backend server. If packet 47 of a connection arrives at a different backend than packets 1-46, the backend will reset the connection because it has no state for that flow.

Traditional load balancers track connection state: "connection X goes to backend Y." But this approach fails when load balancer instances are added or removed. ECMP may then steer an existing flow to a different instance, and that instance has no knowledge of the connection.

Maglev uses consistent hashing to make the same forwarding decision regardless of which instance handles the packet. Given a packet's 5-tuple (source IP, source port, destination IP, destination port, protocol), any Maglev instance computes the same backend.
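As a minimal sketch of the "same decision on every instance" property, the function below hashes a 5-tuple with a fixed, unsalted hash. The function name `five_tuple_hash` and the choice of BLAKE2b are assumptions made for illustration; the text does not say which hash Maglev uses. Python's built-in `hash()` would not work here, because it is salted per process for strings and bytes.

```python
import hashlib
import ipaddress
import struct


def five_tuple_hash(src_ip, src_port, dst_ip, dst_port, protocol):
    """Return a 64-bit hash that is identical on every machine for the same 5-tuple."""
    packed = (
        ipaddress.ip_address(src_ip).packed      # 4 bytes (IPv4) or 16 bytes (IPv6)
        + struct.pack("!H", src_port)            # 16-bit port, network byte order
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!H", dst_port)
        + struct.pack("!B", protocol)            # e.g. 6 for TCP, 17 for UDP
    )
    digest = hashlib.blake2b(packed, digest_size=8).digest()
    return int.from_bytes(digest, "big")


# Two independent "instances" evaluating the same packet agree on the hash.
flow = ("198.51.100.7", 40000, "203.0.113.10", 443, 6)
hash_on_instance_a = five_tuple_hash(*flow)
hash_on_instance_b = five_tuple_hash(*flow)
print(hash_on_instance_a == hash_on_instance_b)  # True
```

Because the hash depends only on packet fields, no coordination between instances is needed to reach the same value.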

Maglev Hashing

Standard consistent hashing has a problem. When backends change, you want to minimize disruption: if you have 100 backends and one fails, ideally only 1% of connections should be affected. Ring-based consistent hashing provides that, but with small lookup tables it doesn't guarantee that load is spread evenly across backends.

Maglev constructs a lookup table $M$ of size $N$ where each entry maps to a backend. For a packet hash $h$:

$$\text{backend} = M[h \bmod N]$$

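The table is constructed so that each backend owns an almost equal number of entries and so that, when a backend is added or removed, few entries change beyond those that must. With $N = 65537$ (a prime, which guarantees that each backend's slot sequence in the generation algorithm visits every slot), Maglev achieves near-minimal disruption.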

Table Generation Algorithm

Each backend derives two numbers from its name, an offset and a skip, which together define a "permutation" of the table slots: the order in which the backend prefers them. For backend $i$:

$$\text{offset}_i = h_1(\text{name}_i) \bmod N$$

$$\text{skip}_i = \left(h_2(\text{name}_i) \bmod (N-1)\right) + 1$$

Backend $i$'s preference list is then:

$$\text{permutation}[i][j] = (\text{offset}_i + j \times \text{skip}_i) \bmod N$$
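A small worked example may make the formula concrete. The values below (offset 3, skip 4, table size 7) are picked by hand for illustration rather than computed from a backend name. The second call uses a non-prime table size to show why a prime $N$ matters: the skip shares a factor with the size, so the sequence never reaches most slots.

```python
def permutation(offset, skip, n):
    """Preference list: slot visited at step j is (offset + j * skip) mod n."""
    return [(offset + j * skip) % n for j in range(n)]


prefs = permutation(offset=3, skip=4, n=7)
print(prefs)                             # [3, 0, 4, 1, 5, 2, 6]
print(sorted(prefs) == list(range(7)))   # True: every slot appears exactly once

# With a non-prime size, a skip that shares a factor with n cycles early.
stuck = permutation(offset=3, skip=4, n=8)
print(sorted(set(stuck)))                # [3, 7]
```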

The algorithm fills the table by letting each backend claim its most preferred empty slot in round-robin fashion. When a backend disappears, its slots are redistributed to other backends while preserving existing assignments as much as possible.
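The fill procedure described above can be sketched as follows. This is an illustrative reading of the text, not the production implementation: the helper `_h`, the salted BLAKE2b stand-ins for $h_1$ and $h_2$, and the backend names are all assumptions. The demo builds a table for 100 hypothetical backends, rebuilds it without one of them, and prints the fraction of slots that changed.

```python
import hashlib


def _h(name, salt):
    """Stand-in for h1/h2: a salted 64-bit hash of the backend name (an assumption)."""
    data = (salt + name).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def populate(backends, n):
    """Build a lookup table of n slots (n prime) mapping slot index -> backend name."""
    if not backends:
        raise ValueError("need at least one backend")
    offsets = [_h(b, "offset:") % n for b in backends]
    skips = [_h(b, "skip:") % (n - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)   # position reached in each preference list
    table = [None] * n
    filled = 0
    while True:
        # Round-robin: each backend in turn claims its most preferred empty slot.
        for i, name in enumerate(backends):
            while True:
                slot = (offsets[i] + next_idx[i] * skips[i]) % n
                next_idx[i] += 1
                if table[slot] is None:
                    break
            table[slot] = name
            filled += 1
            if filled == n:
                return table


N = 65537
backends = [f"backend-{i}" for i in range(100)]
before = populate(backends, N)
after = populate(backends[1:], N)    # rebuild after "backend-0" disappears
changed = sum(1 for a, b in zip(before, after) if a != b)
print(f"{changed / N:.2%} of slots changed")
```

Because backends take turns, their slot counts differ by at most one. The slots owned by the removed backend must change; any change beyond that share is disruption the construction did not avoid.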

Connection Tracking for Edge Cases

Consistent hashing handles most cases, but what about when the set of backends changes during an active connection? Rebuilding the lookup table can reassign a few slots whose previous backend is still healthy, and existing connections that hash to those slots shouldn't be migrated.

Maglev maintains a small connection tracking table that records recent 5-tuples and their assigned backends. This table acts as a cache: if a connection is found here and the recorded backend is still healthy, use the recorded backend. Otherwise, use the lookup table and record the result.

The connection tracking table has fixed size and uses LRU eviction. Long-lived connections refresh their entries periodically.
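A fixed-size table with LRU eviction can be sketched with `collections.OrderedDict`. The class `ConnTrackTable`, the function `select_backend`, and the three-slot lookup table are hypothetical names and data chosen for illustration, and backend health checking is left out to keep the sketch short.

```python
from collections import OrderedDict


class ConnTrackTable:
    """Fixed-capacity map from 5-tuple to backend with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, five_tuple):
        backend = self._entries.get(five_tuple)
        if backend is not None:
            self._entries.move_to_end(five_tuple)   # a hit refreshes the entry
        return backend

    def put(self, five_tuple, backend):
        self._entries[five_tuple] = backend
        self._entries.move_to_end(five_tuple)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)       # evict the least recently used


def select_backend(five_tuple, flow_hash, tracker, table):
    """Prefer the tracked backend; otherwise consult the lookup table and record it."""
    backend = tracker.get(five_tuple)
    if backend is None:
        backend = table[flow_hash % len(table)]
        tracker.put(five_tuple, backend)
    return backend


tracker = ConnTrackTable(capacity=2)
flow_a = ("10.0.0.1", 40000, "203.0.113.10", 443, 6)
flow_b = ("10.0.0.2", 40000, "203.0.113.10", 443, 6)
flow_c = ("10.0.0.3", 40000, "203.0.113.10", 443, 6)

table = ["b0", "b1", "b2"]
first = select_backend(flow_a, 4, tracker, table)       # slot 4 % 3 == 1
table = ["b0", "b2", "b2"]                              # table rebuilt, slot 1 moved
again = select_backend(flow_a, 4, tracker, table)       # tracked, so unchanged
newcomer = select_backend(flow_b, 4, tracker, table)    # untracked, uses new table
print(first, again, newcomer)                           # b1 b1 b2

select_backend(flow_c, 5, tracker, table)               # third flow evicts flow_a
print(tracker.get(flow_a))                              # None
```

The existing connection keeps its backend across the table rebuild, while a new connection with the same hash follows the rebuilt table.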

Performance

A single Maglev machine handles on the order of 10 million small packets per second, spread across its packet-processing cores. The paper's benchmarks show that this is enough to saturate a 10Gbps link with small packets.
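As a rough sanity check on what saturating a 10Gbps link means in packets per second, the arithmetic below computes the Ethernet line rate for a given IP packet size. The 100-byte and 1500-byte sizes are assumptions chosen to represent small and large packets; the per-frame overhead is the standard Ethernet preamble, start-of-frame delimiter, header, frame check sequence, and inter-frame gap.

```python
# Bytes on the wire per frame beyond the IP packet itself:
# preamble (7) + start-of-frame delimiter (1) + Ethernet header (14)
# + frame check sequence (4) + inter-frame gap (12)
ETHERNET_OVERHEAD_BYTES = 7 + 1 + 14 + 4 + 12


def line_rate_pps(link_bps, ip_packet_bytes):
    """Maximum packets per second a link can carry for a given IP packet size."""
    bits_on_wire = (ip_packet_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


small = round(line_rate_pps(10e9, 100))
large = round(line_rate_pps(10e9, 1500))
print(small)   # 9057971
print(large)   # 812744
```

With small packets the link tops out at roughly 9 million packets per second, so packet rate rather than bandwidth is the demanding case for a software forwarder.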

The lookup is essentially:

Hash the 5-tuple: $O(1)$
Check connection tracking table: $O(1)$
Lookup in Maglev table: $O(1)$

Total per-packet processing is constant time with small constants.

Why Not Hardware?

Hardware load balancers from vendors like F5 or Citrix cost hundreds of thousands of dollars, have fixed capacity, and require specialized expertise to manage. Maglev runs on commodity servers that Google already operates at scale.

When traffic spikes, Google can add more Maglev instances. When traffic drops, those machines can run other workloads. Hardware appliances just sit idle.

The software approach also enables rapid iteration. Google can deploy bug fixes or new features across all Maglev instances in hours. Hardware firmware updates are far more disruptive.

Lessons

The key insight is that consistent hashing makes stateless load balancing possible. Each Maglev instance can make correct forwarding decisions independently, without coordinating with other instances. This independence is what enables horizontal scaling.

Connection tracking is the pragmatic concession: some state is necessary for handling backend changes gracefully. But the state is local and lossy. If an instance loses its connection tracking table, the lookup table still maps most connections to the backend they were already using.