- Published on
- ·5 min read
Cache Made Consistent: Meta's Cache Invalidation Solution
- Authors

- Name
- Shahbaz Zaidi
- @_zaidi_shahbaz
The Problem Everyone Ignores
"There are only two hard things in Computer Science: cache invalidation and naming things." This quote gets laughed at in interviews but represents a real problem at Meta's scale.
Meta runs one of the largest caching infrastructures in the world. TAO (their social graph cache) handles billions of reads per second. When a user updates their profile or posts content, every cached copy of that data must be invalidated. Miss one cache and users see stale data.
The naive approach is to invalidate all caches when data changes. But with thousands of cache servers across multiple regions, invalidation messages can be delayed, lost, or arrive out of order. A cache might serve stale data for minutes or hours.
Consistency Model
Meta wanted a consistency model where cache reads always return data at least as fresh as the most recent write the client has observed. This is called "read-your-writes" consistency.
Formally, if a client performs write at time and later reads at time , the read must return either:
- The result of , or
- The result of a write that happened after
This seems obvious but is surprisingly hard to guarantee when:
- Caches are distributed across continents
- Database replication is asynchronous
- Network partitions happen regularly
Version Tracking
Polaris assigns a monotonically increasing version number to each cached object. When data is written to the database, it gets a version . Any cache holding version is stale.
The version is a logical timestamp derived from the database's write-ahead log (WAL) position. Every write to MySQL increments the log sequence number (LSN). Polaris uses this LSN as the version.
Invalidation Pipeline
When a write occurs:
- The database records the write and its version
- An invalidation event is published: (key, )
- Polaris delivers the event to all cache regions
- Each cache server compares with its local version
- If local version is less than , the entry is invalidated
The challenge is ensuring events are delivered reliably. Polaris uses a combination of persistent queues and polling. Caches that miss an invalidation event will eventually discover staleness through periodic version checks.
Handling Out-of-Order Delivery
Network delays can cause invalidation events to arrive out of order. Consider:
- Write with version 100 happens
- Write with version 101 happens
- Cache receives invalidation for version 101
- Cache receives invalidation for version 100 (delayed)
If the cache blindly processes events in arrival order, the second event might incorrectly invalidate data that was already updated by the first event.
Polaris solves this by comparing versions before invalidating. An invalidation for version only takes effect if the cache's current version is less than . The delayed event for version 100 is a no-op because the cache already knows about version 101.
Cross-Region Consistency
Meta's infrastructure spans multiple geographic regions. A user in Europe might write data that is read by a user in Asia. The European write must propagate to Asian caches.
Database replication is asynchronous. The Asian replica might lag behind the European primary by several seconds. If the Asian cache queries its local replica, it might get stale data and cache it.
Polaris tracks replication lag per region. When serving a read, the system checks if the local replica has caught up to the version the client expects. If not, the request is routed to a region with fresher data.
Cache Stampede Prevention
When a popular cache entry is invalidated, many concurrent requests might try to repopulate it simultaneously. Each request queries the database, wasting resources.
Polaris uses a lease mechanism. The first request to find an empty cache entry acquires a lease and is responsible for fetching from the database. Subsequent requests wait for the lease holder to populate the cache.
The lease has a timeout. If the lease holder crashes or is slow, another request can acquire the lease after timeout. This prevents deadlock while minimizing database load.
Operational Lessons
Meta found that most consistency bugs came from edge cases:
- Cache servers restarting with cold caches
- Network partitions isolating a cache from invalidation events
- Clock skew between regions causing version comparisons to fail
Polaris includes monitoring that detects when a cache's version falls too far behind. Alerts fire before users notice staleness. The system can automatically warm cold caches by pre-fetching popular entries.
The paper emphasizes that cache consistency is a systems problem, not an algorithm problem. The algorithms are straightforward. The difficulty is making them work reliably across thousands of servers with varying failure modes.