How Discord Stores Trillions of Messages

The Numbers

Discord stores over 4 trillion messages. Users send billions more every day. Peak traffic is unpredictable because gaming communities can spike suddenly when a popular streamer goes live or a game launches.

The latency requirement is strict. When you scroll up in a channel, messages must appear instantly. Discord targets p99 read latency under 15 milliseconds. Write latency matters less because users don't notice if their sent message takes 50ms to persist.

Why Cassandra Initially

Discord chose Cassandra for its write performance and linear scalability. Cassandra handles high write throughput well because each write is appended to a commit log and applied to an in-memory structure (the memtable), which is flushed to disk as an SSTable asynchronously.

The data model uses channel_id as the partition key:

PRIMARY KEY ((channel_id), message_id)

All messages in a channel land in the same partition. Fetching recent messages is a sequential read of one partition. This works beautifully for small channels.
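As a rough illustration of that layout, the sketch below models the table in memory: one sorted list per channel_id, ordered by message_id. The names partitions, insert_message, and recent_messages are hypothetical and stand in for the database, not for any Discord or Cassandra API.

```python
from bisect import insort
from collections import defaultdict

# Hypothetical in-memory stand-in for a table declared with
# PRIMARY KEY ((channel_id), message_id): one partition per channel,
# rows inside a partition kept sorted by message_id.
partitions = defaultdict(list)  # channel_id -> sorted [(message_id, body)]

def insert_message(channel_id, message_id, body):
    """Place the row in its channel's partition, keeping clustering order."""
    insort(partitions[channel_id], (message_id, body))

def recent_messages(channel_id, limit):
    """Return up to `limit` rows from one partition, newest first."""
    rows = partitions.get(channel_id, [])
    return rows[::-1][:limit]

insert_message(1, 30, "third")
insert_message(1, 10, "first")
insert_message(1, 20, "second")
print(recent_messages(1, 2))
```

Every row for a channel sits in one list, so a read is a single contiguous slice. The same property means the list grows without bound for a busy channel.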

The Hot Partition Problem

Some Discord channels have millions of messages. The #general channel of a popular server might receive thousands of messages per minute. All writes hit one partition, which lives on a few Cassandra nodes.

Cassandra partitions have a practical size limit. A commonly cited guideline is to keep each partition under roughly 100MB. Discord channels exceeded this by orders of magnitude.

Reading a hot partition requires scanning through tombstones (markers for deleted messages). Discord's message retention policies create many tombstones. A query that should return 50 messages might scan 100,000 tombstones first.
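The cost of tombstones can be sketched with a toy scan. This is a simplified model, not Cassandra's actual read path: rows are assumed to arrive newest first, a body of None stands for a tombstone, and read_live is a hypothetical helper.

```python
def read_live(rows, limit):
    """Scan rows in order, skipping tombstones, until `limit` live rows are found.

    rows: iterable of (message_id, body); body is None for a tombstone.
    Returns (live_rows, rows_scanned).
    """
    live, scanned = [], 0
    for message_id, body in rows:
        scanned += 1
        if body is None:
            continue  # tombstone: costs a scan step but returns nothing
        live.append((message_id, body))
        if len(live) == limit:
            break
    return live, scanned

# 100,000 deleted messages sitting in front of 50 live ones.
rows = [(i, None) for i in range(100_000)]
rows += [(i, "hello") for i in range(100_000, 100_050)]
live, scanned = read_live(rows, 50)
print(len(live), scanned)
```

The query returns 50 rows but touches 100,050, which is why read latency tracks the number of tombstones rather than the size of the result.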

Background maintenance also struggled. Compaction (merging SSTables and purging expired tombstones) caused latency spikes, and so did JVM garbage collection pauses, since Cassandra runs on the JVM. During these periods, reads would occasionally take seconds instead of milliseconds.

Time-Bucketed Partitions

The fix was to bucket messages by time:

PRIMARY KEY ((channel_id, bucket), message_id)

Each bucket covers a time range (Discord uses roughly 10 days). A channel with 5 years of history has about 180 buckets instead of one massive partition.
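A minimal sketch of the bucket calculation, assuming each message carries a millisecond timestamp. The epoch value (2015-01-01 UTC) and the function names are assumptions made for illustration, not details confirmed by this article.

```python
BUCKET_SIZE_MS = 10 * 24 * 60 * 60 * 1000   # 10 days in milliseconds
EPOCH_MS = 1_420_070_400_000                # assumed epoch: 2015-01-01T00:00:00Z

def bucket_for(timestamp_ms: int) -> int:
    """Map a message timestamp to the integer bucket stored in the partition key."""
    return (timestamp_ms - EPOCH_MS) // BUCKET_SIZE_MS

def buckets_between(start_ms: int, end_ms: int) -> range:
    """All buckets touched by the inclusive time range [start_ms, end_ms]."""
    return range(bucket_for(start_ms), bucket_for(end_ms) + 1)

five_years_ms = 5 * 365 * 24 * 60 * 60 * 1000
print(len(buckets_between(EPOCH_MS, EPOCH_MS + five_years_ms)))  # 183
```

Five years is 1,825 days, so the range spans 183 ten-day buckets, in line with the figure of about 180 above.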

Fetching recent messages now reads from the current bucket. Scrolling far back requires reading older buckets sequentially. Since users rarely scroll far back, most queries hit small, recent buckets.

The bucket boundary creates edge cases. A query spanning two buckets requires two partition reads. Discord's query layer handles this transparently.
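One way a query layer could hide the boundary is to walk buckets from newest to oldest until enough rows are collected. The sketch below assumes a hypothetical store, a dict keyed by (channel_id, bucket) holding message IDs in ascending order. It is not Discord's implementation.

```python
def fetch_recent(store, channel_id, newest_bucket, limit, oldest_bucket=0):
    """Collect up to `limit` message IDs, newest first, across bucket boundaries.

    store: dict mapping (channel_id, bucket) -> list of message IDs, ascending.
    """
    results = []
    bucket = newest_bucket
    while bucket >= oldest_bucket and len(results) < limit:
        rows = store.get((channel_id, bucket), [])  # one partition read
        needed = limit - len(results)
        results.extend(reversed(rows[-needed:]))    # newest rows of this bucket
        bucket -= 1                                 # step back to the older bucket
    return results

store = {
    (1, 5): [50, 51],       # current bucket holds only two messages
    (1, 4): [40, 41, 42],   # previous bucket
}
print(fetch_recent(store, 1, newest_bucket=5, limit=4))  # [51, 50, 42, 41]
```

The caller asks for four messages and gets them in order. The second partition read happens only because the current bucket came up short.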

Migration to ScyllaDB

Despite time-bucketing, Cassandra's tail latencies remained problematic. GC pauses and compaction spikes caused p99 latency to exceed targets during peak load.

ScyllaDB is a Cassandra-compatible database written in C++ with a shard-per-core architecture. Each CPU core owns a shard of data and shares no memory with other cores, which eliminates most locking overhead. Because ScyllaDB is native code with no garbage-collected runtime, it also has no GC pauses.

The migration required no data model changes. ScyllaDB speaks the Cassandra protocol. Discord ran dual writes to both databases during migration, gradually shifting reads to ScyllaDB.
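The dual-write pattern can be sketched as a thin wrapper over two stores. Here the stores are plain dicts, and the class name and the read_from_new_fraction knob are hypothetical. The sketch omits backfilling historical data and validating that the two stores agree, both of which a real migration needs.

```python
import random

class DualWriteStore:
    """Write to both stores; route a configurable fraction of reads to the new one."""

    def __init__(self, old, new, read_from_new_fraction=0.0, rng=None):
        self.old = old
        self.new = new
        self.read_from_new_fraction = read_from_new_fraction
        self.rng = rng or random.Random()

    def write(self, key, value):
        # Every write lands in both databases during the migration window.
        self.old[key] = value
        self.new[key] = value

    def read(self, key):
        # rng.random() is in [0, 1): fraction 0.0 always reads old, 1.0 always new.
        use_new = self.rng.random() < self.read_from_new_fraction
        return (self.new if use_new else self.old).get(key)

old_db, new_db = {}, {}
messages = DualWriteStore(old_db, new_db)
messages.write(("channel-1", 42), "hello")
messages.read_from_new_fraction = 1.0   # fully cut over reads
print(messages.read(("channel-1", 42)))  # hello
```

Raising the fraction in small steps limits the blast radius if the new store misbehaves, and setting it back to 0.0 is an immediate rollback for reads.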

Results were dramatic:

p99 read latency dropped from 40-125ms to 15ms
p99 write latency dropped from 5-70ms to 5ms
Tail latency variance nearly disappeared

Data Modeling Lessons

Discord's experience reinforces fundamental distributed database principles:

Partition key determines distribution. All data with the same partition key lives together. This is great for locality but dangerous for hot spots.

Time-series data needs time-bucketing. Unbounded growth in a single partition eventually hits limits. Bucket by time to bound partition size.

Tombstones are expensive. Deletes in LSM-tree databases create tombstones that persist until a grace period has elapsed and compaction purges them. Design retention policies around compaction schedules.

Read Path Optimization

Discord caches heavily. A request for channel messages first checks an in-memory cache (Redis cluster). Cache hit ratio is high because users tend to read recent messages in channels they've recently visited.

On cache miss, the query hits ScyllaDB:

Determine which bucket(s) contain the requested time range
Query each bucket in parallel
Merge results and cache them
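The three steps above can be sketched as a cache-aside read. This is an illustration under stated assumptions: the cache is a plain dict rather than Redis, query_bucket is a hypothetical callable standing in for a single-partition database query, and the cache key scheme is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def read_messages(cache, query_bucket, channel_id, buckets):
    """Return message IDs for the given buckets, newest first, using cache-aside."""
    key = (channel_id, tuple(buckets))
    if key in cache:
        return cache[key]                      # cache hit: no database work

    # Cache miss: query each bucket (one partition each) in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(buckets))) as pool:
        per_bucket = list(pool.map(lambda b: query_bucket(channel_id, b), buckets))

    # Merge the per-bucket results into one newest-first list and cache it.
    merged = sorted((m for rows in per_bucket for m in rows), reverse=True)
    cache[key] = merged
    return merged

fake_db = {(1, 4): [40, 41], (1, 5): [50]}
db_calls = []

def query_bucket(channel_id, bucket):
    db_calls.append(bucket)
    return fake_db.get((channel_id, bucket), [])

cache = {}
print(read_messages(cache, query_bucket, 1, [5, 4]))  # [50, 41, 40]
print(read_messages(cache, query_bucket, 1, [5, 4]))  # served from cache
print(len(db_calls))                                  # 2
```

The second call never reaches query_bucket, so the database sees two partition reads in total rather than four.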

For popular channels, the cache absorbs most read traffic. ScyllaDB handles the long tail of less popular channels and cache misses.

\text{DB Load} = \text{Total Reads} \times (1 - \text{Cache Hit Rate})

A 95% cache hit rate means only 5% of reads reach the database, a 20x reduction in read load.
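The arithmetic is worth spelling out, because the reduction factor is very sensitive to small changes in hit rate. The function names and the read volume below are illustrative, not Discord figures.

```python
def db_reads(total_reads, cache_hit_rate):
    """Reads that fall through to the database: total * (1 - hit rate)."""
    return total_reads * (1 - cache_hit_rate)

def load_reduction_factor(cache_hit_rate):
    """How many times smaller database load is than total read traffic."""
    return 1 / (1 - cache_hit_rate)

for rate in (0.90, 0.95, 0.99):
    print(rate, round(db_reads(1_000_000, rate)), round(load_reduction_factor(rate), 2))
```

Going from 95% to 99% cuts database reads by a further 5x. The reverse also holds: a dip from 95% to 90% doubles the load the database must absorb.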

Write Path

Writes go directly to ScyllaDB, not through cache. The cache is invalidated (or updated) asynchronously. This means a user might not immediately see their own message on a different device, but the delay is under a second.

Messages are immutable after creation (edits create new versions). This simplifies caching because cached data doesn't become stale, only incomplete.
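A minimal sketch of this write path, assuming dict-backed stores and a hypothetical in-process queue in place of whatever mechanism propagates cache updates in production. The class and method names are invented for illustration.

```python
from collections import deque

class WritePath:
    """Persist to the database first; apply cache updates later, out of band."""

    def __init__(self, db, cache):
        self.db = db            # channel_id -> list of (message_id, body)
        self.cache = cache      # channel_id -> list of (message_id, body)
        self.pending = deque()  # cache updates not yet applied

    def write(self, channel_id, message_id, body):
        # The database write is the only step on the synchronous path.
        self.db.setdefault(channel_id, []).append((message_id, body))
        self.pending.append((channel_id, message_id, body))

    def drain_cache_updates(self):
        # Messages are immutable, so catching up is append-only:
        # cached rows are never rewritten, the cache is only missing new ones.
        while self.pending:
            channel_id, message_id, body = self.pending.popleft()
            if channel_id in self.cache:        # only touch channels already cached
                self.cache[channel_id].append((message_id, body))

db, cache = {}, {1: []}
path = WritePath(db, cache)
path.write(1, 10, "hi")
print(cache[1])            # [] : cache is incomplete, not wrong
path.drain_cache_updates()
print(cache[1])            # [(10, 'hi')]
```

Between write and drain, a reader served from cache sees an older but still correct view. That is the short delay described above.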

Reverse Index for Search

The channel-partitioned model is great for reading messages in order but terrible for search. "Find all messages containing 'bug' in this server" would require scanning every message.

Discord builds a separate search index using Elasticsearch. Messages flow through Kafka to both ScyllaDB (primary storage) and Elasticsearch (search index). The search index is eventually consistent with primary storage.

This separation of concerns keeps the primary storage model simple while enabling full-text search through a purpose-built system.
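To see why a separate index helps, consider a toy inverted index that maps each term to the messages containing it. This is a conceptual sketch only: it is not how Elasticsearch is implemented or queried, and TinySearchIndex is a hypothetical name.

```python
import re
from collections import defaultdict

class TinySearchIndex:
    """Toy inverted index: term -> set of (channel_id, message_id)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, channel_id, message_id, body):
        # Lowercase and split into simple word tokens.
        for term in set(re.findall(r"[a-z0-9']+", body.lower())):
            self.postings[term].add((channel_id, message_id))

    def search(self, term):
        # One lookup, independent of how messages are partitioned in storage.
        return sorted(self.postings.get(term.lower(), set()))

idx = TinySearchIndex()
idx.index(1, 10, "Found a bug in voice chat")
idx.index(2, 11, "Bug fixed!")
idx.index(1, 12, "hello everyone")
print(idx.search("bug"))  # [(1, 10), (2, 11)]
```

The lookup spans channels without scanning any partition, which is exactly what the channel-keyed primary store cannot do cheaply.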