Using Apache Kafka to Process 1 Trillion Messages

Scale Context

Cloudflare processes HTTP requests for millions of websites. Every request generates telemetry: logs, analytics, security events. At peak, their infrastructure handles 1 trillion messages per day through Kafka.

That works out to roughly:

\frac{10^{12} \text{ messages}}{86400 \text{ seconds}} \approx 11.5 \text{ million messages/second}

Not all messages are created equal. Some carry kilobytes of log data, others are small metric updates. The aggregate bandwidth through Kafka exceeds 1 Tbps at peak.

Kafka Cluster Topology

Cloudflare runs multiple Kafka clusters, each dedicated to a specific use case. Analytics data flows to one cluster, security logs to another, customer-facing logs to a third. This separation prevents a misbehaving producer in one domain from affecting others.

Each cluster runs dozens of brokers. Replication factor is typically 3, meaning each partition exists on three brokers. For a topic with $P$ partitions and replication factor $R$ , total partition-replicas is $P \times R$ .

The number of partitions determines parallelism. If you have 100 partitions, you can have up to 100 consumers processing in parallel. Cloudflare uses partition counts in the hundreds for high-throughput topics.

Producer Batching

Sending one message at a time incurs massive overhead. Each send requires network round trips, broker disk writes, and acknowledgment delays. Batching amortizes this overhead across many messages.

Kafka producers accumulate messages in memory before sending. Two parameters control batching:

batch.size: Maximum bytes per batch (default 16KB)
linger.ms: Maximum time to wait for batch to fill (default 0)

For throughput, Cloudflare tunes linger.ms to allow batches to grow before sending. A 5ms linger might batch 1000 messages that would otherwise require 1000 network round trips.

The tradeoff is latency. A message might wait up to linger.ms before transmission. For analytics data where seconds of delay are acceptable, aggressive batching makes sense.

Compression

Kafka supports compression at the batch level. Before sending, the producer compresses the entire batch. The broker stores compressed data and consumers decompress on read.

Compression ratios for log data are impressive. LZ4 typically achieves 4-5x compression on JSON logs. For 1 trillion messages, that translates to hundreds of terabytes saved in storage and network bandwidth.

\text{Effective Throughput} = \text{Network Bandwidth} \times \text{Compression Ratio}

Cloudflare uses LZ4 for its balance of compression ratio and CPU efficiency. Zstd achieves better ratios but costs more CPU cycles.

Consumer Groups and Partition Assignment

A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer in the group. If you have 100 partitions and 10 consumers, each consumer handles 10 partitions.

When consumers join or leave, Kafka rebalances partition assignments. During rebalancing, consumption pauses. Frequent rebalances hurt throughput.

Cloudflare minimizes rebalances by:

Running stable consumer deployments (rolling updates, not mass restarts)
Setting session.timeout.ms high enough to tolerate GC pauses
Using static group membership where consumers have persistent identities

Exactly-Once Semantics

By default, Kafka provides at-least-once delivery. A producer crash after sending but before receiving acknowledgment causes a duplicate on retry. Consumers that crash after processing but before committing offsets reprocess messages.

Kafka's exactly-once feature uses transactions. A producer begins a transaction, sends messages to multiple partitions, then commits. Either all messages are visible or none are.

For consumers, exactly-once requires that offset commits and message processing occur in the same transaction. This works when the output is also a Kafka topic (consume-transform-produce pattern).

Cloudflare uses exactly-once for pipelines where duplicates cause incorrect aggregations. For logging use cases, at-least-once suffices and avoids transaction overhead.

Monitoring at Scale

With trillion-message volumes, small inefficiencies compound. A 1% increase in message size costs 10 billion extra bytes daily. Cloudflare monitors:

Lag: How far behind are consumers? Lag measured in messages or seconds.
Under-replicated partitions: Brokers that haven't caught up with the leader.
Request latency percentiles: p50, p99, p999 for produce and fetch requests.

Consumer lag is particularly important. If a consumer falls behind during a traffic spike and can't catch up, you have a capacity problem. Lag should trend toward zero during normal operation.

Key Takeaways

Kafka scales linearly with partitions and brokers. Add more partitions for parallelism, add more brokers for storage and throughput. But linear scaling requires careful attention to producer batching, compression, and consumer group stability.

The architecture separates concerns cleanly. Producers don't know about consumers. Brokers don't parse message content. This separation enables Cloudflare to run heterogeneous pipelines on shared infrastructure.