How Netflix Scales its API with GraphQL Federation

The Problem with API Gateways

Netflix runs hundreds of microservices. Each service owns a domain: user profiles, content metadata, recommendations, playback. Mobile and TV apps need data from multiple services to render a single screen.

Traditional API gateways aggregate data from backend services. But this creates a bottleneck. Every new field requires gateway changes. The gateway team becomes a coordination point that slows everyone down.

REST APIs have another problem: over-fetching and under-fetching. The /movies/123 endpoint returns a fixed payload. If the client needs only the title, it still downloads the entire response. If the client needs cast information, it makes a second request.

GraphQL's Promise

GraphQL lets clients specify exactly what data they need:

query {
  movie(id: "123") {
    title
    releaseYear
    cast {
      name
    }
  }
}

The server returns only the requested fields. No over-fetching. If the client needs more data, it extends the query rather than making additional requests. No under-fetching.
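To make the field-selection idea concrete, here is a minimal Python sketch that projects a record down to a requested selection set. It is an illustration only: the `MOVIE` record, the `select` helper, and the dictionary-based selection format are assumptions made for this example, not part of any GraphQL library. A real server would run a resolver per field instead of reading from one dictionary.

```python
# Hypothetical in-memory record standing in for backend data.
MOVIE = {
    "id": "123",
    "title": "Example Movie",
    "releaseYear": 2020,
    "runtimeMinutes": 101,
    "cast": [
        {"name": "Actor One", "bio": "placeholder bio"},
        {"name": "Actor Two", "bio": "placeholder bio"},
    ],
}


def select(data, selection):
    """Return only the fields named in `selection`.

    `selection` maps a field name to None (leaf field) or to a nested
    selection (object or list-of-objects field).
    """
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    result = {}
    for field, sub_selection in selection.items():
        value = data[field]
        result[field] = value if sub_selection is None else select(value, sub_selection)
    return result


# Mirrors the query above: title, releaseYear, and cast { name }.
selection = {"title": None, "releaseYear": None, "cast": {"name": None}}
response = select(MOVIE, selection)
```

Fields that were not requested, such as `runtimeMinutes` and each actor's `bio`, never reach the response.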

But running a single GraphQL server creates the same bottleneck as the API gateway. One team owns the schema, and every service change requires coordination.

Federation Architecture

GraphQL Federation solves this by distributing the schema across services. Each service defines the types it owns and can extend types defined by other services.

The Movie service defines:

type Movie @key(fields: "id") {
  id: ID!
  title: String!
  releaseYear: Int!
}

The Cast service extends Movie with cast information:

extend type Movie @key(fields: "id") {
  id: ID! @external
  cast: [Actor!]!
}

A gateway composes these schemas into a unified graph. Clients query the gateway as if it were a single GraphQL server.
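Composition can be pictured as a merge of per-service type definitions. The sketch below is a simplification under stated assumptions: schemas are reduced to plain dictionaries, the `compose` function and the service names `movies` and `cast` are hypothetical, the `Actor` type with a `name` field is inferred from the example query, and the `@external` key field is left out of the Cast service because that service does not own it.

```python
from collections import defaultdict

# Hypothetical subgraph schemas, reduced to {type name: {field name: field type}}.
MOVIE_SERVICE = {"Movie": {"id": "ID!", "title": "String!", "releaseYear": "Int!"}}
CAST_SERVICE = {"Movie": {"cast": "[Actor!]!"}, "Actor": {"name": "String!"}}


def compose(subgraphs):
    """Merge per-service types into one schema and record each field's owner."""
    schema = defaultdict(dict)
    owners = {}
    for service, types in subgraphs.items():
        for type_name, fields in types.items():
            for field, field_type in fields.items():
                schema[type_name][field] = field_type
                owners[(type_name, field)] = service
    return dict(schema), owners


schema, owners = compose({"movies": MOVIE_SERVICE, "cast": CAST_SERVICE})
```

The `owners` map is what lets a gateway decide later which service to ask for each field.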

Query Planning

When a query arrives, the gateway builds an execution plan. Consider:

{
  movie(id: "123") {
    title
    cast {
      name
    }
  }
}

The planner determines:

Fetch movie(id: "123") from the Movie service, selecting title plus the id key field
Use that id to fetch cast from the Cast service

This is a dependency graph. The cast fetch depends on the movie fetch. The gateway executes fetches in topological order, parallelizing where possible.
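The topological ordering described above can be sketched with Python's standard `graphlib` module. This is not how any particular gateway implements its planner. The `stages` helper, the dictionary plan format, and the extra `ratings` fetch are assumptions introduced for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each fetch maps to the set of fetches it depends on.
plan = {
    "movie": set(),      # Movie service: title and id
    "cast": {"movie"},   # Cast service: needs the movie id first
}


def stages(dependencies):
    """Group fetches into stages; fetches within one stage can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        result.append(ready)
        sorter.done(*ready)
    return result


execution_stages = stages(plan)

# A wider plan with a second, hypothetical fetch that also depends on the movie.
wider_stages = stages({"movie": set(), "cast": {"movie"}, "ratings": {"movie"}})
```

In the wider plan, `cast` and `ratings` land in the same stage because neither depends on the other.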

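For $n$ independent fetches, where $T_i$ is the latency of fetch $i$, parallel execution takes roughly as long as the slowest fetch:

$$T_{\text{parallel}} = \max(T_1, T_2, \ldots, T_n)$$

Sequential execution takes the sum of all of them:

$$T_{\text{sequential}} = \sum_{i=1}^{n} T_i$$

Dependent fetches, such as the cast fetch above, still run one after another, so their latencies add.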
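A short worked example may help compare the two formulas. The latencies below are made-up numbers in milliseconds, and `sequential_time` and `parallel_time` are hypothetical helpers. The parallel estimate follows the slowest dependency chain, which reduces to the maximum when all fetches are independent.

```python
def sequential_time(durations):
    """Total latency when fetches run one after another."""
    return sum(durations.values())


def parallel_time(durations, dependencies):
    """Latency when each fetch starts as soon as its dependencies finish."""
    finish = {}

    def finish_time(fetch):
        if fetch not in finish:
            deps = dependencies.get(fetch, ())
            start = max((finish_time(dep) for dep in deps), default=0)
            finish[fetch] = start + durations[fetch]
        return finish[fetch]

    return max(finish_time(fetch) for fetch in durations)


# Three independent fetches with hypothetical latencies in milliseconds.
independent = {"a": 40, "b": 25, "c": 60}
independent_parallel = parallel_time(independent, {})   # max(40, 25, 60)
independent_sequential = sequential_time(independent)   # 40 + 25 + 60

# The movie -> cast chain cannot overlap, so parallelism gains nothing here.
chained = {"movie": 30, "cast": 45}
chained_parallel = parallel_time(chained, {"cast": {"movie"}})
```

With these numbers the independent fetches take 60 ms in parallel versus 125 ms sequentially, while the dependent chain takes 75 ms either way.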

Entity Resolution

The @key directive defines how to look up an entity. When the gateway needs to fetch cast for a movie, it calls the Cast service with a representation:

{ "__typename": "Movie", "id": "123" }

The Cast service implements a resolver that takes these representations and returns the extended fields. This is the "entity resolution" pattern.

Each service can resolve entities independently. The gateway doesn't need to know how the Cast service stores its data or how it relates movies to actors internally.
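On the Cast service side, entity resolution might look roughly like the sketch below. In Apollo Federation the gateway sends representations through the `_entities` query field; the Python names here (`CAST_BY_MOVIE_ID`, `resolve_movie_reference`, `REFERENCE_RESOLVERS`, `resolve_entities`) and the in-memory storage are assumptions for illustration, not a real server framework.

```python
# Hypothetical Cast service storage: movie id -> list of actors.
CAST_BY_MOVIE_ID = {
    "123": [{"name": "Actor One"}, {"name": "Actor Two"}],
}


def resolve_movie_reference(representation):
    """Return the Cast service's fields for one Movie representation."""
    movie_id = representation["id"]
    return {
        "__typename": "Movie",
        "id": movie_id,
        "cast": CAST_BY_MOVIE_ID.get(movie_id, []),
    }


# Hypothetical dispatch table: one reference resolver per entity type.
REFERENCE_RESOLVERS = {"Movie": resolve_movie_reference}


def resolve_entities(representations):
    """Resolve a batch of representations, preserving input order."""
    return [REFERENCE_RESOLVERS[rep["__typename"]](rep) for rep in representations]


entities = resolve_entities([{"__typename": "Movie", "id": "123"}])
```

The gateway only supplies the key fields. How the service maps an id to actors stays private to the service.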

Schema Composition Rules

Federation enforces rules to ensure the composed schema is valid:

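Each field on a type is owned by exactly one service; other services contribute to the type only by extending it with new fields
A service that extends a type must mark the fields it references from the owning service with @external
Key fields must be resolvable by the service that defines the type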

These rules prevent conflicts and ensure every query has a clear execution path.

At composition time, the gateway validates that all type references resolve. If Service A references a type that Service B was supposed to define but didn't, composition fails. This catches integration errors before deployment.
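The type-reference check can be sketched as a scan over the composed schema. This is a simplified illustration: the dictionary schema format and the `named_type` and `unresolved_references` helpers are assumptions, and real composition covers far more than this one rule.

```python
import re

# Scalars built into GraphQL, which never need a service to define them.
BUILT_IN_SCALARS = {"ID", "String", "Int", "Float", "Boolean"}


def named_type(field_type):
    """Strip list and non-null wrappers, so '[Actor!]!' becomes 'Actor'."""
    return re.sub(r"[\[\]!]", "", field_type)


def unresolved_references(schema):
    """Return sorted (type, field, missing type) triples for undefined types."""
    defined = set(schema) | BUILT_IN_SCALARS
    missing = []
    for type_name, fields in schema.items():
        for field, field_type in fields.items():
            target = named_type(field_type)
            if target not in defined:
                missing.append((type_name, field, target))
    return sorted(missing)


# Hypothetical composed schema in which no service defined Actor.
broken = {"Movie": {"id": "ID!", "cast": "[Actor!]!"}}
# The same schema once some service contributes the Actor type.
fixed = {**broken, "Actor": {"name": "String!"}}

broken_errors = unresolved_references(broken)
fixed_errors = unresolved_references(fixed)
```

A non-empty result would cause composition to fail before anything is deployed.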

Caching and Performance

GraphQL responses are harder to cache than REST responses. Every query can request different fields, typically through a single endpoint, so the URL alone cannot serve as the cache key.

Netflix addresses this with:

Persisted queries: Clients send a query hash instead of the full query. The gateway maps hashes to queries. This enables caching by query hash.
Partial caching: Individual entity resolutions can be cached. The cache key is (type, id, fields).
DataLoader pattern: Within a single request, multiple entity lookups are batched. If the query needs movies 1, 2, and 3, one batched call fetches all three.
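The three techniques above can be sketched in a few lines of Python. Everything here is illustrative: the SHA-256 hash is an assumed choice, `register_query`, `entity_cache_key`, `BatchLoader`, and `fetch_movies` are hypothetical names, and the loader is a synchronous simplification rather than the API of the actual DataLoader library.

```python
import hashlib

# Persisted queries: the gateway keeps a hash -> query text mapping.
PERSISTED = {}


def register_query(query):
    """Store a query under its SHA-256 hash (the hash choice is an assumption)."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    PERSISTED[digest] = query
    return digest


def entity_cache_key(type_name, entity_id, fields):
    """Build the (type, id, fields) key; fields are sorted so order is irrelevant."""
    return (type_name, entity_id, tuple(sorted(fields)))


class BatchLoader:
    """Collects keys during a request and fetches them with one call."""

    def __init__(self, batch_fetch):
        self.batch_fetch = batch_fetch  # function: list of keys -> {key: value}
        self.pending = []
        self.cache = {}

    def load(self, key):
        """Queue a key unless it is already cached or queued."""
        if key not in self.cache and key not in self.pending:
            self.pending.append(key)

    def dispatch(self):
        """Fetch all queued keys in a single batched call."""
        if self.pending:
            self.cache.update(self.batch_fetch(self.pending))
            self.pending = []
        return self.cache


calls = []


def fetch_movies(ids):
    # Hypothetical backend call; records each invocation for the demo.
    calls.append(list(ids))
    return {movie_id: {"id": movie_id} for movie_id in ids}


query_hash = register_query('{ movie(id: "123") { title } }')
loader = BatchLoader(fetch_movies)
for movie_id in ["1", "2", "3", "2"]:
    loader.load(movie_id)
movies = loader.dispatch()
```

Four lookups, one of them a duplicate, result in a single backend call for movies 1, 2, and 3.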

Team Autonomy

The key benefit is team autonomy. The recommendations team can add a recommendations field to the User type without coordinating with the user profile team. They deploy their service, register the schema extension, and the gateway incorporates it.

Schema changes go through a registry that validates composition rules. If a change would break composition, the registry rejects it. This provides guardrails without requiring human coordination.
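A registry of this kind can be pictured as a store that composes a candidate schema before accepting it. The sketch below checks only one rule, that each field has a single owner. The `SchemaRegistry` and `CompositionError` classes and the `profiles` and `search` service names are hypothetical, and each service lists only the fields it owns, so @external key fields are left out.

```python
class CompositionError(Exception):
    """Raised when a proposed schema would break composition."""


class SchemaRegistry:
    def __init__(self):
        # service name -> {type name: set of field names the service owns}
        self.subgraphs = {}

    def _check(self, subgraphs):
        """Reject any field that is owned by more than one service."""
        owners = {}
        for service, types in subgraphs.items():
            for type_name, fields in types.items():
                for field in fields:
                    key = (type_name, field)
                    if key in owners:
                        raise CompositionError(
                            f"{type_name}.{field} defined by both "
                            f"{owners[key]} and {service}"
                        )
                    owners[key] = service

    def publish(self, service, types):
        """Accept a service's schema only if the combined graph still composes."""
        candidate = {**self.subgraphs, service: types}
        self._check(candidate)  # raises before any state changes
        self.subgraphs = candidate


registry = SchemaRegistry()
registry.publish("profiles", {"User": {"id", "name"}})
registry.publish("recommendations", {"User": {"recommendations"}})

try:
    # A hypothetical third service tries to redefine a field it does not own.
    registry.publish("search", {"User": {"name"}})
    rejected = False
except CompositionError:
    rejected = True
```

The recommendations extension is accepted without involving the profiles team, while the conflicting change is rejected and the registered graph stays unchanged.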

Netflix reports that federation reduced their API change lead time from weeks to days. Teams iterate independently within the boundaries enforced by the schema registry.

Tradeoffs

Federation adds complexity. The gateway sits on the critical path of every request. Query planning adds overhead before any fetch starts. Debugging a single query can span multiple services.

Netflix invested heavily in observability. Each query is traced across services. Performance regressions are detected automatically. The tooling cost is significant but necessary at their scale.

For smaller systems, a single GraphQL server might be simpler. Federation pays off when team coordination becomes the bottleneck, typically at dozens of services and teams.