GraphQL Lessons After 4 Years: Scaling Subscriptions

Scaling is relative. When Uber builds a scalable metrics service, they build a proprietary database to resolve their queries. When Facebook scales a live feed, they build it to support millions of connected clients. This post is for the rest of us. Subscriptions in GraphQL have taken a back seat, and it’s my goal to change that.

The most popular subscription client is abandonware and uses an insecure, inefficient protocol. That means those of us who want to build a real-time app are forced to roll our own. So if you have dreams of hitting 10,000 connected clients, but you’re also terrified that it might cause your server to catch fire, read on. We’ll walk through the GraphQL reference implementation, tear it apart to maintain a stateless execution service, and use some clever tricks to keep your response times way lower than any hosted GraphQL service out there.

GraphQL Subscription Basics

Behind the scenes, the subscribe function does only two things. First, it creates a source event stream. Then, for every source event, it maps the event to a response. It’s simpler than it seems. A source event is nothing more than what gets posted to the PubSub. That event then gets passed to the execute function and out pops the response, which you can send to the client.
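In graphql-js terms, that two-step process looks roughly like the sketch below. The names runSubscription and sendToClient are mine, and I’m assuming createSourceEventStream hands back an async iterable (it returns a plain ExecutionResult if the subscription itself fails; newer graphql-js versions also accept a single args object instead of positional arguments):

const {execute, createSourceEventStream} = require('graphql')

const runSubscription = async (schema, document, context, variables, sendToClient) => {
  // step 1: resolve the subscription field to get the source event stream
  const sourceStream = await createSourceEventStream(schema, document, undefined, context, variables)
  // step 2: map every source event to a full GraphQL response
  for await (const sourceEvent of sourceStream) {
    const response = await execute({
      schema,
      document,
      rootValue: sourceEvent, // the source event becomes the root value
      contextValue: context,
      variableValues: variables
    })
    sendToClient(response)
  }
}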

There’s a good reason for two streams. Imagine Alice triggers a mutation like CreateTask and we want to tell Bob about that. So, the mutation pushes the source event to the PubSub, but what should the source event include? To answer that question, we need to know what Bob requested. Maybe Bob wants the entire Task, or maybe just who made the task. Heck, he may just want to know the updated total number of tasks! Since we can’t be sure until we look at Bob’s subscription (which might be on a different server), the best we can do is include the bare minimum: a taskId and the payload type to resolve the response, like CreateTaskPayload. (Pro tip: sharing a payload type between mutations and subscriptions greatly simplifies your business logic; see The Hybrid Strategy for GraphQL Subscriptions.)
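To make that concrete, here’s a rough sketch of what the CreateTask mutation resolver might publish. The channel name, the pubsub client, and the insertTask helper are placeholders; the extra mutatorId and dataLoaderId fields will make sense in the gist further down:

const createTask = async (_source, {input}, {pubsub, socketId, dataLoaderId}) => {
  const task = await insertTask(input) // persist the task however you normally would
  // the source event is the bare minimum every subscriber needs
  pubsub.publish('taskSubscription', {
    type: 'CreateTaskPayload', // which payload type to resolve
    taskId: task.id,
    mutatorId: socketId, // lets listeners ignore their own mutations
    dataLoaderId // lets the execution service reuse the mutation's DataLoader
  })
  return {taskId: task.id}
}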

The ability for a single event to transform into a bespoke response for each subscriber is powerful, and something GraphQL offers out of the box. The problem arises when we begin to scale — our GraphQL execution service is tightly coupled to the WebSockets and subscriptions that rely on it. In a perfect world, our GraphQL execution service would be stateless, and our subscription service would maintain the subscriptions and WebSocket connection. So let’s build it.

Creating a Stateless Execution Service
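The execution service doesn’t need to know anything about sockets or subscriptions; it receives everything required to run a query and returns the result. Here’s a minimal sketch of such an endpoint, assuming Express, with auth checks and DataLoader reuse elided (the request shape is an assumption that mirrors the gist further down):

const express = require('express')
const {parse, execute} = require('graphql')
const schema = require('./schema') // your GraphQL schema

const app = express()
app.use(express.json())

app.post('/graphql', async (req, res) => {
  // everything needed to execute arrives in the request; no state lives here
  const {query, variables, rootValue, authToken, socketId, dataLoaderId} = req.body
  const result = await execute({
    schema,
    document: parse(query),
    rootValue,
    variableValues: variables,
    // resolvers can use dataLoaderId to look up a shared DataLoader (see Tip #2)
    contextValue: {authToken, socketId, dataLoaderId}
  })
  res.json(result)
})

app.listen(3000)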

Creating a GraphQL Subscription Service

The life of a subscription begins with emitting an event to the PubSub (e.g. Redis, RabbitMQ) from a mutation. From there, the PubSub listener needs to look up a list of GraphQL subscriptions and republish the event to each one. It’s a PubSub inside a PubSub, and you can build one in 50 LOCs. Note that while some npm packages offer this functionality, I have found them to be largely inefficient (e.g. unnecessary lookup tables) and, at times, prone to leaking memory. In my experience, if it’s 100 LOCs or less, it’s better to build than buy.
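Here’s a minimal sketch of that inner PubSub. I’m assuming an ioredis subscriber on the outside, and the class and channel names are illustrative:

class LocalPubSub {
  constructor() {
    this.handlersByChannel = new Map()
  }
  subscribe(channel, handler) {
    if (!this.handlersByChannel.has(channel)) {
      this.handlersByChannel.set(channel, new Set())
    }
    this.handlersByChannel.get(channel).add(handler)
    // hand back an unsubscribe function so each GraphQL subscription can clean up
    return () => this.handlersByChannel.get(channel).delete(handler)
  }
  publish(channel, event) {
    const handlers = this.handlersByChannel.get(channel)
    if (!handlers) return
    handlers.forEach((handler) => handler(event))
  }
}

// fan every external event out to the local subscriptions
const Redis = require('ioredis')
const localPubSub = new LocalPubSub()
const subscriber = new Redis()
subscriber.subscribe('taskSubscription')
subscriber.on('message', (channel, message) => {
  localPubSub.publish(channel, JSON.parse(message))
})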

The second trick is to convert that event callback handler into an async iterator. If this looks a little intimidating, don’t worry. The article Understand Async Iterators Without Really Trying teaches you how to do it in 5 minutes using a simple click listener. Why not NodeJS streams or observables? Simply because async iterators are native ECMAScript, which is why graphql-js chose to use them.
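For the impatient, here’s one queue-based way to turn a callback subscription into an async iterator. The helper name is mine, and it pairs with the LocalPubSub sketch above:

const callbackToAsyncIterator = (listen) => {
  const pushQueue = [] // events waiting for a next() call
  const pullQueue = [] // next() calls waiting for an event
  let listening = true
  const push = (value) => {
    if (pullQueue.length > 0) {
      pullQueue.shift()({value, done: false})
    } else {
      pushQueue.push(value)
    }
  }
  const unsubscribe = listen(push)
  return {
    next() {
      if (!listening) return Promise.resolve({value: undefined, done: true})
      if (pushQueue.length > 0) {
        return Promise.resolve({value: pushQueue.shift(), done: false})
      }
      return new Promise((resolve) => pullQueue.push(resolve))
    },
    return() {
      // called when the client unsubscribes or disconnects
      listening = false
      if (unsubscribe) unsubscribe()
      pullQueue.forEach((resolve) => resolve({value: undefined, done: true}))
      pullQueue.length = 0
      return Promise.resolve({value: undefined, done: true})
    },
    [Symbol.asyncIterator]() {
      return this
    }
  }
}

// the source event stream for one GraphQL subscription
const sourceStream = callbackToAsyncIterator((push) =>
  localPubSub.subscribe('taskSubscription', push)
)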

Now, instead of calling subscribe, which uses the default execute function, we call createSourceEventStream, which will return the source event stream we just created. While the function may seem esoteric, this is exactly why it was built. The final step is to transform the source event into a response by asynchronously calling our stateless execution service. While the full implementation is less than 50 LOCs, the gist is even shorter:

async next() {
  // wait for a new source event from the PubSub
  const sourceIter = await this.sourceStream.next()
  // if the event is "done" then there's no value
  if (sourceIter.done) return sourceIter
  // include the socketId of the user that triggered the mutation
  // and the dataLoaderId so we can reuse it
  const {mutatorId, dataLoaderId, rootValue} = sourceIter.value
  // include everything needed to execute the query
  const {socketId, authToken, query, variables} = this.context
  // ignore the listener if they triggered the mutation
  if (mutatorId === socketId) return this.next()
  const result = await callStatelessExecuteService({
    query,
    authToken,
    dataLoaderId,
    variables,
    rootValue,
    socketId
  })
  return {done: false, value: result}
}

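For completeness, here’s a rough sketch of how the pieces might be wired together when a client first subscribes. The ResponseStream constructor shown here is just one way to package the next() method above:

const {parse, createSourceEventStream} = require('graphql')

const handleSubscribe = async ({schema, query, variables, context}) => {
  const document = parse(query)
  // returns the async iterator our resolver created, or an ExecutionResult on failure
  const sourceStream = await createSourceEventStream(schema, document, undefined, context, variables)
  if (!sourceStream[Symbol.asyncIterator]) return sourceStream // surface immediate errors
  // ResponseStream holds the next() method from the gist above
  return new ResponseStream(sourceStream, context)
}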
And there we have it! The ResponseStream is calling our stateless execution service instead of defaulting to an execute call. As we grow the number of stateless execution services, we can put them behind their own reverse proxy (or even use a hosted service). We can also independently scale our socket servers, which will become critical as we strive to reduce intra-team latency for our growing international user base. But before we do that, we’ll want to make sure we squeeze all the efficiency we can out of each service.

Maximizing Efficiency

Tip #1: Use a DataLoader

Tip #2: Reuse DataLoader for Subscriptions

Tip #3: Lazily Instantiate DataLoaders

Tip #4: Use graphql-jit

Tip #5: Don’t Monitor so Gosh-Darn Much

Tip #6: Use Persisted Queries

Tip #7: Use Execution Results to Update Subscription State

Tip #8: Popular Doesn’t Mean Better

Wrapping Up
