GraphQL Lessons After 4 Years: Scaling Subscriptions

Mar 2, 2020

Scaling is relative. When Uber builds a scalable metric service, they build a proprietary database to resolve their queries. When Facebook scales a live feed, they build it to support millions of connected clients. This post is for the rest of us. Subscriptions in GraphQL have taken a back seat and it’s my goal to change that.

The most popular subscription client is abandonware and uses an insecure, inefficient protocol. That means those of us who want to build a real-time app are forced to roll our own. So if you have dreams of hitting 10,000 connected clients, but you’re also terrified that it might cause your server to catch fire, read on. We’ll walk through the GraphQL reference implementation, tear it apart to maintain a stateless execution service, and use some clever tricks to keep your response times way lower than any hosted GraphQL service out there.

GraphQL Subscription Basics

The lifecycle of a GraphQL query is pretty simple. The server receives a query, parses it, validates it, then resolves it. Using graphql-js, a basic implementation simply calls the graphql function. A slightly more advanced implementation caches the parsed & validated query AST so the server only has to worry about calling execute. Subscriptions are similar, except that since we want an async iterable instead of a promise, we call subscribe instead of execute.
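To make the caching step concrete, here is a minimal sketch of a document cache. The helper name createDocumentCache is hypothetical, and parse and validate are passed in as parameters so the snippet runs on its own; in a real server they would presumably be the functions of the same name from graphql-js. The cache is unbounded, which is only reasonable if the set of queries is small and known (persisted queries, for example).

```javascript
// Hypothetical helper: memoize the parse + validate steps per query string.
// `parse(query)` must return a document AST or throw on a syntax error.
// `validate(schema, document)` must return an array of validation errors.
function createDocumentCache({ schema, parse, validate }) {
  const cache = new Map() // query string -> validated document AST

  return function getDocument(query) {
    const cached = cache.get(query)
    if (cached) return { document: cached, errors: [] }

    let document
    try {
      document = parse(query)
    } catch (syntaxError) {
      return { document: null, errors: [syntaxError] }
    }

    const errors = validate(schema, document)
    if (errors.length > 0) return { document: null, errors }

    // Only documents that passed validation are cached
    cache.set(query, document)
    return { document, errors: [] }
  }
}
```

On a cache hit the server skips straight to execute (or subscribe) with the stored document.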

Behind the scenes, the subscribe function does only two things. First, it creates a source event stream. Then, for every source event, it maps the event to a response. It’s simpler than it seems. A source event is nothing more than what gets posted to the PubSub. That event then gets passed to the execute function and out pops the response, which you can send to the client.
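Those two steps can be sketched in a few lines. This is not the graphql-js source, just an illustration of the shape: demoSourceStream stands in for the source event stream and executeForEvent stands in for a call to execute with the event as its rootValue. Both names, and the taskCreated field, are made up for the example.

```javascript
// Step 2: map every source event to a response.
// `executeForEvent` is a hypothetical stand-in for running `execute` on the event.
async function* mapSourceToResponse(sourceStream, executeForEvent) {
  for await (const sourceEvent of sourceStream) {
    const response = await executeForEvent(sourceEvent)
    yield response
  }
}

// Step 1 stand-in: a source event stream that yields whatever was "published".
async function* demoSourceStream() {
  yield { type: 'CreateTaskPayload', taskId: 'task1' }
  yield { type: 'CreateTaskPayload', taskId: 'task2' }
}

// Drain the response stream into an array, the way a socket handler would
// forward each response to the client.
async function collectResponses() {
  const execute = async (event) => ({ data: { taskCreated: { taskId: event.taskId } } })
  const responses = []
  for await (const response of mapSourceToResponse(demoSourceStream(), execute)) {
    responses.push(response)
  }
  return responses
}
```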

There’s a good reason for two streams. Imagine Alice triggers a mutation like CreateTask and we want to tell Bob about that. So, the mutation pushes the source event to the PubSub, but what should the source event include? To answer that question, we need to know what Bob had requested. Maybe Bob wants the entire Task, or maybe just who made the task. Heck, he may just want to know the updated total number of tasks! Since we can’t be sure until we look at Bob’s subscription (which might be on a different server) the best we can do is include the bare minimum: a taskId and the payload type to resolve the response, like CreateTaskPayload (Pro tip: sharing a payload type between mutations and subscriptions greatly simplifies your business logic, see The Hybrid Strategy for GraphQL Subscriptions).
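Here is a toy illustration of one bare-minimum event turning into different responses. There is no GraphQL in it: resolveForSubscriber is a hypothetical stand-in for execution that simply picks the fields each subscriber asked for, and the in-memory tasks map, the field names, and the second subscriber (Carol) are assumptions for the example.

```javascript
// Hypothetical in-memory data; a real resolver would hit the database or a dataloader.
const tasks = new Map([
  ['task1', { id: 'task1', title: 'Ship it', createdBy: 'alice' }]
])

// The source event carries only an ID and the payload type to resolve.
const sourceEvent = { type: 'CreateTaskPayload', taskId: 'task1' }

// Stand-in for GraphQL execution: return only the fields a subscriber selected.
function resolveForSubscriber(event, selectedFields) {
  const task = tasks.get(event.taskId)
  const response = {}
  for (const field of selectedFields) {
    response[field] = field === 'totalTasks' ? tasks.size : task[field]
  }
  return response
}

// Same event, two different subscriptions, two different responses.
const bobResponse = resolveForSubscriber(sourceEvent, ['createdBy'])
const carolResponse = resolveForSubscriber(sourceEvent, ['id', 'title', 'totalTasks'])
```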

The ability for a single event to transform into a bespoke response for each subscriber is powerful, and something GraphQL offers out of the box. The problem arises when we begin to scale: our GraphQL execution service is tightly coupled to the WebSockets and subscriptions that rely on it. In a perfect world, our GraphQL execution service would be stateless, and our subscription service would maintain the subscriptions and WebSocket connections. So let’s build it.

Creating a Stateless Execution Service

In a production application, GraphQL queries and mutations can come in from a variety of sources. From our client app, they arrive as persisted queries. From our GraphiQL admin interface, they come in as a full string that will need to be parsed & validated. If it is a webhook or superuser, it might use a private schema. If the caller is a subscription, we’ll need to pass in a rootValue and hopefully reuse the dataloader to reduce resolution time. No matter the business logic, there is no state preserved by the service. This is important because as we squeeze performance out of it, we only have to focus on improving throughput, not memory management. That means we can incrementally improve the service by introducing graphql-jit, deploying more instances around the globe, and using dataloaders more aggressively (see tips below).

Creating a GraphQL Subscription Service

With query execution solved, all that remains is building a service that holds onto the state: the transport (WebSocket, SSE, or WebRTC) and GraphQL subscriptions. Think of a subscription like a database cursor, except in the form of an async iterable.

The life of a subscription begins with emitting an event to the PubSub (e.g. Redis, RabbitMQ) from a mutation. From there, the PubSub listener needs to look up a list of GraphQL Subscriptions and republish the event to each. It’s a PubSub inside a PubSub, and you can build one in 50 LOCs. Note that while some npm packages offer this functionality, I have found them to be largely inefficient (e.g. unnecessary lookup tables) and at times leak memory. In my experience, if it’s 100 LOCs or less, it’s better to build vs. buy.
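For a sense of scale, here is a rough sketch of that inner PubSub. The class name SubscriptionRegistry and the channel naming are hypothetical; the idea is that the handler for the outer PubSub (Redis, RabbitMQ, etc.) calls publish, and each GraphQL subscription registers a listener.

```javascript
// Hypothetical in-process registry: channel name -> set of listener callbacks.
class SubscriptionRegistry {
  constructor() {
    this.listenersByChannel = new Map()
  }

  // Register a listener; returns a function that removes it again.
  subscribe(channel, listener) {
    let listeners = this.listenersByChannel.get(channel)
    if (!listeners) {
      listeners = new Set()
      this.listenersByChannel.set(channel, listeners)
    }
    listeners.add(listener)
    return () => {
      listeners.delete(listener)
      // Drop empty channels so the map does not grow forever. The identity
      // check keeps a stale unsubscribe from deleting a newer set.
      if (listeners.size === 0 && this.listenersByChannel.get(channel) === listeners) {
        this.listenersByChannel.delete(channel)
      }
    }
  }

  // Call this from the outer PubSub's message handler.
  // Returns how many listeners received the event.
  publish(channel, event) {
    const listeners = this.listenersByChannel.get(channel)
    if (!listeners) return 0
    let delivered = 0
    for (const listener of listeners) {
      listener(event)
      delivered++
    }
    return delivered
  }
}
```

A single Map of Sets is all the lookup it needs, and removing empty channels on unsubscribe is what keeps it from leaking.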

The second trick is to convert that event callback handler into an async iterator. If this looks a little intimidating, don’t worry. The article Understand Async Iterators Without Really Trying teaches you how to do it in 5 minutes using a simple click listener. Why not Node.js streams or observables? Simply because async iterators are native ECMAScript, which is why graphql-js chose to use them.
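One way to do the conversion is sketched below. It assumes a hypothetical subscribeFn(listener) that registers the listener and returns an unsubscribe function; events that arrive before anyone calls next() are buffered, and next() calls that arrive before an event are parked until one shows up.

```javascript
// Turns a callback-style subscription into an async iterator.
// `subscribeFn(listener)` is a hypothetical function that registers the
// listener and returns an unsubscribe function.
function callbackToAsyncIterator(subscribeFn) {
  const pendingEvents = [] // events that arrived before next() was called
  const pendingResolvers = [] // next() calls still waiting for an event
  let done = false

  const unsubscribe = subscribeFn((event) => {
    if (done) return
    const resolve = pendingResolvers.shift()
    if (resolve) resolve({ done: false, value: event })
    else pendingEvents.push(event)
  })

  return {
    next() {
      if (done) return Promise.resolve({ done: true, value: undefined })
      if (pendingEvents.length > 0) {
        return Promise.resolve({ done: false, value: pendingEvents.shift() })
      }
      return new Promise((resolve) => pendingResolvers.push(resolve))
    },
    // Called when the client unsubscribes or the socket closes.
    return() {
      done = true
      unsubscribe()
      // Release any consumers still waiting on next()
      for (const resolve of pendingResolvers) resolve({ done: true, value: undefined })
      pendingResolvers.length = 0
      pendingEvents.length = 0
      return Promise.resolve({ done: true, value: undefined })
    },
    [Symbol.asyncIterator]() {
      return this
    }
  }
}
```

The buffer is unbounded in this sketch, so a slow consumer on a busy channel would need a cap or a drop policy.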

Now, instead of calling subscribe, which uses the default execute function, we call createSourceEventStream, which will return the source event stream we just created. While the function may seem esoteric, this is exactly why it was built. The final step is to transform the source event into a response by asynchronously calling our stateless execution service. While the full implementation is less than 50 LOCs, the gist is even shorter:

async next() {
  // wait for a new source event from the PubSub
  const sourceIter = await this.sourceStream.next()
  // if the event is "done" then there's no value
  if (sourceIter.done) return sourceIter
  // include the socketId of the user that triggered the mutation
  // and the dataLoaderId so we can reuse it
  const {mutatorId, dataLoaderId, rootValue} = sourceIter.value
  // include everything needed to execute the query
  const {socketId, authToken, query, variables} = this.context
  // ignore the listener if they triggered the mutation
  if (mutatorId === socketId) return this.next()
  const result = await callStatelessExecuteService({
    query,
    authToken,
    dataLoaderId,
    variables,
    rootValue,
    socketId
  })
  return {done: false, value: result}
}

And there we have it! The ResponseStream is calling our stateless execution service instead of defaulting to an execute call. As we grow the number of stateless execution services, we can put them behind their own reverse proxy (or even use a hosted service). We can also independently scale our socket servers, which will become critical as we strive to reduce intra-team latency for our growing international user base. But before we do that, we’ll want to make sure we squeeze all the efficiency we can out of each service.

Maximizing Efficiency

Tip #1: Use a DataLoader

Getting all the efficiency out of a stateless execution service begins with the dataloader. As you determine which queries are hot, you can refactor your direct database queries to using a dataloader. This will save duplicate queries, which is extremely useful for graph-type data structures (it’s called GraphQL for a reason!). For example, if a Team requests a User that requests the Team, it’ll only request the Team once. Caching individual DB hits is far more powerful than something that caches the entire query, and far more beneficial for real-time results.
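To show the deduplication on its own, here is a tiny stand-in for the caching half of a dataloader. It is not the dataloader package (which also batches keys requested in the same tick); CachingLoader and fetchOne are names made up for this sketch.

```javascript
// Minimal stand-in for the caching half of a dataloader: one fetch per key
// for the lifetime of the loader (typically a single request).
class CachingLoader {
  constructor(fetchOne) {
    this.fetchOne = fetchOne // (key) => value or Promise<value>
    this.promiseByKey = new Map()
  }

  load(key) {
    let promise = this.promiseByKey.get(key)
    if (!promise) {
      // Cache the promise, not the value, so concurrent loads share one fetch
      promise = new Promise((resolve) => resolve(this.fetchOne(key)))
      this.promiseByKey.set(key, promise)
    }
    return promise
  }
}
```

With this in place, the Team that loads a User that loads the same Team ends up calling fetchOne for that Team only once.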

Tip #2: Reuse DataLoader for Subscriptions

If your business logic allows, you can reuse the same dataloader that your mutation used for the subscription. In practice, this can reduce your resolution time to <1ms for subscription payloads. All that’s required is a dictionary of dataloaders with a TTL on each. Again, nothing that 50 LOCs can’t fix. To reuse the dataloader, simply publish its ID so the subscription service knows which execution service to call.
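A dictionary like that might look roughly like the sketch below. The names (DataLoaderStore, createLoaders) and the "serverId:counter" ID format are assumptions for the example; the server prefix is just one way to let the subscription service work out which execution service holds the warm loaders.

```javascript
// Hypothetical store of per-mutation dataloaders that expire after a TTL.
class DataLoaderStore {
  constructor({ serverId, ttlMs, createLoaders }) {
    this.serverId = serverId
    this.ttlMs = ttlMs
    this.createLoaders = createLoaders // () => a fresh set of loaders
    this.entries = new Map() // id -> { loaders, timer }
    this.counter = 0
  }

  // Called by a mutation. The returned ID is published with the source event.
  create() {
    const id = `${this.serverId}:${this.counter++}`
    const timer = setTimeout(() => this.entries.delete(id), this.ttlMs)
    // Don't keep the Node.js process alive just to expire a cache entry
    if (typeof timer.unref === 'function') timer.unref()
    this.entries.set(id, { loaders: this.createLoaders(), timer })
    return id
  }

  // Called when resolving a subscription payload.
  // Returns null once the entry has expired, so the caller can create fresh loaders.
  get(id) {
    const entry = this.entries.get(id)
    return entry ? entry.loaders : null
  }

  // Drop an entry early, e.g. when every subscriber has been served.
  dispose(id) {
    const entry = this.entries.get(id)
    if (!entry) return
    clearTimeout(entry.timer)
    this.entries.delete(id)
  }
}
```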

Tip #3: Lazily Instantiate DataLoaders

In earlier versions, before every GraphQL execution I would create an object with about 30 dataloaders in it and add that to the GraphQL context. After profiling the heap usage, I found that it was allocating/GCing ~16KB per request! So, by using a getter pattern, I refactored the class to only instantiate a dataloader when used. By using some TypeScript trickery, I was able to maintain the same type-safe guarantees so typos are still caught before runtime.
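The getter pattern itself is small. Below is a plain JavaScript sketch with hypothetical names (LazyLoaders, factories); it only catches a mistyped loader name at runtime, whereas the TypeScript version described above would catch it at compile time.

```javascript
// Hypothetical lazy container: a loader is only built the first time it is requested.
class LazyLoaders {
  constructor(factories) {
    this.factories = factories // loader name -> function that builds the loader
    this.instances = new Map()
  }

  get(name) {
    let loader = this.instances.get(name)
    if (!loader) {
      if (!Object.prototype.hasOwnProperty.call(this.factories, name)) {
        throw new Error(`Unknown loader: ${name}`)
      }
      loader = this.factories[name]()
      this.instances.set(name, loader)
    }
    return loader
  }
}
```

A request that touches two loaders allocates two, not thirty.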

Tip #4: Use graphql-jit

After a GraphQL query is parsed and validated, you’re left with an AST that doesn’t have predictable return values. While this isn’t too important to the developer, it’s hugely important to the V8 JavaScript engine. graphql-jit rewrites the AST into a function that provides predictable return types, which reduces the admittedly non-trivial overhead that GraphQL adds.

Tip #5: Don’t Monitor so Gosh-Darn Always

Hosted GraphQL solutions that offer a monitoring “feature” do more harm than good. The overhead of these services, which check the resolution time for every single GraphQL field, is not trivial. Have a problem that you need to narrow down? Monitor up. But once things look good, don’t accept a 20%+ increase in resolution time as a cost of doing business.

Tip #6: Use Persisted Queries

Replacing the full query string with a hash is both more efficient and more secure. In our app, certain subscription queries were upwards of 15KB. Assuming an MTU of 1500 bytes, that means sending at least 10 packets to the server, a difficult task for a mobile device on the go. A single hash is guaranteed to fit into a single packet. Using a persisted query also means the query is trusted. No annoying “security researchers” making arbitrarily deep queries in an attempt to DoS our server. The only gotcha is that we need to know if the hash refers to a subscription or a query/mutation to dispatch it to the correct service. To do that, you can write a custom hashing function that prefixes the query hash with the operation type.
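A prefixed hash could be as simple as the sketch below. The ID format (first letter of the operation type, an underscore, then a SHA-256 hex digest) and the service names are assumptions, and the operation type is detected naively from the leading keyword, which would not be enough for documents that start with a comment or a fragment.

```javascript
const { createHash } = require('node:crypto')

// Naive detection (an assumption for this sketch): the document starts with
// its operation keyword. An anonymous `{ ... }` document is a query.
function getOperationType(query) {
  const match = /^\s*(query|mutation|subscription)\b/.exec(query)
  return match ? match[1] : 'query'
}

// Hypothetical ID format: "s_<sha256>", "q_<sha256>" or "m_<sha256>".
function hashQuery(query) {
  const prefix = getOperationType(query)[0]
  const digest = createHash('sha256').update(query).digest('hex')
  return `${prefix}_${digest}`
}

// The socket server can now dispatch without looking the query up first.
function routeFor(persistedId) {
  return persistedId.startsWith('s_') ? 'subscription-service' : 'execution-service'
}
```

At 66 characters, the ID fits comfortably in one packet no matter how large the original query was.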

Tip #7: Use Execution Results to Update Subscription State

Whether you use a JWT or a session ID, chances are your WebSocket has some authentication state that a GraphQL mutation may change. For example, if you have a resetPassword mutation, you’ll probably want to force all other connected clients for that user to log out. Simply check the payload type of the result, and handle it appropriately. In practice, this allows us to guarantee the validity of the JWT for the session, which means we only have to check the JWT blacklist when the socket connects. That’s a huge performance win and addresses the #1 concern that’s always brought up in the cringeworthy, never-ending “JWT is bad” debates.
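As a rough sketch of that check: the function below scans an execution result for a ResetPasswordPayload and closes every other socket belonging to that user. It assumes the query selects __typename and userId on the payload, and that each socket object has id, userId, and close(); all of those names are hypothetical.

```javascript
// Hypothetical hook run by the socket server after each execution result.
// Returns the IDs of the sockets it closed.
function handleExecutionResult(result, sockets, mutatorSocketId) {
  const closedIds = []
  for (const payload of Object.values(result.data || {})) {
    // Assumes the operation selected __typename on its payload
    if (!payload || payload.__typename !== 'ResetPasswordPayload') continue
    for (const socket of sockets) {
      const isSameUser = socket.userId === payload.userId
      const isMutator = socket.id === mutatorSocketId
      if (isSameUser && !isMutator) {
        socket.close()
        closedIds.push(socket.id)
      }
    }
  }
  return closedIds
}
```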

Tip #8: Popular Doesn’t Mean Better

What’s the difference between a Junior and Senior Developer? The Senior knows that the sexy hosted solution with the CLI that sets up your project in “3 Easy Steps” is going to be the bane of your existence in 6 months when you’re locked into its walled garden and you need to do something it doesn’t support. Hosted solutions are buggy. A small SaaS can go broke. A megacorp can sunset services with little warning (Google, anyone?). You know what’s sexy? A vanilla GraphQL server on bare metal.

Wrapping Up

There you have it. Everything I’ve learned about GraphQL Subscriptions after 4 years of trial and error. If playing with this stuff is interesting to you, join the fun and PR some open source projects! For example, graphql-jit needs a PR to support subscriptions. If getting paid to write open-source code is your jam and you’d like to do it from anywhere in the world (cheers from Medellín, Colombia) we’re hiring folks to come build the future of remote work.