Scaling RSG Builds With Gatsby’s Data Layer

Two weeks ago, I wrote about how Gatsby can now publish content in under 1 second, even on 10k+ page sites — through an architecture we call Reactive Site Generation (RSG).

Gatsby’s data layer is the reason we can get such good build performance. This has been a key part of our architecture since the very beginning, though it’s not always well understood. In this article, I’m going to take you through how the data layer works, how it helps us rebuild sites up to 20x faster than traditional SSGs, and how we’re continuing to make it even more powerful.

How Gatsby’s Data Layer Works

Gatsby’s data layer (exposed via a standardized GraphQL API) syncs data from any content source (such as Contentful, WordPress, Shopify) into the Gatsby DB (powered by LMDB — the fastest embedded Node.js database), giving you access to a real-time stream of data changes and an embedded cache of your site’s data.

Gatsby builds and rebuilds pages by rapidly performing three steps:

1. Synchronize. Source plugins synchronize updated data from APIs into the embedded Gatsby DB.
   - A CMS, like Contentful, invokes a webhook and the source plugin(s) keep this state up to date.
2. Invalidate. Invalidated page queries are run against the embedded Gatsby DB.
   - Gatsby uses the site’s dependency graph to invalidate pages based on data changes.
3. Render. Gatsby’s page engine uses the query results to render HTML.
   - The Gatsby build process builds only what changed, no more and no less.
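
To make the three steps concrete, here is a minimal sketch of the loop in TypeScript. It is a toy model, not Gatsby's implementation: `MiniDataLayer`, `ContentNode`, `addPage`, and `publish` are hypothetical names invented for illustration, an in-memory `Map` stands in for the embedded DB, and each page is assumed to read exactly one node.

```typescript
// Hypothetical, simplified model of the synchronize / invalidate / render loop.
// None of these names are Gatsby APIs.

interface ContentNode {
  id: string;
  title: string;
}

class MiniDataLayer {
  // Stand-in for the embedded DB: a local copy of the synced content.
  private db = new Map<string, ContentNode>();
  // Which node each page's query reads (one node per page, to keep the sketch small).
  private pageQueries = new Map<string, string>();
  // Dependency graph: node id -> paths of the pages that depend on it.
  private dependents = new Map<string, Set<string>>();
  // Rendered HTML by page path.
  readonly html = new Map<string, string>();
  renderCount = 0;

  addPage(path: string, nodeId: string): void {
    this.pageQueries.set(path, nodeId);
    const pages = this.dependents.get(nodeId) ?? new Set<string>();
    pages.add(path);
    this.dependents.set(nodeId, pages);
  }

  // Step 1: write changed nodes (for example, from a webhook payload) into the local DB.
  synchronize(changed: ContentNode[]): string[] {
    changed.forEach((node) => this.db.set(node.id, node));
    return changed.map((node) => node.id);
  }

  // Step 2: walk the dependency graph to find the pages those changes invalidate.
  invalidate(changedIds: string[]): string[] {
    const dirty = new Set<string>();
    changedIds.forEach((id) => {
      this.dependents.get(id)?.forEach((path) => dirty.add(path));
    });
    return Array.from(dirty);
  }

  // Step 3: re-run each dirty page's query against the local DB and render it.
  render(dirtyPaths: string[]): void {
    dirtyPaths.forEach((path) => {
      const nodeId = this.pageQueries.get(path);
      const node = nodeId === undefined ? undefined : this.db.get(nodeId);
      if (node === undefined) return;
      this.html.set(path, `<h1>${node.title}</h1>`);
      this.renderCount += 1;
    });
  }

  // Run all three steps for one batch of changes; returns the invalidated paths.
  publish(changed: ContentNode[]): string[] {
    const dirty = this.invalidate(this.synchronize(changed));
    this.render(dirty);
    return dirty;
  }
}

const layer = new MiniDataLayer();
layer.addPage("/posts/a", "a");
layer.addPage("/posts/b", "b");
layer.addPage("/posts/c", "c");

// Initial build: all three nodes are new, so all three pages render.
layer.publish([
  { id: "a", title: "A" },
  { id: "b", title: "B" },
  { id: "c", title: "C" },
]);

// A later edit to one post rebuilds exactly one page.
const rebuilt = layer.publish([{ id: "b", title: "B (edited)" }]);
```

After the second `publish`, `rebuilt` contains only `/posts/b`, and the other two pages keep their earlier HTML. The point of the sketch is that the work done per publish scales with what changed, not with the size of the site.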

Gatsby’s data layer makes rebuilds precise and efficient. This is how we can publish CMS changes to the CDN in one second — even for very large sites.

The data layer also enables the embedded Gatsby DB, which is how we speed up building large numbers of pages.

Let’s see how build speeds compare to frameworks without a data layer.

Benchmarking Gatsby vs Next.js SSG

I created a small benchmark to compare the performance of Gatsby’s embedded data approach to directly fetching from a remote CMS API — what’s supported in lower-level frameworks without a data layer.

For the framework, I chose Next.js (though the results would be similar for any SSG/SSR framework). For the CMS backend, I chose WordPress, hosted on Pantheon with the WPGraphQL plugin.

I used both Next.js’ and Gatsby’s default WordPress starters, and then ran the Gatsby and Next build commands on my M1 Pro MacBook.

I tested the benchmark Gatsby and Next.js sites with 1,000, 5,000, and then 10,000 blog posts and timed how long the pages took to rebuild after I made a CSS change to the blog post page template. The builds are all “warm,” meaning all caches have had a chance to fill, to replicate the normal build environment on CI.

Large sites can build 20x faster with Gatsby

Here are the results:

Framework	1,000 Pages	5,000 Pages	10,000 Pages
Gatsby	20s	23s	25s
Next.js SSG	40s	335s (5:35)	500s (8:20)


Why the 20x speed difference? With the data layer, Gatsby can rebuild each page without needing to go back to WordPress. At the start of a build, Gatsby does a quick check (~500ms) with WordPress to see if any data has changed, and then goes straight to building.

Low-level frameworks like Next.js don’t have data layers, so every time they build a page, they must return to the source of truth to revalidate their data. Even for a simple CSS change, Next.js needs to call WordPress 10k times, once for each page.

The 20x difference in speed between Gatsby and Next.js is simply Gatsby making 9,999 fewer API calls than Next.js.
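
The call-count argument can be sketched with a toy model. This is not how either framework is implemented: `CountingApi`, `rebuildFetchingEachPage`, and `rebuildFromLocalCache` are hypothetical names, and the "API" is an in-process counter rather than a real WordPress endpoint. The sketch assumes no content has changed, which matches the CSS-only rebuild in the benchmark.

```typescript
// Hypothetical model: count remote API calls for a full rebuild.
class CountingApi {
  calls = 0;

  // Fetch one post's content from the (pretend) remote CMS.
  fetchPost(id: number): string {
    this.calls += 1;
    return `Post ${id}`;
  }

  // Ask the (pretend) remote CMS whether anything changed since the last build.
  hasChanges(): boolean {
    this.calls += 1;
    return false;
  }
}

function renderPage(body: string): string {
  return `<article>${body}</article>`;
}

// Without a local data layer: every page goes back to the API for its data.
function rebuildFetchingEachPage(api: CountingApi, pageCount: number): string[] {
  const pages: string[] = [];
  for (let id = 0; id < pageCount; id++) {
    pages.push(renderPage(api.fetchPost(id)));
  }
  return pages;
}

// With a local cache: one change check, then render everything from the cache.
function rebuildFromLocalCache(api: CountingApi, cache: Map<number, string>): string[] {
  if (api.hasChanges()) {
    // A real implementation would sync the changed records into the cache here.
  }
  const pages: string[] = [];
  cache.forEach((body) => pages.push(renderPage(body)));
  return pages;
}

const pageCount = 10_000;

const remoteApi = new CountingApi();
const perPageOutput = rebuildFetchingEachPage(remoteApi, pageCount);

// Assume an earlier build already filled the local cache.
const cachedApi = new CountingApi();
const localCache = new Map<number, string>();
for (let id = 0; id < pageCount; id++) {
  localCache.set(id, `Post ${id}`);
}
const cachedOutput = rebuildFromLocalCache(cachedApi, localCache);

const savedCalls = remoteApi.calls - cachedApi.calls;
```

Both strategies produce the same HTML. The per-page strategy makes 10,000 calls, the cached strategy makes one, and `savedCalls` is the 9,999 mentioned above.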

Fewer API calls mean the Gatsby build speed is CPU-bound: Gatsby saturates my laptop’s 8 cores, rendering pages at around 1,200 pages per second. The Next.js build speed is bounded by the speed of the remote API, so it can only render around 20 pages per second.
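
As a quick check on these figures, the speedup and end-to-end throughput can be derived directly from the results table. The sketch below only restates the table's numbers; the variable and field names are mine. Note that these are whole-build rates (pages divided by total build time), so they need not match the peak rendering rate quoted above.

```typescript
interface BenchmarkRow {
  pages: number;
  gatsbySeconds: number;
  nextSeconds: number;
}

// Build times copied from the results table.
const benchmark: BenchmarkRow[] = [
  { pages: 1000, gatsbySeconds: 20, nextSeconds: 40 },
  { pages: 5000, gatsbySeconds: 23, nextSeconds: 335 },
  { pages: 10000, gatsbySeconds: 25, nextSeconds: 500 },
];

const summary = benchmark.map((row) => ({
  pages: row.pages,
  // How many times faster the Gatsby build finished.
  speedup: row.nextSeconds / row.gatsbySeconds,
  // End-to-end throughput: pages divided by total build time.
  gatsbyPagesPerSecond: row.pages / row.gatsbySeconds,
  nextPagesPerSecond: row.pages / row.nextSeconds,
}));

summary.forEach((s) => {
  console.log(
    `${s.pages} pages: ${s.speedup.toFixed(1)}x faster ` +
      `(Gatsby ${s.gatsbyPagesPerSecond.toFixed(0)} pages/s, ` +
      `Next.js ${s.nextPagesPerSecond.toFixed(0)} pages/s)`
  );
});
```

The speedup grows with site size: 2x at 1,000 pages, about 14.6x at 5,000, and 20x at 10,000, where the Next.js build works out to 20 pages per second.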

The larger the site, the more site operations are dominated by the weight of data operations. The more data, the more the speed of Gatsby’s embedded data cache shines.

Gatsby’s data layer is designed to let the framework access data cheaply. Instead of constantly reaching out to a remote API, Gatsby syncs data to its embedded DB, where it can update the site quickly and cheaply.

You can think of Gatsby’s data layer as a specialized Content API middleware that stitches together arbitrary backends into a unified, fast, embedded GraphQL API. (This has been the vision for Gatsby from the very beginning, when we called it the content mesh.)

The Perils of a DIY Data Layer: Cost and Complexity

There are of course ways to get around not having a data layer, but they add cost and complexity.

For example, you could improve the API speed by putting WordPress on a beefier server ($$$), or write your own server and client caching layers (complicated and error-prone).

Another approach is to defer page building (known as ISR in Next.js) — if you can’t build pages upfront quickly enough, you can defer building so at least the most visited pages get updated quickly.

This helps, but it isn’t free. If you start a server with 10,000 deferred pages, you’ve incurred a 10k “page debt”: you’re still on the hook to render those pages as visitors come in. It’s a gamble that you can pay off your “page debt” before visitors request those pages. This is easy when your site is small but gets harder and harder as your site size and traffic grow.
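
One way to picture “page debt” is as a set of pages that have not been rendered yet and that shrinks only when someone requests one of them. The sketch below is an illustrative model, not Next.js code: `DeferredSite`, `pageDebt`, and `handleRequest` are hypothetical names, and rendering is reduced to removing a path from a set.

```typescript
// Hypothetical model of deferred rendering: pages render on first request.
class DeferredSite {
  // Paths that have not been rendered yet.
  private pending: Set<string>;

  constructor(paths: string[]) {
    this.pending = new Set(paths);
  }

  // Number of pages still waiting for their first render.
  get pageDebt(): number {
    return this.pending.size;
  }

  // This sketch assumes every requested path belongs to the site.
  // The first visitor to a page pays the render cost; later visitors do not.
  handleRequest(path: string): "rendered-on-demand" | "served-from-cache" {
    if (this.pending.delete(path)) {
      return "rendered-on-demand";
    }
    return "served-from-cache";
  }
}

const paths: string[] = [];
for (let i = 0; i < 10_000; i++) {
  paths.push(`/posts/${i}`);
}
const site = new DeferredSite(paths);

const firstVisit = site.handleRequest("/posts/1");
const repeatVisit = site.handleRequest("/posts/1");
const otherVisit = site.handleRequest("/posts/2");
const remainingDebt = site.pageDebt;
```

After three requests touching two distinct pages, `remainingDebt` is 9,998. The debt falls only as fast as visitors happen to reach new pages, and each of those first visitors waits on a render.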

With Gatsby’s Reactive Site Generation, you know your “page debt” is paid off because the Gatsby Data Layer is taking care of things behind the scenes. You’re providing a fast experience both for developers and for your customers. With the embedded database, rebuilding even thousands of pages can be done in seconds.

When traffic spikes, or you’re shipping a change, you don’t need to worry that visitors will see errors or an outdated version of the site. Many Gatsby users remark how nice it feels to realize they can confidently make updates on high-traffic days without worrying that they might cause problems.

What’s Next For Gatsby’s Data Layer

Gatsby’s Data Layer is at the heart of Gatsby, so we have big plans to extend it and make it even more useful. Our internal codename for this project is 🥁 Valhalla. We believe it’s a game changer not just for Gatsby but for the entire modular web. You’ll be hearing lots more about it in the coming months, but in the meantime here’s a sneak peek at three important problems we’re tackling:

⚡ Accelerated Sourcing. For larger sites, the initial sourcing of content from the API can be slow. We’re improving this by adding a remote data layer cache that sits between the remote API and each Gatsby instance. We are seeing 10-100x performance gains. For example, the remote cache dropped sourcing time for a Drupal instance with 30k blog posts from 16 minutes to 10 seconds, nearly 100x faster!

Extending the Source Plugin ecosystem. We’re working on plugin starters and toolkits that will make it easier for developers to create their own source plugins.

✨ Runtime GraphQL. By far the most requested feature for the data layer! Soon you’ll be able to query GraphQL — not just at build time — but from functions and during SSR.
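
To illustrate the general idea behind the remote cache described under Accelerated Sourcing, here is a generic read-through cache sketch. It is not a description of the Valhalla design, which has not been published here: `SharedSourceCache`, `source`, and `slowCmsFetch` are hypothetical names, and the sketch deliberately leaves out invalidation and incremental updates. It also works through the speedup arithmetic from the Drupal example.

```typescript
// Hypothetical read-through cache shared by several build instances.
// Invalidation and incremental updates are intentionally left out.
class SharedSourceCache<T> {
  private snapshot: T[] | null = null;
  private fetchFromOrigin: () => T[];
  originFetches = 0;

  constructor(fetchFromOrigin: () => T[]) {
    this.fetchFromOrigin = fetchFromOrigin;
  }

  // The first caller pays for the slow origin fetch; later callers reuse the snapshot.
  source(): T[] {
    if (this.snapshot === null) {
      this.originFetches += 1;
      this.snapshot = this.fetchFromOrigin();
    }
    return this.snapshot;
  }
}

// Stand-in for a slow CMS export; a real one would page through a remote API.
function slowCmsFetch(): string[] {
  return ["post-1", "post-2", "post-3"];
}

const sharedCache = new SharedSourceCache<string>(slowCmsFetch);

// Three separate build instances source the same content.
const instanceA = sharedCache.source();
const instanceB = sharedCache.source();
const instanceC = sharedCache.source();

// Speedup from the Drupal example: 16 minutes down to 10 seconds.
const beforeSeconds = 16 * 60;
const afterSeconds = 10;
const sourcingSpeedup = beforeSeconds / afterSeconds;
```

All three instances receive the same content, but the origin is fetched only once. The arithmetic gives `sourcingSpeedup` of 96, which is where the roughly 100x figure comes from.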

Conclusion

Gatsby is an amazing modern website framework. Increasingly, it’s also a high-performance, cloud-scale service for content publishing and aggregation. Whether your site has 1k, 10k, or 100k pages — publishes are fast.

Much of the power of Gatsby comes from its data layer. It is the thing that makes Gatsby unique in the industry and why it is so well suited to large content sites. I’m incredibly proud of what the Gatsby team has accomplished over the last year to make Gatsby even faster and more scalable.

As more organizations look to adopt a decoupled architecture, choosing Gatsby means they will reap the rewards of a no-compromise architecture that delivers a great developer experience and a great customer experience. And things are only getting better from here.
