How we’re Scaling Gatsby To Millions Of Pages

This week, I built a Gatsby site with 1 million pages in just one minute — and soon, you can too.

But before I dive in and explain how, I want to thank you for coming along on this journey into the new Gatsby and our capabilities.

In the first blog post of this three-part series, I introduced Reactive Site Generation (RSG) and how Gatsby’s RSG model deployed on Gatsby Cloud enables content teams to update their sites in seconds — 100x faster than other hosts.

Then, in the second blog post, I wrote about how Gatsby’s data layer and our embedded LMDB database make a Gatsby/WordPress site build 20x faster than a Next.js/WordPress site.

Now, in this third blog post, I’ll explore the fundamental challenges with scaling content publishing and how Gatsby’s unique architecture is allowing us to solve content publishing — no matter the size of the site.

Gatsby is being adopted by many large organizations with large content sites. We want to ensure that no matter how large the site and how often its updated — content publishing remains smooth and simple.

This is the story of how we’ve scaled Gatsby’s core abstractions, the data layer and GraphQL, 10x over the past 5 years and how the next evolution will let us scale 100x. We are building towards this future on Gatsby Cloud, where sites with hundreds of thousands of pages are possible today, and millions will be amazing in the future — powered by any CMS.

Like all good scaling stories, we start vertical and are now going horizontal.

Content publishing has one job

Let’s start with content publishing.

Content publishing systems ultimately have one job. Their job is to write updates to the CDN cache for the site as quickly as possible when data and code change. If a marketer changes some copy, they wish it live, immediately. If a designer pushes a CSS change, they wish it live across the site, immediately.

Instant publishing is both the experience we want (who ever wants to wait?) but also often the experience we need e.g. to rapidly update a live news page or show the latest inventory levels in an ecommerce store.

To publish content means updating the CDN cache. And that means recreating potentially thousands of files and writing them to the CDN layer.

There are many ways to update site files and write to a CDN cache. Three common ways include:

Static Site Generators (SSG) recompute the site on demand in CI jobs and then write files to the CDN
Server Side Rendering (SSR) frameworks run processes which respond to user requests and control what the CDN caches with Cache-Control headers — recreating files gradually as necessary
Gatsby’s Reactive Site Generation (RSG) uses its dependency graph to calculate which parts of the site to rebuild and then writes updated files to the CDN

How RSG updates the CDN in seconds

RSG is fast in part because its fine-grained reactivity means we only update pages that actually need updating. We don’t need to do any work for other pages on the site.

If you have a million-page site and you update a typo in the CMS that’s only on one page, that’s all we need to rebuild.

The median publish on Gatsby Cloud takes 5 seconds. The first second is syncing data from a CMS. The next 1-2 seconds is updating internal site state e.g. modifying the list of pages and calculating what pages need rebuilt. And then the final 2-3 seconds is rebuilding pages.

This is true even for sites with thousands of pages. Most Gatsby sites can build hundreds of pages per second as all data reads are against our embedded LMDB database. So even if a few hundred pages need to be rebuilt, it still takes just a few seconds.

This scales fairly well into the high tens of thousands of pages. Take Penn State University’s news site, powered by Gatsby and Drupal. Penn State has 70k pages with 100s of people publishing 100s of times a day, but a median publish time of just 30 seconds.

But often people ask us if Gatsby can scale another 10x or 100x — to hundreds of thousands or millions of pages. And the answer is… we’re almost there.

We’re working on the last piece: speeding building and deploying massive numbers of files.

Scaling Vertically

We’ve done a lot of great work the last few years to speedup Gatsby’s build times. One key piece last year, moving from v3 to v4, was moving page building from single-core to multi-core on a single machine.

Before we launched v4, we were restricted to a single core and single process because of Gatsby’s local database, which was stored in memory. Because Node.js is a single-process runtime, there’s no built-in way to cheaply share data across multiple Gatsby worker processes.

Enter LMDB.

LMDB is an embedded key-value database — embedded meaning it runs in the same process as the application code. It’s similar in many respects to SQLite and in addition to Gatsby, powers tools like Parcel, OpenLDAP, and Mellisearch.

After a lot of investigation and prototyping last year, we picked LMDB to power the Gatsby DB. Remarkably, when moving query running from our in-memory db to LMDB, there was only a slight reduction in performance.

In addition to being incredibly fast, LMDB lets multiple processes read from the same database. So now we can spin up additional Gatsby worker processes and each has access to the same data! So our read capacity increases linearly by each worker process we can add. If one worker can run 100 queries / second then five workers can run 500 queries / second. We call this feature Parallel Query Running.

We found that even for moderately sized sites, like a 10,000-page markdown site building on my local machine, this one step could accelerate Gatsby build time by 2x.

Amdahl’s law

How far can this go? How much faster can we make Gatsby builds?

The answer to this is provided by Amdahl’s law. It answers the question of how much faster a task can go if you give it more resources.

The algorithm is that for any given compute task, you divide the work into parts which must be done serially and which parts can be parallelized. The serial work generally must be done in a single process and can’t be sped up by giving the task more resources. The parallel work on the other hand can be sped up by giving the task more resources.

So for a simple example, if a task takes 1 minute and only ½ the work is parallelizable, then doubling the resources will drop the total task duration to 45 seconds as the serial part remains at 30 seconds and the parallelizable part halves to 15 seconds.

So which parts of Gatsby are serial and which parts are parallelizable? Gatsby’s builds have two phases. The first phase we update the site state (sourcing data, update pages, etc.) and decide what needs to be built. In the second phase, we build.

This central idea around parallelization is why Gatsby v4 has been able to reduce build times for many customers. Running queries for large, content-heavy websites is a large chunk of the build process, so splitting the work across workers means faster builds.

Scaling Horizontally

Scaling vertically works well — up to a point. I’ve spun up huge cloud VMs e.g. with 48 CPUs and 128GB ram and builds fly. But at some point, you start running into additional bottlenecks:

Disk I/O can become a bottleneck as even very fast SSDs can only read/write so many thousands of files / second.
Network I/O is a very common bottleneck. After writing out files you upload the files to another system e.g. a cloud bucket for serving. On many services I’ve observed that uploading often takes far longer than the actual build.

At some size of site, it’s simpler to horizontally scale to get around single-machine bottlenecks.

The next iteration of Gatsby Cloud that builds a million pages in one minute will use horizontal scaling.

How horizontal scaling works is by partitioning work across multiple machines. With 10 worker machines, each only does 10% of the work in parallel meaning page building is 10x faster.

Each worker runs on its own machine listens on a pub/sub channel for the site leader to announce a new build. When the build starts, the leader updates the site state internally, syncs this to each worker, and then instructs each worker what partition of pages it should build.

This is almost exactly the same as Parallel Query Running on one machine except now we can partition it across as many machines as we like.

Building 1 million pages in 1 minute

A 1 million page site is a very very large site. A simple analogy — one thousand seconds is about 17 minutes. One million seconds is 12 days.

To build the 1 million page site, I set up a 45 machine cluster with each running a worker instance. Each machine had 4 CPUs for a total of 180 CPUs.

The site leader took about 10 seconds to update its state, 1.7s to finish syncing updates to each worker, and then another 48 seconds for the workers to build their partition.

Each worker built 22k pages. During active page building, the cluster was building at a rate of 21k pages / second which means 42k files / second were being written to disk which is about 460MB/second. The total size of the generated site was 22.2GB.

The really exciting part about this to me is that adding or removing workers lead to linear increases or decreases in build time. In distributed computing terms, Gatsby builds are embarrassingly parallel — meaning there’s little to no overhead to parallelize work.

This means we can keep build times constant for different size sites simply by adding more workers. So a 10k page site with one worker would build in the same time as a 100k page site with 10 workers and a one million page site with 100 workers.

Why scaling RSG is far easier than scaling SSG & SSR

Software scales until it hits a bottleneck. That bottleneck can be the algorithm, CPU, memory, disk I/O, network I/O, databases, APIs to other systems, etc.

Gatsby is carefully designed so that we can scale to any size of site without hitting bottlenecks. We can do this with thoughtful design but also, importantly, we take ownership of far more of the problem of content publishing than most frameworks and cloud services. The tight vertical integration of Gatsby running on our custom cloud infrastructure on Gatsby Cloud means we can guarantee smooth, fast content publishing experiences.

Again, we see RSG or Reactive Site Generation as something distinctly different to other paradigms for building on the web: notably SSG (Static Site Generation) and SSR (Server Side Rendering). Let’s talk about some of the bottlenecks or challenges you may face with these other modes.

SSG is bottlenecked by single-machine limitations — the number of CPU and amount of memory, etc. Also uploading thousands of files often takes much longer than the build itself. In addition, if you’re using a framework without a data layer, hitting a remote API can slow down the build significantly as I showed in my most recent blog post where a Next.js SSG site built 20x slower than Gatsby.

SSR frameworks are generally run on multiple machines so can scale up CPU/memory resources somewhat easily. But scaling up data reads remains very difficult. As your site grows, your CMS needs to support higher and higher rates of reads. Your framework might help you scale compute 10x but scaling your CMS API 10x is on you and can be quite difficult.

Firecracker microVMs

We’re rebuilding the backbone of our Gatsby Cloud infrastructure on top of Firecracker microVMs — which seems to be growing increasingly common in some of the best Cloud infra startups. Beyond the normal VM features, Firecracker offers an incredible superpower — it can snapshot a running process to disk and resume it in as little as 125ms!

This will have huge benefits across Gatsby Cloud for our functions, SSR, and site builder services. But it also enables this new horizontal scaled architecture with on-demand workers.

Most changes to sites don’t require a 180 CPU cluster to build. Most changes only update a few pages and a single CPU can handle this in a few hundred milliseconds.

With Firecracker, we can create on-demand clusters — but only as needed. Firecracker lets us create snapshots of a site worker and then spin up as many as we need in order to build out the pages. If the million-page site needs 1000 pages built, we’ll spin up one worker. If it needs 20k pages built, we’ll spin up a 20 worker cluster.

And this smooth scaling up and down of resources is only possible because Gatsby’s Data Layer and embedded database lets us scale up data reads at the same time as compute.

This is both a hyper-efficient infrastructure and an incredible content publishing experience.

Penn State’s build times are 30 seconds because builds get bottlenecked by the limitation of running on a single machine. On this coming architecture, we can dynamically 10-50x their worker capacity as needed so all site publishes are live in < 10 seconds. An incredible content publishing experience for a large newsroom.

Gatsby Cloud — a Content Publishing Platform

My software career has spanned the rise of VMs and then cloud services. And I’ve always been inspired by how amazing cloud infrastructure teams can tackle and solve extraordinarily difficult problems and offer these as services to the entire world.

Gatsby Cloud is built on top of Google Cloud Storage (an S3 equivalent). We store billions of files for sites — and we rarely worry about the underlying abstractions. Gatsby Cloud is built on top of cloud databases like Spanner and DynamoDB. We have billions of rows in Spanner and rarely think about it.

Our ambition is to do the same thing for your content site. Gatsby Cloud is making large-scale content publishing (one million pages, and beyond) a solved problem.

With upcoming improvements to horizontal scaling, improvements in Gatsby v5, Valhalla, and still more to come, we’re getting ever closer. To ~~infinity~~ one-million pages, and beyond!

Share on Twitter Share on LinkedIn Share on Facebook Share via Email