How Roblox completely transformed its tech stack

And now has full control of its technological destiny

Picture yourself in the role of CIO at Roblox in 2017.

At that point, the gaming platform and publishing system that launched in 2005 was growing fast, but its underlying technology was aging, consisting of a single data center in Chicago and a bunch of third-party partners, including AWS, all running bare metal (nonvirtualized) servers. At a time when users have precious little patience for outages, your uptime was less than two nines, or under 99% (five nines, or 99.999%, is considered optimal).
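To put the "nines" shorthand in concrete terms, here is a minimal sketch that converts an availability level into expected downtime per year. The function names and the 365-day year are assumptions made for illustration; they do not come from Roblox or Williams.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def uptime_for_nines(nines: int) -> float:
    """Availability as a fraction for a given number of nines (2 -> 0.99)."""
    return 1 - 10 ** (-nines)


def downtime_hours_per_year(uptime: float) -> float:
    """Hours per year a service is unavailable at the given uptime fraction."""
    return (1 - uptime) * HOURS_PER_YEAR


if __name__ == "__main__":
    for nines in (2, 3, 4, 5):
        uptime = uptime_for_nines(nines)
        hours = downtime_hours_per_year(uptime)
        print(f"{nines} nines: {uptime:.3%} uptime, ~{hours:.2f} hours down per year")
```

By this arithmetic, two nines allows roughly 87.6 hours of downtime a year, while five nines allows a little over five minutes.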

Unbelievably, Roblox was popular in spite of this, but the company’s leadership knew it couldn’t continue with performance like that, especially as it was rapidly gaining in popularity. The company needed to call in the technology cavalry, which is essentially what it did when it hired Dan Williams in 2017.

Williams has a history of solving these kinds of intractable infrastructure issues, with a background that includes a gig at Facebook between 2007 and 2011, where he worked on the technology to help the young social network scale to millions of users. Later, he worked at Dropbox, where he helped build a new internal network, leading the company’s move away from AWS, a major undertaking involving moving more than 500 petabytes of data.

When Roblox approached him in mid-2017, he jumped at the chance to take on another major infrastructure challenge. While the company is still in the midst of the transition to a modern tech stack today, we sat down with Williams to learn how he put it on the road to a cloud-native, microservices-focused system with its own network of worldwide edge data centers.

Scoping the problem

Williams joined Roblox in September 2017 as VP of Corporate and Production Engineering and spent a couple of months just figuring out the current state and how the networking and resources were being allocated. He knew things were not going well, but his observation phase deepened his understanding of the issues facing the company.

“I spent the first two months just sort of experiencing what Roblox was like. I found that we had what we call ‘site events,’ which are essentially [business interruption] events, and we had performance issues due to our dependency on third-party resources and providers,” Williams told TechCrunch.

When Williams went to the board to get his initial round of funding, he explained the scope of the problem and how he wanted the company to reduce its dependency on third-party resources and take control of its own destiny by building its own data centers.

“I said if we build this thing and we own it, we will have total control over reliability and performance and based on my experience, having control over those things will create more trust and create more end-user stickiness where end-user growth should move up and to the right,” he said.

As you might expect, the board gave him the initial go-ahead. At this point, it’s worth noting that the company had around 64 million monthly active users (MAUs). This is a key metric for a gaming platform like Roblox, and, if Williams was right, the transformation he was about to lead would show the executive team that the investment was worth it by greatly improving that number.

Becoming thoroughly modern

Williams was there because of his extensive experience building systems like this. While no two systems are exactly alike, he knew he could use what he had learned along the way to help transform Roblox’s tech stack.

“A lot of what I helped with at Dropbox was derived from things that I did at Facebook, and though every place is different, there’s a lot of lessons that we can express at every place,” he said.

He knew that he needed to completely change how Roblox was working. “We went from this bare metal operating system environment where we leveraged or built in third-party networks, where we were really only responsible for game serving, and we evolved into a modern containerized Linux environment with our own edge nodes,” Williams explained.

To add to the degree of difficulty of getting to that end goal, he had a time crunch because a big bill was coming due for the Microsoft Windows Server operating system licenses that the company used to run its bare metal servers. That created a sense of urgency to move the system over to an open-source Linux environment much faster than it otherwise might have.

Such an undertaking required a substantial team, but for starters Williams had a dozen people consisting of some holdover Roblox engineers and a handful of new people. He doubled the team in the first year. By the second year, that team had grown to 50, as he hired people from places like Netflix, Facebook, Uber and Dropbox who had experience running modern web-scale infrastructure.

By the end of the first year, the company had made substantial progress. For starters, they opened 10 points-of-presence (PoPs), which gave Roblox multiple edge networking nodes located across the world. That presence at the edge is essential for a gaming platform where users expect near-instantaneous response while playing.

The new PoPs gave them the ability to virtualize half of their server capacity while moving to containerized game delivery running on Linux and improving performance to 99.5% uptime. By the end of last year, Roblox had added nine more PoPs and improved to 100% containerized delivery.

By this point the company owned 100% of its server capacity, meeting one of its major goals to control its own destiny. The moves were starting to bear fruit: By the end of 2018, the company had grown to 80 million MAUs.

Pushing buttons and pulling levers

As Williams pushes his team through this type of process, he sees three main levers at his disposal: reliability, performance and cost. “I believe in these cycles where reliability is greater than performance is greater than cost.”

He added, “Those are the three levers that we have available on the infrastructure side to pull, where costs should never be the first lever. Cost should be the outcome in terms of when we get to optimization mode where if we can improve reliability and performance, we will invest in that.”

For all the progress they made on the hardware side, the company’s shift to a microservices architecture, where Roblox breaks down a big chunk of monolithic code into more manageable pieces, remains a work in progress. To facilitate the switch, Williams says that, ironically, they created one big service to help push the infrastructure move.

Now they are working to break that up into smaller sets of services, and he says they are perhaps 10% of the way on that effort, but he hopes to finish by the end of next year. While they straddle these two worlds, they want to have a single system to push code between the two projects instead of creating different workflows.

“We recognize that we don’t want to have separate building-release pipelines, as we think that would just create cognitive load. So we’re working through a single pipeline that can manage both the big monolith as well as the new microservices architecture,” Williams said.

He expects it to take 12-18 months to shift 90% of the remaining environment to this new microservices approach, but he knows from experience that the last bits will be the toughest. “We have to accept that some small ratio, say 10% of the pool, will operate in this sort of legacy way [ … ], but we accept that as we work to evolve over time. I would say over the next two years, we would be completely migrated over to this new system, which is a feat in itself,” he said.

If you are looking for proof that the investment the company has made in building its own infrastructure was worth it, consider that the platform’s uptime since Williams made his initial pitch to the board has improved from less than two nines to nearly four nines (99.95%) this year. Over the same period, MAUs grew from 64 million in Q4 2017 to over 150 million in July (the most recent number it has), a growth rate of over 130% and the very kind of user growth that Williams had predicted.
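The growth figure in that closing comparison can be checked with simple arithmetic. The sketch below uses the two MAU numbers cited in the article, treating "over 150 million" as exactly 150 million, so the result is a lower bound; the variable and function names are illustrative assumptions.

```python
def growth_rate(start: float, end: float) -> float:
    """Fractional growth from start to end (1.0 means a 100% increase)."""
    return (end - start) / start


# MAU figures as cited in the article; "over 150 million" is rounded down
# to 150 million here, so the computed rate is a conservative estimate.
mau_q4_2017 = 64_000_000
mau_july = 150_000_000

rate = growth_rate(mau_q4_2017, mau_july)

if __name__ == "__main__":
    print(f"MAU growth: {rate:.1%}")
```

That works out to about 134%, consistent with the "over 130%" figure.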