The Unavoidable Truth Of Moving Fast And Breaking Things

Editor’s note: Andrew McCollum is a co-founder of Facebook and investor in Opbeat, an ops platform for developers.

“Move fast and break things”: It’s one of the principles that has guided Facebook’s development process since its earliest days. These five words encapsulate a philosophy of rapid development, constant iteration and the courage to leave the past behind. Of course, some might wonder why you couldn’t just stop at the “move fast” part. The truth is that breaking things is unavoidable.

Even disregarding features that “work” but need to be broken in order to continue innovating (for example, how the profile has changed, often dramatically, over the years), Facebook is a social product connecting over a billion people across the globe. It simply isn’t possible to simulate the unique strains that this level of activity creates. More importantly, it’s usually impossible to understand how people will use a feature or react to a change until it has been implemented and pushed out into production (though typically first to a subset of users).
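To make that last idea concrete, here is a minimal sketch of how a gradual rollout to a subset of users might be gated. It is an illustration only, not Facebook’s actual tooling; the function and parameter names (`in_rollout`, `rollout_percent`, the "new_profile" feature) are hypothetical.

```python
import hashlib

def in_rollout(user_id: int, feature_name: str, rollout_percent: float) -> bool:
    """Deterministically decide whether a user sees a feature.

    Hashing the (feature, user) pair gives each user a stable bucket in
    [0, 100), so the same user always gets the same answer and the
    audience grows smoothly as rollout_percent is raised.
    """
    digest = hashlib.sha256(f"{feature_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 100.0  # value in [0, 100)
    return bucket < rollout_percent

# Example: expose a hypothetical "new_profile" layout to 5% of users first.
for uid in (1, 2, 3):
    print(uid, in_rollout(uid, "new_profile", rollout_percent=5.0))
```

Because the bucketing is deterministic, raising the percentage only adds new users to the audience; nobody who already saw the feature is flipped back, which keeps the feedback from that subset coherent.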

A billion people will pretty quickly try every possible way to interact with your code, so features will be used in ways you never expected, and sometimes things will break in ways you didn’t anticipate. Because you can’t get that level of feedback until things reach production, moving fast is inextricably tied to the process of deployment.

In Facebook’s early days, when it was just a social network for Harvard students, “deployment” simply meant pushing a new version of the dozen or so PHP files that comprised Facebook up to a single Apache server. There wasn’t a clear line between development and deployment, and there was certainly no formal release process. As the social network launched at more and more colleges, the site quickly expanded to dozens of servers, and then to our first colocation center, where we racked all the servers ourselves in one epic all-night session.

Moving fast is inextricably tied to the process of deployment.

As Facebook grew, the deployment process quickly became more structured, which led us to realize a necessary corollary of the “move fast and break things” philosophy: trusting and empowering engineers. This discovery came when Facebook hired the first members of its ops team, who immediately wanted to change the way Facebook handled deployment. Facebook was less than a year old, and at that time, any engineer could push code live to production — once the changes had been checked by other engineers, of course.

The new ops team wanted to change this process by creating a staging environment that would be a necessary stopping point before any code touched production. Once there, each release would be thoroughly tested by a QA team to make sure that nothing would break when it was pushed live.

Adam D’Angelo, who would later become Facebook’s CTO, led the charge in resisting this change. He argued not only that this was impossible due to the unique characteristics of Facebook (mentioned above), but also that it would dramatically slow Facebook’s speed of development. In the end, Mark Zuckerberg agreed with Adam, and the conversation changed to how we could build tools to support this kind of rapid development.

As Facebook continued to scale — and as the number of servers grew into the hundreds — it soon became impractical to push new versions of the code live at a moment’s notice. Releases moved to weekly cycles timed to low points in the site’s usage, which created an opportunity to bring the idea of staging back, albeit in a modified form. Changes were first pushed internally to the version of Facebook that employees used, effectively making the whole company part of the QA team.

Engineers still had broad latitude to decide when their code was production-ready, but there was a window of time during which changes could see significant usage internally before they reached the outside world. This also allowed the team to get feedback from a larger group for features that were still in flux.
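As an illustration of that internal-first window, a release gate might look roughly like the sketch below. This is a simplified assumption about how such a check could work, not a description of Facebook’s actual system; the `User` type, `is_employee` flag and `feature_enabled` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: int
    is_employee: bool  # internal accounts see changes first

def feature_enabled(user: User, external_rollout_percent: float) -> bool:
    """Enable a change for all employees, then ramp it to outside users.

    Employees act as the first line of QA; external users are admitted
    gradually by bucketing their id into [0, 100) and comparing the
    bucket to the current rollout percentage.
    """
    if user.is_employee:
        return True
    bucket = user.user_id % 100  # crude but stable per-user bucket
    return bucket < external_rollout_percent

# Start internal-only (0% external), then ramp the percentage up over time.
print(feature_enabled(User(user_id=7, is_employee=True), external_rollout_percent=0))
print(feature_enabled(User(user_id=7, is_employee=False), external_rollout_percent=0))
```

The design choice the sketch tries to capture is that engineers keep control of when their code ships, while the internal audience absorbs the first round of breakage before anyone outside the company sees it.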

While deployment has become more advanced and is no longer closely tied to weekly cycles, for the most part this is still the system in place today, and the ops team continues to push toward making it faster for engineers to release their code (though with some important safeguards). It’s always easier to see the right answers in hindsight, but Facebook’s speed of technical advancement turned out to be one of its greatest assets in the battle with its early rivals.

With all of the cloud platforms out there, it’s easier to deploy and scale web services than ever before. This ease is leading many startups to forgo a dedicated ops team for longer, leaving the duties of deployment and server management to developers. This requires developers to fix things when they break, but also enables them to move faster. It’s certainly a dramatic change from when Facebook launched more than a decade ago, but it’s empowering teams that don’t have the resources of Facebook to move even faster and break even more things.