Last week, Facebook was affected by a glitch that sent what appear to be thousands of private messages to the wrong people — a very alarming security breach given the amount of data 400 million users have entrusted to the service. News of the bug hit the press, Facebook issued a typically vague statement saying very few people were affected and that an investigation was looking into the matter, and that was that.
Most people probably just shrugged their shoulders at the news, but it’s yet another blemish on the company’s security record. This isn’t the first time Facebook has run into security issues, and I’ve grown increasingly concerned that the company might be playing fast and loose with its quality assurance policies because it doesn’t want to sacrifice the rapid iteration it’s famous for. With this in mind, I reached out to Facebook late last week to ask about its protocol for deploying code and how the bug made it through in the first place. The company responded to some of my questions, and refused to answer others.
At least, Facebook eventually answered some of my questions. At first, the company sent me a vague statement reiterating that they were investigating the issue, and that they “maintain industry-leading quality assurance and security systems, and the reliability of Facebook is our top priority.”
In response, I reminded the Facebook spokesperson that the company had just sent thousands of messages to people who weren’t meant to receive them, which would seem to indicate that it is not, in fact, on the bleeding edge of online security. I restated my questions, and the company got back to me with the more detailed overview of its QA and code deployment policies found below. It begins with a general statement Facebook provided, followed by more direct answers to my questions.
Facebook hires the most qualified and highly-skilled engineers we can find – most from industry or from top universities. Upon joining the company, every engineer and engineering manager participates in a six-week intensive ‘boot camp’ training. Our code review process is rigorous, and we phase out changes and test them before they go live for real users to detect any potential issues. During code pushes, our engineering, user support, and operations teams work cross-functionally to monitor the state of the push and to identify any problems early. We also have the capability to quickly push code updates to all of our datacenters worldwide, and to enable or disable critical features of the site if there is a problem.
All of these checks worked together on Wednesday, as designed, to limit the impact of the error and stopped it within minutes. We were able to swiftly disable access to the users who received messages and remove those messages from Facebook, although we were unable to prevent email notifications from being sent to affected users. It is important to recognize that no system is perfect and no company avoids mistakes all of the time. However, we would like to take this opportunity to sincerely apologize to all affected users and ensure them that we are committed to investigating Wednesday’s issue and to learning from it.
What are your protocols for pushing code?
We have staged rollout changes that go through multiple phases before going to end users, so we can proactively detect any problems. As the changes get rolled out to users, a set of support, engineering, and operation leaders are actively engaged to monitor the state of the push. As soon as any issue is identified, we have multiple tools to quickly disable critical features. The combination of these mechanisms dramatically limited the exposure related to Wednesday’s issue.
Are there multiple people reviewing all code that gets pushed?
Yes, we have a rigorous code review process and no code goes live on the site unless it has been reviewed and approved by a skilled engineer.
What changes are you making to ensure that this does not happen again?
We cannot discuss specific improvements, but we take privacy and security very seriously and are continually improving our code standards, processes, and systems to help us build high quality products quickly.
When do you expect to conclude your investigation? I will certainly be following up for the details.
As a general practice, we do not comment on investigations like this.
While interesting, none of this is particularly surprising. And because Facebook isn’t commenting on the outcome of the investigation, we’ll probably never find out what caused the bug (or whether company protocol was even followed in this case). But hey, at least they say they’re doing the right things.
It’s worth pointing out that Facebook is by no means the only company affected by such issues. Last year, I wrote a post called the Sorry State of Online Privacy, where I detailed some of the security lapses that had hit Facebook, Twitter, and Google (and of course there’s the recent Google Buzz fiasco). All of these companies would likely claim to have state-of-the-art testing and security measures, yet such problems seem to pop up every few months. I’m aware that it’s impossible to have a fully secure system, but that doesn’t mean engineering teams should treat these problems as inevitabilities. To reiterate what I wrote last year, the word ‘private’ should not mean “this will remain hidden until we accidentally break something”.