A database migration gone awry caused the outage and poor availability that GitHub customers experienced this week.
The problem stemmed from a database replacement done last month. During that maintenance, the GitHub team replaced its aging pair of DRBD-backed MySQL servers with a 3-node cluster. The new infrastructure is designed so the MySQL database can run on all nodes at all times. This means a failover “simply moves the appropriate virtual IP between nodes after flushing transactions and appropriately changing the read_only MySQL variable.”
Under the old setup, by contrast, failing over from one database to another required a cold start of MySQL.
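The failover sequence quoted above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not GitHub's actual tooling: the `Node` and `Cluster` classes, the node names, and the exact ordering of the steps are hypothetical, and the copy of committed data stands in for whatever replication the real cluster uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one MySQL server in the cluster."""
    name: str
    read_only: bool = True                          # mirrors MySQL's read_only variable
    pending: list = field(default_factory=list)     # transactions still in flight
    committed: list = field(default_factory=list)   # durable transactions

    def flush(self) -> None:
        """Commit every transaction that is still in flight."""
        self.committed.extend(self.pending)
        self.pending.clear()


@dataclass
class Cluster:
    """Hypothetical cluster manager that tracks which node holds the virtual IP."""
    nodes: dict
    vip_owner: str

    def failover(self, target: str) -> None:
        old = self.nodes[self.vip_owner]
        new = self.nodes[target]
        # Step order is an assumption; the article only says the virtual IP
        # moves after the flush and the read_only change.
        old.read_only = True                 # stop accepting writes on the old active node
        old.flush()                          # flush outstanding transactions
        new.committed = list(old.committed)  # assumed stand-in for replication catching up
        new.read_only = False                # let the new active node accept writes
        self.vip_owner = target              # finally, move the virtual IP


# Usage: db1 is active with one transaction in flight, then we fail over to db2.
cluster = Cluster(
    nodes={n: Node(n) for n in ("db1", "db2", "db3")},
    vip_owner="db1",
)
cluster.nodes["db1"].read_only = False
cluster.nodes["db1"].pending.append("INSERT 1")
cluster.failover("db2")
```

Because MySQL is already running on every node in this model, the failover is only a matter of flipping flags and moving the address, which is the contrast with the cold start the old DRBD-backed pair required.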
It sounds simple enough, but as often happens, seemingly insignificant events can set off a series of problems. As a result, GitHub is now taking a closer look at how it handles failovers and how it manages its new cluster environment overall.
Monday's outage stemmed from what GitHub called an innocuous database migration. The migration produced higher-than-expected loads that the GitHub operations team had not previously seen during these sorts of migrations. That led to a cascading series of errors that resulted in the downtime.
On Tuesday, a cluster partition occurred that caused customers to get data from other users’ dashboards. In addition, some repositories created during this window were incorrectly routed. Newland said the company has removed all of the leaked events and performed an audit of all repositories that were incorrectly routed.
Newland said 16 of these repositories were private. For seven minutes they were accessible to people outside of each repository’s list of collaborators or team members. Newland said the owners of all of these repositories were contacted about the problem.
Newland said in summary that three primary events contributed to the downtime of the past few days:
- Several failovers of the “active” database role happened when they shouldn’t have.
- A cluster partition occurred that resulted in incorrect actions being performed by its cluster-management software.
- The failovers triggered by the first two events impacted performance and availability more than they should have.
GitHub’s problems stem from a change to its database stack, and they illustrate an issue facing growing online communities: MySQL simply does not scale very well. When problems occur at the database layer, they can affect the entire organization, its customers, and how the service is perceived.