Instapaper says today it has fully recovered from an extended outage that took the service down for over a day last week before it returned in a limited capacity. The popular bookmarking service, which had millions of users at the time of its acquisition by Pinterest, went down after hitting a system limit on its AWS-hosted database.
When the company posted about the outage back on February 9th, the service had been down for 31 hours. To get back online more quickly, Instapaper decided to only bring the last six weeks of users’ saved articles back online. That allowed people to continue bookmarking and reading their most recent saves, but didn’t include the service’s much larger archives.
At the time, Instapaper said it believed the restore of its archives could take another week, and promised it would try to get them all online by Friday, February 17th at the latest.
Today, at 1 AM PT, Instapaper was able to completely restore its service, the company said in an email to users.
“We performed the restoration without losing any of your older articles, changes made to more recent articles or articles saved after recovering from the outage,” the email explained.
The company also went into more detail about the outage itself in the email, and published a full postmortem authored by Pinterest product engineer Brian Donohue over on Medium. Here, he explains that the root cause was a data failure caused by a 2 TB file size limit for RDS instances created before April 2014. Instapaper’s “bookmarks” table, where users’ saved articles are stored, hit that limit on Wednesday of last week, causing errors.
Donohue says that the team had no knowledge of the database limit, and there was nothing that would have alerted them to the fact.
“As far as we can tell, there’s no information in the RDS console in the form of monitoring, alerts or logging that would have let us know we were approaching the 2TB file size limit, or that we were subject to it in the first place. Even now, there’s nothing to indicate that our hosted database has a critical issue,” he writes.
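For readers running their own MySQL databases, a check along the following lines could flag tables approaching a file size ceiling. This is a minimal sketch and not anything described in Instapaper's postmortem: the 80 percent threshold, the function name, and the treatment of the limit as 2 TiB are all assumptions, and the sizes MySQL reports in `information_schema.TABLES` are estimates that may differ from the actual file on disk.

```python
# Assumption: the limit is treated as 2 TiB (2 * 1024**4 bytes).
LIMIT_BYTES = 2 * 1024**4

# Query against MySQL's information_schema.TABLES. DATA_LENGTH and
# INDEX_LENGTH are approximate for InnoDB tables. The %s placeholder
# is for the schema name, to be bound by whatever MySQL client is used.
SIZE_QUERY = """
SELECT TABLE_NAME, DATA_LENGTH + INDEX_LENGTH AS approx_bytes
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = %s
"""


def tables_near_limit(rows, threshold=0.8, limit_bytes=LIMIT_BYTES):
    """Return names of tables whose approximate size is at or above
    `threshold` (a fraction) of `limit_bytes`.

    `rows` is an iterable of (table_name, approx_bytes) pairs, for
    example the result set of SIZE_QUERY.
    """
    return [
        name
        for name, approx_bytes in rows
        if approx_bytes / limit_bytes >= threshold
    ]


if __name__ == "__main__":
    # Illustrative numbers only, not Instapaper's real table sizes.
    sample = [("bookmarks", 1_900_000_000_000), ("users", 5_000_000_000)]
    for name in tables_near_limit(sample):
        print(f"warning: table {name} is near the size limit")
```

Run on a schedule and wired to an alert, a check like this would surface the kind of warning Donohue says the RDS console did not provide.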
Instapaper credited the Pinterest Site Reliability Engineering team and the Amazon Relational Database Service team, who both worked with the team over the weekend to speed up the recovery process, allowing them to complete the task ahead of schedule.
Still, the company maintains that the issue itself was “both difficult to predict and prevent, and the nature of the outage is extremely rare and unlikely to recur,” Instapaper’s email to users noted.
This statement should quell some users’ fears that the bookmarking product hasn’t been given the engineering resources and attention it needs since being bought by Pinterest.
In addition, the team has learned a valuable lesson from the outage. It made the company aware that it didn’t have a disaster recovery plan in place for this type of scenario – something it will address going forward. Instapaper is now working on a system that will immediately escalate issues to Pinterest’s Site Reliability Engineering team, and will test its MySQL backups every month, instead of every three months, the company says.