Many years ago I was managing the server for a rather large gaming community with tens of thousands of users. I was solely in charge of every aspect of it: sysadmin, frontend and backend developer, database admin, network admin, and even email admin. Nobody tends to notice when things are running smoothly (not that my work went unappreciated), but small problems can snowball into really big ones through panic and a lack of planning.

This was the largest project I had ever been in charge of, and I had yet to learn some important lessons, a gap which would result in me inadvertently deleting nearly 30,000 users from the community’s forum. It all started with a problem with the email server: we’d been blocked by Microsoft! That meant thousands of users who had registered with a Microsoft address, anything from Hotmail to Live, could no longer receive emails from us.

I contacted Microsoft, but that went nowhere: they gave a canned answer about spam and suggested enabling SPF, DKIM, and DMARC, all things I had already done to no avail against their merciless spam filter. We just weren’t big enough to warrant their time.

A Series of Unfortunate Events

My initial solution was to block Microsoft addresses from being used to register new accounts, and to display a notice, targeted only at users with Microsoft emails, asking them to change their email address. This was fine for a time, but I constantly had to intervene manually for users who needed to reset their passwords and couldn’t receive the password reset email, so I decided to try something else. I wrote a database query to set every user with a Microsoft email to a special account state requiring email validation!
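
In practice the quick fix was a single statement run against MySQL from the shell, something along these lines; the table, column, state name, and domain list are illustrative, since the real ones depend on the forum software:

# Hypothetical sketch of the quick fix: names vary by forum software.
mysql -u forum_user forum_db -e "
    UPDATE users
       SET account_state = 'awaiting_email_validation'
     WHERE email LIKE '%@hotmail.%'
        OR email LIKE '%@live.%'
        OR email LIKE '%@msn.%';"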

This email validation mode is normally used for newly registered accounts so that they are required to provide a valid email address and prove they own it. With this change, users who logged in would be forced to change their email address instead of ignoring the big notice asking them to do so, and I would have to intervene manually less and less over time. Seems like an easy fix, right?

Very wrong. As it turns out, an unrelated plugin I had installed on the forum did something I hadn’t considered: it had a cron job that would purge “stale” accounts, ones that had been registered but had never confirmed their email address. That job wouldn’t run until the middle of the night, so the ticking time bomb was set to go off when I wasn’t available to deal with it.

By morning, I was alerted to the fact that nearly 30,000 user accounts on the forum had vanished. Panic set in as I tried to discover what had happened, and I eventually traced it to the offending plugin, whose useful yet deadly feature had been triggered by my quick-fix solution. But that’s alright, because I maintained database backups.

From Bad to Worse

Except, not really. What I called a “backup” was in reality a script I’d occasionally run manually to dump the database, and the two most recent backups I had were from a few days earlier and from about a month earlier. Not ideal, but not too terrible. But remember how I said panic and lack of planning can snowball? I ran a command along the lines of this:

mysqldump -u forum_user forum_db > days-old-backup.sql

Oops. A backup plan doesn’t truly exist if it hasn’t been tested, and I had just blown it severely with a lapse in judgment and a small typo: I overwrote the only copy of the most recent backup. The other backup was nearly a month old! But there was no choice; I had to load it up to recover the deleted accounts. This time I made a copy of the backup file and paid very close attention to the import command before running it. The database was imported and, while the site had effectively been rolled back by nearly a month, everything up to that point had been successfully recovered. It could have been worse, but it was still a huge blow to my ego.
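
For the record, the careful second attempt amounted to little more than this (the filenames are illustrative): work on a copy, and make sure the command reads the dump into the database rather than writing over the file.

# Restore from a working copy so the original backup file is never touched.
cp month-old-backup.sql month-old-backup.working.sql
mysql -u forum_user forum_db < month-old-backup.working.sql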

I had considered splicing the month-old backup into the current database in order to recover the accounts, but there was no telling what damage such a fix might have caused, and I didn’t have time to figure it out. I also could have tried splicing the current database into the old database to recover posts, but the problem of identifying potential conflicts remained.

Lessons Learned

This series of unfortunate events led me to realise that I had become complacent with how things were running and had never considered how they could be improved. Backups needed to be automated, and the process of recovering from a backup needed to be tested. Further, the other administrators needed to take on responsibilities to increase the bus factor. The solution I ended up with was a script on the server that maintained yearly, monthly, and daily deduplicated backups.

This meant that on any future occasion where something went terribly and unexpectedly wrong, I could restore the server to a point no more than 24 hours in the past, and ideally, if it became necessary, the other administrators could do so themselves without worrying about destroying anything irrecoverably in the process.
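
The original script is long gone, but it is easy enough to reconstruct in spirit. The sketch below is from memory, with illustrative paths and retention periods rather than the exact implementation; the “deduplication” here is simply hard-linking the first dump of each month and year instead of storing another copy.

#!/bin/bash
# Sketch of a daily backup job with daily/monthly/yearly retention.
# Paths, database names, and retention periods are illustrative.
set -euo pipefail

BACKUP_ROOT=/var/backups/forum
mkdir -p "$BACKUP_ROOT/daily" "$BACKUP_ROOT/monthly" "$BACKUP_ROOT/yearly"

STAMP=$(date +%F)   # e.g. 2014-02-09

# Dump to a temporary file first so a failed dump can never clobber a good backup.
TMP="$BACKUP_ROOT/daily/.$STAMP.sql.gz.tmp"
mysqldump --single-transaction -u forum_user forum_db | gzip > "$TMP"
mv "$TMP" "$BACKUP_ROOT/daily/$STAMP.sql.gz"

# "Deduplicate" by hard-linking the first dump of each month/year instead of copying it.
if [ "$(date +%d)" = "01" ]; then
    ln -f "$BACKUP_ROOT/daily/$STAMP.sql.gz" "$BACKUP_ROOT/monthly/$STAMP.sql.gz"
fi
if [ "$(date +%j)" = "001" ]; then
    ln -f "$BACKUP_ROOT/daily/$STAMP.sql.gz" "$BACKUP_ROOT/yearly/$STAMP.sql.gz"
fi

# Expire daily dumps after 30 days; the monthly and yearly links are kept.
find "$BACKUP_ROOT/daily" -name '*.sql.gz' -mtime +30 -delete

Run from cron once a day, something like this keeps a month of daily restore points without paying the storage cost of copying every promoted backup.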

I also attempted, but failed, to increase the bus factor. The most recently generated backups were made available to each administrator over HTTPS using client certificate authentication, and the backups themselves were encrypted and authenticated with GPG using each administrator’s public key. So while the bus factor had technically been increased, in practice it had not: such an overly complicated system was unworkable for the other admins, and I hadn’t taken any measures to make it easier for them.
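
The GPG side of that scheme looked roughly like the following; the key IDs are placeholders and the invocation is reconstructed from the design rather than the original command. Signing provides the authentication, while encrypting to every administrator’s public key lets each of them decrypt the backup.

# Sign for authenticity and encrypt to every administrator's key (placeholder IDs).
gpg --sign --encrypt \
    --recipient admin-one@example.org \
    --recipient admin-two@example.org \
    --output latest-backup.sql.gz.gpg \
    latest-backup.sql.gz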

In Conclusion

Have a backup plan, test the backup plan, and increase the bus factor. Avoid single points of failure. Contemplate why fire drills exist.