Friday, October 23, 2009

Making my own bed and lying in it

I broke my own rules.

The rules I broke:

  1. Don't do anything twice if you can help it.
  2. Don't take the shortcuts that make it impossible to follow rule #1.

Part of my job is running firedrills on various applications in our production environment. Our production environment is full of redundancy, so in theory, when an application goes down, other instances of the same application pick up the slack, giving us time to recover. A typical firedrill goes like this:

  1. Submit a ticket to RT with the subject: [FIREDRILL] Production API is down
  2. Shut down an API instance
  3. Don't panic
  4. Bring up a new API instance via a nice web GUI
  5. Add the API to a load balancer pool
  6. Make sure our alerting system picks up the new instance and shows green
  7. Close out the ticket

Easy, right? I keep a list on our internal wiki of all of our applications, noting whether each one is ready to be fully firedrilled and when it was last firedrilled.
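
To make the shape of a firedrill concrete, here is a minimal sketch of the checklist above as an ordered list of steps in Python. Everything in it is hypothetical: `run_firedrill`, `placeholder`, and `API_FIREDRILL` are names invented for illustration, and each action is a stand-in for what would really be a call to RT, the web GUI, the load balancer, or the alerting system. "Don't panic" is left out because it is hard to automate.

```python
from typing import Callable, List, Tuple

# Each step is (description, action). The actions are placeholders:
# a real drill would talk to the ticketing system, the instance GUI,
# the load balancer, and the alerting system.
Step = Tuple[str, Callable[[], bool]]


def run_firedrill(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first one that fails."""
    log = []
    for number, (description, action) in enumerate(steps, start=1):
        ok = action()
        log.append(f"{number}. {description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break
    return log


def placeholder() -> bool:
    # Hypothetical stand-in for a real action; always succeeds.
    return True


API_FIREDRILL: List[Step] = [
    ("Submit [FIREDRILL] ticket to RT", placeholder),
    ("Shut down an API instance", placeholder),
    ("Bring up a new API instance", placeholder),
    ("Add the new instance to the load balancer pool", placeholder),
    ("Confirm alerting shows the new instance green", placeholder),
    ("Close out the ticket", placeholder),
]

if __name__ == "__main__":
    for line in run_firedrill(API_FIREDRILL):
        print(line)
```

Stopping at the first failed step is a design choice in this sketch: if the new instance never comes up, there is no point adding it to the load balancer pool.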

The reality is that not every application is ready for such highly automated recovery. Any application that is not firedrill-worthy goes on the scary list and needs to be addressed as soon as possible.
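
The wiki list and the scary list could be modeled roughly like this. It is only a sketch: the real list is a wiki page, and the `AppStatus` fields, the application names other than the admin node, and the dates are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class AppStatus:
    name: str
    firedrill_ready: bool
    last_firedrilled: Optional[date] = None  # None means never drilled


def scary_list(apps: List[AppStatus]) -> List[str]:
    """Names of applications that are not yet firedrill-worthy."""
    return [app.name for app in apps if not app.firedrill_ready]


def next_to_drill(apps: List[AppStatus]) -> List[str]:
    """Ready applications, never-drilled first, then oldest drill first."""
    ready = [app for app in apps if app.firedrill_ready]
    ready.sort(
        key=lambda app: (
            app.last_firedrilled is not None,
            app.last_firedrilled or date.min,
        )
    )
    return [app.name for app in ready]


# Hypothetical entries; names and dates are illustrative only.
APPS = [
    AppStatus("api", firedrill_ready=True, last_firedrilled=date(2009, 10, 1)),
    AppStatus("web", firedrill_ready=True),
    AppStatus("admin-node", firedrill_ready=False),
]

if __name__ == "__main__":
    print("Scary list:", scary_list(APPS))
    print("Drill next:", next_to_drill(APPS))
```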

Our admin node runs Puppet, builds RPMs, and has a wonderful GUI called Maestro built by one of our in-house PHP gurus. Maestro lets our engineers leverage Puppet to do their code deployments. Puppet is itself an amazing tool, and the GUI we have sitting on top of it is a big enabler for our engineers. We currently run only one admin node, because we can tolerate a bit of downtime if it goes down.

When I built out the first iteration of the admin node, things started out well. But soon I took shortcuts. As in, the admin node was never designed to be deployed via Puppet and Maestro at all. Whoops. That puts it at #1 on the scary list. If the admin node goes completely belly up, it's non-trivial to recover it. So for the past few days, I've been building Puppet recipes and bootstrapping a new admin node.

Now, one might ask, why don't I take a snapshot of the image and call it done? Well, the admin node changes. A lot. And some of those changes have been hacks. By me. So a snapshot of the image might be somewhat helpful, but it would still leave me with a lot of work to do.

But if I build out the right recipes, make sure that all of the moving parts are either in RPMs or pulled out of our version control system, and make sure that I never hack on the admin node again, disaster recovery becomes simple. Boot up a new instance, get a few packages and files installed, tell the node to build itself out, and I'm exactly where I was before the node went down.
