Tech chOps

Sunday, September 25, 2011

DevOps in Milliseconds

Another post about DevOps at AppNexus. Read it, love it, share it.

http://techblog.appnexus.com/2011/devops-in-milliseconds/

Tuesday, July 26, 2011

Metrics at AppNexus

I wrote a blog entry for AppNexus about our usage of Graphite for our metrics system. Read it a million times, and I just might win an iPad.

http://techblog.appnexus.com/2011/metrics/

Thursday, June 2, 2011

Vagrant for repeatable development environments

"Vagrant is a tool for building and distributing virtualized development environments."

I stumbled upon Vagrant last night while looking at buildbot and Jenkins.

Vagrant leverages ~~Sun's~~ Oracle's VirtualBox, along with either Opscode's Chef or Puppet Labs' Puppet, to let you easily create and provision virtual machines.

I ran through their Getting Started with Vagrant documentation, and everything worked beautifully. I'm looking forward to poking at it some more.

I think that Vagrant gives developers and ops two huge advantages:

1) Getting a new environment up and running for onboarding new developers is now a snap.

2) By deploying a development environment in the same way as other (sand, staging, prod) environments are deployed, you minimize (or eliminate) differences between environments, and therefore keep the wall between development and ops obliterated. No "oh, but it works in my dev environment" finger-pointing allowed!

Saturday, November 7, 2009

Sometimes it's the little things in life

Here at AppNexus we use Subversion for our version control system. We use a standard layout for our repositories:

trunk/

branches/

tags/

We check stuff into trunk, and when we release, we "tag" trunk/ by making a versioned copy of it in the tags/ directory. This code never gets touched again; that way, once the code is compiled and released, if we find a bug in version 0.39, we can check out tags/0.39 to track it down. Once the bug is found, the fix is merged into trunk and a new release is generated.

In order to commit a tag, the manual process looks like this:

Figure out what the latest version in the tags/ directory is so that we can increment it for the next version. Suppose we've committed 0.38 and we want to create 0.39.

svn copy https://our.svn.repository/product/trunk \
    https://our.svn.repository/product/tags/0.39

svn commit -m "Release version 0.39"

Because our release cycle is so fast, we're doing this many times over the course of the week.

Because efficiency is a good thing, we really don't want to do this over and over again by hand.

So I created an easy little tool in Perl called tagger. It figures out what repository we're currently in, does a lookup in the tags/ directory, figures out what the most likely candidate is for the next version, and then does the svn copy and svn commit for us, prompting us along the way. We can also throw tagger a flag and have it do no prompting at all.

What I particularly like about tagger is that it allows me and our developers to forget all of the steps that go into tagging a release. I run tagger, hit enter a few times, and I'm done. As an added bonus, if I ever want to see the commands tagger is executing on my behalf, I can throw it into verbose or debug mode and it will show me what it is doing or would do.
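For the curious, here is a minimal sketch of the "figure out the next version" step. It is written in Python rather than Perl, and it is not tagger's actual code: the function names, the assumption that tags are plain dotted integers like 0.38, and the "0.1" starting point are all mine for illustration. It only builds the command; it doesn't run anything. (A URL-to-URL svn copy accepts the log message directly via -m, which is what the sketch uses.)

```python
def parse_version(tag):
    """Turn '0.38' or '0.38/' into (0, 38); return None for anything else."""
    parts = tag.strip("/").split(".")
    if not all(p.isdigit() for p in parts):
        return None
    return tuple(int(p) for p in parts)


def next_tag(existing_tags):
    """Pick the highest existing version and bump its last component."""
    versions = [v for v in (parse_version(t) for t in existing_tags) if v is not None]
    if not versions:
        return "0.1"  # hypothetical starting point when tags/ is empty
    latest = max(versions)  # tuples compare numerically, so 0.10 > 0.9
    bumped = latest[:-1] + (latest[-1] + 1,)
    return ".".join(str(n) for n in bumped)


def tag_command(repo_url, existing_tags):
    """Build (but do not run) the svn command that would create the next tag."""
    new = next_tag(existing_tags)
    return [
        "svn", "copy",
        f"{repo_url}/trunk",
        f"{repo_url}/tags/{new}",
        "-m", f"Release version {new}",
    ]
```

Feeding it the output of a listing of tags/ (one name per line) and printing the resulting command is roughly what a "show me what you would do" mode amounts to.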

Spend a little time seeing what tasks you and your developers are doing over and over again. Then see if a little scripting can ease the pain.

Friday, October 30, 2009

Clear your head

A number of years ago, working at Right Media, I was struggling to solve the crisis of the moment. I don't recall exactly what was wrong, but I'm guessing we were totally down, not serving ads at all, and I was the one who needed to figure out what was wrong and fix it. In other words, the problem was more ops-related than a problem with the code behind the ad servers themselves. At any rate, my boss, Brian (whom I continue to work for at AppNexus), gave me some good advice that has stuck with me to this day. He said, and I probably paraphrase a bit here, "Pete, you don't have a clear head. Go outside, walk around the block, then come back, figure out the right way to tackle this problem, and solve it." He was right. For me, clearing my head and then tackling the problem, despite the urgency of the situation, meant that I solved the problem faster than I would have if I had just sat there banging my head against the problem and letting frustration mount.

Fast forward to today. I've been working on a script that synchronizes data between our Netezza and MySQL. In other words, I have a table in Netezza, and I need to have it replicating to MySQL. Sounds simple, right? The complication is that the MySQL table has an extra field, called last_updated, which defaults to current timestamp. The data that I pull from the Netezza lacks this field. Any data that does not need updating in MySQL should have this field left alone.

My first attempt (and what went into production) was this:

load data local infile '$csv_file' replace into table $mysql_table fields terminated by ',';

This worked just fine, until the Netezza table stopped populating and my script kept reading the same data over and over again and inserting the data (with fresh timestamps) into MySQL. This was not the desired effect. With MySQL, REPLACE INTO deletes rows and does new inserts, and since we weren't specifying the last_updated field, it would auto-populate with the current timestamp. Since the data I was inserting was identical to the data that was there before, the last_updated field should have been left alone.

So today, I was working on a new script that uses an ODBC connection via DBD::ODBC to get the data out of Netezza and the DBD::mysql module to connect to MySQL. I had the code nearly done, and I stopped and looked at it.

And shuddered.

I realized that going by the "hit by the bus" theory, if someone else needed to take over my code, they were, to put it bluntly, screwed. And that didn't feel so good.

So, I took a walk.

When I returned, a much more elegant solution was in my brain, ready to go:

1) Select the data out of MySQL into a hash
2) Select the data out of Netezza
3) If the data isn't already in the hash, put it in the hash
4) If the data is in the hash but isn't identical, replace the value in the hash
5) If the data is in the hash and is identical, delete it
6) Iterate over the hash, creating INSERT INTO ... ON DUPLICATE KEY UPDATE ...; statements.
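
The steps above can be sketched roughly like this. It's Python rather than the Perl I actually wrote, and it's a simplified variant: instead of mutating one hash, it compares two dicts keyed by primary key and keeps only the rows that are new or changed. The function names, the table layout, and the %s placeholder style are assumptions for illustration, not the production script.

```python
def rows_to_upsert(mysql_rows, netezza_rows):
    """Return the Netezza rows that are missing from MySQL or differ from it.

    Both arguments map a primary key to a tuple of the remaining column
    values (last_updated excluded, since Netezza does not have it).
    """
    changed = {}
    for key, row in netezza_rows.items():
        # .get() returns None for a missing key, which never equals a tuple,
        # so new rows and modified rows both land in `changed`.
        if mysql_rows.get(key) != row:
            changed[key] = row
    return changed


def upsert_statement(table, key_column, columns):
    """Build a parameterized MySQL upsert for the given (hypothetical) table."""
    all_columns = [key_column] + list(columns)
    placeholders = ", ".join(["%s"] * len(all_columns))
    updates = ", ".join(f"{c} = VALUES({c})" for c in columns)
    return (
        f"INSERT INTO {table} ({', '.join(all_columns)}) "
        f"VALUES ({placeholders}) ON DUPLICATE KEY UPDATE {updates}"
    )
```

Because identical rows never reach the database, their last_updated values are left alone, which is the whole point of the exercise.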

Now, I'm not positive I found the perfect solution OR that there isn't a better way, but my current solution is *much* better than what I had written previously.

When you've got a tricky problem to solve, take a walk.

Friday, October 23, 2009

Making my own bed and lying in it

I broke my own rules.

The rules I broke:

1) Don't do anything twice if you can help it.
2) Don't take the shortcuts that make it impossible to follow rule #1.

Part of what I do involves doing firedrills on various applications in our production environment. Note that our production environment is full of redundancy, so in theory, when an application goes down, other like-minded applications pick up the slack, giving us time to recover. A typical firedrill goes like this:

Submit a ticket to RT with the subject: [FIREDRILL] Production API is down
Shut down an API instance
Don't panic
Bring up a new API instance via a nice web GUI
Add the API to a load balancer pool
Make sure our alerting system picks up the new instance and shows green
Close out the ticket

Easy, right? I have a list on our internal wiki of all of our applications noting if a particular application is ready to be fully firedrilled and when each has last been firedrilled.

The reality is that not every single application is fully ready for such highly automated recovery. Any application that is not firedrill worthy goes on the scary list, and needs addressing as soon as possible.

Our admin node runs Puppet, builds out RPMs, and has a wonderful GUI called Maestro built by one of our in-house PHP gurus. Maestro lets our engineers leverage Puppet to do their code deployments. Puppet is itself an amazing tool, and the GUI we have sitting on top of it is a big enabler for our engineers. We only run one admin node currently, because we can handle a bit of downtime if the admin node goes down.

When I built out the first iteration of the admin node, things started out well. But soon I took shortcuts. As in, the admin node was never designed to be deployed via Puppet and Maestro at all. Whoops. This goes #1 on the scary list. If the admin node goes completely belly up, it's non-trivial to recover it. So for the past few days, I've been working on building puppet recipes and bootstrapping a new admin node. Now, one might ask, why don't I take a snapshot of the image and call it done? Well, the admin node changes. A lot. And some of those changes have been hacks. By me. So a snapshot of the image might be somewhat helpful, but still doesn't prevent me from having to do a lot of work. But if I build out the right recipes, and make sure that all of the moving parts are either in RPMs or pulled out of our version control system, and make sure that I never hack on the admin node again, disaster recovery becomes simple. Boot up a new instance, get a few packages and files installed, and then tell the node to build itself out, and I'm exactly where I was before the node went down.

Caveat Emptor

I work for AppNexus. It's an enterprise-level cloud computing company, and I'm the tech ops guy. When I try to explain what I do for a living, the answer changes depending on the audience. Here are some possible explanations:

I'm in IT. I work with computers.
I make lots of computers do "things".
I'm the director of glue!
My job is to automate myself out of my job.
I use and write software to deploy and run other software, monitor applications, and collect metrics from those applications for analysis.

Oh, yes. It's true. I work with computers on a daily basis. In fact, not a single work day goes by that doesn't necessitate me using my computer. That day will come when we all go on that corporate retreat and do zip-lines in the woods, I suspect. I'm not holding my breath.

Let me go a bit deeper, for those of you reading the last explanation and bobbing your heads ever so slightly. I'm in charge of technical operations from the application level on up. AppNexus has systems administrators who deal with the hardware layer and OS layer, although unsurprisingly there can be a fuzzy area at the OS layer in the handoff between the systems administrators and me. They deal with things like datacenter layout, racks, servers, networking, Xen kernels, et cetera. Bless 'em. It's not an easy job, and it's not my area of expertise. Once the hardware is available via AppNexus's sweet cloud APIs, I get to play.

Taking advantage of OSS, I'm building on the shoulders of giants to create a development and production environment that is as automated as possible. When you've got more than a small handful of machines to take care of, you simply don't want to do it by hand. When you've got applications that can never be down, you can't take them all down at once to upgrade them. The software I use (along with a plethora of code that I've written) lets our company check code out of Subversion, package it up, and deploy it to our environments. The software handles versioned configuration, versioned applications, rollbacks, et cetera. It makes sure that each machine is monitored properly.

The novice reader may wonder why one way I explain my job is to say that my job is to automate myself out of a job. Seems like a dumb way to work, considering the economy, perhaps? The reality is that the work will never be done. I just don't want to do the same thing over and over again. I want to turn the things that I do into a commodity that the rest of the company takes for granted because it's reliable and makes their life easier.

I'm leveraging OSS to do what I do. I'd like to mention a few projects in particular, without which my job would be so much more difficult. To everyone contributing to the Open Source community, thank you.

Perl - My coding is almost exclusively in Perl, with a smattering of PHP.
Puppet - I use this to ensure state on all of my machines. Puppet handles our software rollouts, as well.
Nagios - I use Nagios to monitor our software stack and send out alerts when there's trouble.
Ganglia - I use this to collect metrics centrally.
Graphite - While Ganglia is great for collecting metrics, I've found Graphite to have a wonderful interface for displaying those metrics.

I hope to share insights about my world to help others that are doing similar things. Do note, however, that the opinions and suggestions expressed herein are my own and are not endorsed by AppNexus. In other words, blame me, not them, if there's anything here that causes your wonder-app to crash. However, I hope that everything you read here falls under the category of "best practices". I won't suggest you run a magnet over your hard drives to align the bits for higher performance. And for crying out loud, I won't suggest that you go and re-invent the wheel when you're doing similar tasks. Enjoy.