Friday, October 30, 2009

Clear your head

A number of years ago, working at Right Media,  I was struggling to solve the crisis of the moment. I don't recall exactly what was wrong, but I'm guessing we were totally down, not serving ads at all, and I was the one who needed to figure out what was wrong and fix it. In other words, the problem was more ops related than a problem with the code behind the ad servers themselves. At any rate, my boss, Brian (whom I continue to work for at AppNexus, gave me some good advice that has stuck with me to this day. He said, and I probably paraphrase a bit here, "Pete, you don't have a clear head. Go outside, walk around the block, then come back, figure out the right way to tackle this problem, and solve it." He was right. For me, clearing my head and then tackling the problem, despite the urgency of the situation, meant that I solved the problem faster than I would have if I had just sat there banging my head against the problem and letting frustration mount.

Fast forward to today. I've been working on a script that synchronizes data between our Netezza and MySQL. In other words, I have a table in Netezza, and I need to have it replicating to MySQL. Sounds simple, right? The complication is that the MySQL table has an extra field, called last_updated, which defaults to current timestamp. The data that I pull from the Netezza lacks this field. Any data that does not need updating in MySQL should have this field left alone.

My first attempt (and what went into production) was this:

load data local infile '$csv_file' replace into table $mysql_table fields terminated by ',';

This worked just fine, until the Netezza table stopped populating and my script kept reading the same data over and over again and inserting the data (with fresh timestamps) into MySQL. This was not the desired affect. With MySQL, REPLACE INTO deletes rows and does new inserts, and since we weren't specifying the last_updated field, it would auto-populate with the current timestamp. Since the data I was inserting was identical to the data that was there before, the last_updated field should have been left alone.

So today, I was working on a new script that uses an ODBC connection via DBD::ODBC to get the data out of the Netezza and the DBD::MySQL module to connect to the MySQL. I had the code nearly done, and I stopped and looked at it.

And shuddered.

I realized that going by the "hit by the bus" theory, if someone else needed to take over my code, they were, to put it bluntly, screwed. And that didn't feel so good.

So, I took a walk.

When I returned, a much more elegant solution was in my brain, ready to go:

1) Select the data out of MySQL into a hash
2) Select the data out of Netezza
3) If the data isn't already in the hash, put it in the hash
4) If the data is in the hash but isn't identical, replace the value in the hash
5) If the data is in the hash and is identical, delete it
6) Iterate over the hash, creating INSERT INTO ... ON UPDATE ...; statements.

Now, I'm not positive I found the perfect solution OR that there isn't a better way, but my current solution is *much* better than what I had written previously.

When you've got a tricky problem to solve, take a walk.

1 comment:

Greg said...

Good advice. I used to take a walk around the block every day at 4pm and eat an apple to clear my head.