Jun 14, 2012

Working hard versus working smart

It isn’t every day that I get one of those moments of clarity. (Actually, if you’re a recruiter/interviewer reading this, that’s just narrative hyperbole! I get epiphanies, like, hourly. Yeah. Hire me.)

I’ve been tasked lately with some rather… interesting… problems at work. Really, the circumstances around them drive me nuts and will have me bald from ripping my hair out, but I can’t and won’t get into that here. I’m having to generate a lot of records to be imported into another system as a sort of systems integration thing. Due to the fundamental differences between the two systems’ database structures, it involves a lot of redundancy.

It also involves a lot of Cartesian products. What we would normally implement as a pair (or more) of drop-down lists to pick from, we’re instead having to store as related elements. Allow me to illustrate with a contrived example.

Who                With           Where
Miss Scarlett      Candlestick    Kitchen
Colonel Mustard    Knife          Ballroom
Mrs. White         Lead Pipe      Conservatory
Mr. Green          Revolver       Dining Room
Mrs. Peacock       Rope           Lounge
Professor Plum     Wrench         Hall
                                  Study
                                  Library
                                  Billiard Room

Yep, it’s Clue (or Cluedo for the rest of the world). In the game you can end up with any possible combination of murderer, weapon, and location. Really, you just need to pick one of each, all three are required, and they need to be real choices (no choosing Mr. T with the Gold Chain in the Weight Room). Hence the three drop-down lists, which provide a familiar widget and are inherently bounded to valid values.

But again, we’re having to store them as related chains, with each choice foreign keying onto the next. Choice 1 has a one→many relationship with choice 2, and choice 2 a one→many relationship with choice 3. Hence the Cartesian products:

-- Every suspect, paired with every weapon, in every location
SELECT s.Name, w.Name, l.Name
FROM Cluedo.dbo.Suspects AS s
CROSS JOIN Cluedo.dbo.Weapons AS w
CROSS JOIN Cluedo.dbo.Locations AS l

In this contrived example, that’s 6 * 6 * 9 = 324 combinations. So 21 records across three tables get expanded into 324 records for the import. You can see how this might quickly become an unbelievably huge file!

It wouldn’t be so bad if these were actually related things! Imagine the classic car picker: Make, Model, Trim. You might still have a cleverer way to store them, but a simple Model table with a MakeId foreign key to the Make table, and so on, wouldn’t get you any verbal beatings.
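Something like this minimal sketch, assuming SQL Server-flavored DDL (the table and column names here are purely for illustration, not from any real system):

CREATE TABLE dbo.Make (
    MakeId INT IDENTITY PRIMARY KEY,
    Name   NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Model (
    ModelId INT IDENTITY PRIMARY KEY,
    MakeId  INT NOT NULL REFERENCES dbo.Make (MakeId),   -- each Model belongs to exactly one Make
    Name    NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Trim (
    TrimId  INT IDENTITY PRIMARY KEY,
    ModelId INT NOT NULL REFERENCES dbo.Model (ModelId), -- each Trim belongs to exactly one Model
    Name    NVARCHAR(50) NOT NULL
);

Make→Model→Trim is a genuine hierarchy, so the foreign keys carry real meaning. Suspects, weapons, and rooms have no such relationship, which is why chaining them feels so wrong.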

Then came part two. After the initial load we would need to generate, daily, a diff import, deactivating no-longer-valid combinations and adding any new ones. Somebody renaming Mr. Green to Reverend Green would incur two sets of changes in the diff: first deactivating all of the Mr. Green combos, and second adding all of the Reverend Green combos.

At first, I admit, I was terrified at the prospect of having to write the logic that would track the changes, compare them, and figure out the right new records to make as a result. We started by assuming the first part could be handled by SQL Server’s Change Data Capture feature. But that still left the logic to query the changes out of CDC and make sense of them all. I quickly determined that this would not only take a lot of time but would be a maintenance nightmare. And the import was for a new system that was very important. Like, paying-salaries important. Yeah. Can’t have it not working. Or wrong. And did I mention it needed to be ready ASAP?
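To be fair, turning CDC on is the easy part. A rough sketch, using my contrived Cluedo tables rather than the real ones, might look like this; the hard part, interpreting the captured changes across three chained tables and turning them into a correct diff, is exactly what isn’t shown here:

-- Enable CDC for the database and for one of the source tables
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Suspects',
    @role_name     = NULL;

-- Then, each day, pull the raw changes for some LSN range and try to make sense of them
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Suspects');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Suspects(@from_lsn, @to_lsn, N'all');

Multiply that by every table in the chain, then add the logic to turn low-level inserts, updates, and deletes into deactivated and newly added combinations, and you can see why I got cold feet.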

So I paused and took a step back. What if I stopped approaching it so head-on? As developers, we can sometimes fail to see the forest for the trees. It is a very detail-oriented thing we do, and it helps to sometimes take a moment and get some perspective.

I took note of a few things. First and foremost, the only thing that mattered was that I generate an accurate diff file. How I produced it was unimportant to the needs of the new system, only that it got done daily and correctly. Of course, my choice of how would greatly influence the time spent on it and the likelihood of problems.

My initial choice to use CDC and hand-write a lot of the logic put a lot of the onus on the developer. It required me to know a lot of things, to write the logic carefully and correctly, and so on. It occurred to me that I should shift as much of the onus of being correct as possible onto the computer.

Because computers are really, really good at crunching numbers!

They are great at devouring huge datasets and spitting out results procedurally. And since a computer is really just an oversized calculator, it will always do the math correctly, every time.

In the end, I solved it this way: generate everything again, from scratch, and then do some “simple” set differences (think WHERE NOT IN). I say simple because they are a basic concept in set theory. I say “simple” because they would be differences involving millions of rows against millions of rows. But, hey, remember: computers are really good at tackling huge datasets! They’re such good sports about it, too, never complain…
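Roughly, the daily job boils down to something like the sketch below, written against my contrived Cluedo tables rather than the real ones (and using EXCEPT instead of a literal NOT IN, since the key spans three columns):

-- Regenerate today's full set of valid combinations, from scratch
SELECT s.Name AS Who, w.Name AS Weapon, l.Name AS Room
INTO   #Today
FROM   Cluedo.dbo.Suspects AS s
CROSS JOIN Cluedo.dbo.Weapons AS w
CROSS JOIN Cluedo.dbo.Locations AS l;

-- (#Yesterday stands in for yesterday's saved snapshot of the same query)

-- Combinations that existed yesterday but not today: deactivate these in the diff
SELECT Who, Weapon, Room FROM #Yesterday
EXCEPT
SELECT Who, Weapon, Room FROM #Today;

-- Combinations that exist today but not yesterday: add these in the diff
SELECT Who, Weapon, Room FROM #Today
EXCEPT
SELECT Who, Weapon, Room FROM #Yesterday;

In the sketch, today’s set would then get saved to play the part of #Yesterday on tomorrow’s run.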

Turns out that doing the whole process over, however excessive it is on memory and file space to store everything and mash it all together in those big in-memory Venn diagrams, always gives the correct answer. And it’s really, really simple for me to write. Just a few SQL statements! Suddenly, the onus on me, the developer, is practically non-existent, and the onus on the computer to churn out an answer and do a lot of thinking (which it does best!) is great.

And that, my friends, is an example of working hard versus working smart.
