
A Tough Engineering Decision

Here’s the scene: It’s 1:30 PM.  In 30 minutes the CEO of your company starts a conference call with analysts to announce quarterly earnings.  PR told you he is going to tell the Wall Street analysts how cool your team’s website is.  It is quite a success — in 18 months it has rocketed from non-existence to the world’s fourth most popular site in a very competitive industry.  Sounds great to get some recognition, right?  Only problem is, today your site’s kinda broken.

The night before, a database upgrade got confused halfway through, with no way to roll back.  One of the two production databases got upgraded to the new schema and the other didn’t.  As you’d spent most of the day diagnosing, the new schema didn’t quite work with your app — some fraction of the pages generated from that database came out wrong.  Busted.  Missing.  Scrambled.  Paper white.  Ugh.

After hours of group futzing between you and a couple dozen other folks, you’ve managed to get the problem mitigated.  Your app now appears to be reliably generating correct, non-borked pages.  But the site the world sees is still messed up, because of your content distribution network (CDN) partner.  The CDN caches copies of your site across the world, moving it closer to customers for faster display and reducing the load on your own app servers.  But over the course of the day, the CDN has cached copies of many broken pages.  You can of course clear the cache for any individual broken page you find, causing the CDN to fetch a clean, accurate copy from your app servers.  But the site has millions of pages — how are you ever going to find all the ones that need flushing?  With 30 minutes until press time, it’s just not possible.
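
For readers who haven’t worked with a CDN before, a per-page purge is usually just an HTTP request aimed at the cached URL.  Here’s a minimal sketch in Python, assuming a CDN that honors the PURGE method on the URL itself; the exact API varies by vendor, and the URLs below are placeholders, not the site in this story:

```python
# Minimal sketch: evict individual pages from a CDN cache by sending an
# HTTP PURGE request per URL.  Assumes (hypothetically) that the CDN
# accepts PURGE on the cached URL itself; real vendors differ.
import urllib.request

# Placeholder list -- in reality this is exactly what you don't have.
BROKEN_PAGES = [
    "https://www.example.com/listings/12345",
    "https://www.example.com/listings/67890",
]

def purge(url: str) -> int:
    """Ask the CDN edge to drop its cached copy of one URL."""
    req = urllib.request.Request(url, method="PURGE")
    with urllib.request.urlopen(req) as resp:
        return resp.status  # typically 200/204 when the object is evicted

if __name__ == "__main__":
    for page in BROKEN_PAGES:
        print(page, purge(page))
```

The catch is that a script like this needs a list of the broken URLs to feed it, and that list is exactly what you don’t have with 30 minutes on the clock.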

The only reliable way to clear all the broken pages out of the cache is to wipe clean the whole CDN cache.  Push the big reset button.  This is a fairly big deal because it means millions of cached pages will have to be wiped from the CDN and fetched from the app servers again.  Is there time before the prying eyes of Wall Street come looking?  Clearing the caches takes about 15 minutes.  Filling them back up again — who knows.  The popular stuff will fill in fast, but the long tail will probably take a while.

To make it worse, clearing those caches will mean a big increase in traffic to the app servers.  You’ve hit the button before during code releases.  But always very late at night when traffic is light.  Early afternoon is about as high as traffic gets.  These systems are not the most stable in the world right now — you’re not sure they’ll survive a cache clear in the middle of the afternoon.  Any web site will slow down under lots of traffic.  But too much traffic and these systems crash.  Break.  Stop working at all.  And often they won’t get back up without a lot of help.  Sometimes such crashes ripple back through dependent systems and it takes hours to figure out what’s happened.  They might even take the whole company off-line for a while, and that’s always fun to explain to the execs afterwards.
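
The usual guard against that kind of stampede, and what the first comment below calls tuned connection and thread pools, is to cap how many cache-miss regenerations can hit the origin at once.  A rough sketch of the idea in Python; the cap and the render function are made-up placeholders, not the actual system’s configuration:

```python
# Rough sketch: bound how many page-regeneration requests may hit the
# app/database tier at once, so a cold cache produces a queue instead of
# a crash.  Numbers and names here are illustrative, not real settings.
from concurrent.futures import ThreadPoolExecutor

MAX_ORIGIN_CONCURRENCY = 50  # hypothetical cap, sized to what the DB tier can survive

def render_page(path: str) -> str:
    """Stand-in for the expensive app-server render of one page on a cache miss."""
    return f"<html>rendered {path}</html>"

def repopulate(paths):
    """Warm the cache back up without exceeding the origin's concurrency budget."""
    # A bounded worker pool means excess cache misses wait in line rather
    # than piling more connections onto an already-stressed database.
    with ThreadPoolExecutor(max_workers=MAX_ORIGIN_CONCURRENCY) as pool:
        return list(pool.map(render_page, paths))
```

With a bound like that in place, the failure mode of a cold cache shifts from a database tier falling over to a site that simply runs slowly while the backlog drains.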

This is the risk of hitting the big button and clearing the caches.  Best case is the site runs slowly for a while as the caches repopulate.  Worst case, the whole system goes completely south while the analysts are checking it out.  Alternately you could just leave the site in its somewhat-broken but mostly working state for the analysts to look at.

So, what do you do?

A friend from college pointed out to me that engineers get paid for their judgment.  Doing rote calculations doesn’t demand a high salary.  Using your experience and opinion to weigh alternatives does.  Considering the relative merits of trade-offs, especially when the stakes are high — that’s where you really need somebody who is wise and experienced.

I have to digress for a moment to consider what’s really going on here when I say "the stakes are high."  In this industry, a big stupid mistake, mucking with live running machinery you shouldn’t be touching, means thousands of people don’t get their web page for a while.  Compare this to a friend who makes cheese for a living, who mucked around with live running machinery and got badly hurt.  A mistake on the production web servers could potentially have destroyed millions of dollars of abstract shareholder value.  But nobody was going to get their arm ripped off.  (Warning — these pictures are really gross.)  Anyway…

So what did I do when faced with this dilemma recently?  Me?  I went for it — I hit the button.  And everything was fine.  For a while the site was really slow while the caches refreshed.  Many CPUs were pegged from our app tier back through the databases that the whole company relies on.  But nothing broke.  And when pages finally loaded they looked good.  After about an hour, everything was back to normal.  Most everybody never noticed a thing. 

Just another exciting, adventurous, yet entirely unglamorous day in the life of a software engineer.

  1. leodirac says:

    Yep. Based on just what's shared here I probably would have held off. But my experience with the whole system gave me enough confidence that it would work out that I thought it worth the risk. I knew the CDN's cache efficiency hadn't been super high, and I had faith in the robustness of the database I knew we'd be pummeling hardest, especially since we'd tuned the connection and thread pools to that one to keep the system from swamping itself under heavy load. Like I say it's a judgment call.

  2. Ramez Naam says:

    Great story. I think I would have not pressed the button, on the grounds of choosing the devil I knew over the one I didn't (the potential takedown of the whole site).

    Glad it worked out!
