Monday, September 7, 2009

Gmail's Down: When Redundancy is Not Enough. What We Can Learn About Fault Tolerance

Last week, Gmail failed for over 30 minutes. Some really smart people keep that service running for millions of users. What happened?!? Here's the story and the lesson we can take away about fault tolerance, business continuity and disaster recovery.

"...we had slightly underestimated the load which some recent changes placed on the request routers,” Ben Treynor, Google site reliability Czar wrote “At about 12:30 p.m. a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we’re too slow!.' This transferred the load onto the remaining request routers, causing a few more of them also to become overloaded, and within minutes nearly all of the request routers were overloaded.”

This is an interesting case study. There was no "single point of failure" here, and eliminating single points of failure is what many enterprises' data centers bank on for business continuity. In fact, there were dozens of request routers at Google. As certain routers became overloaded, fail-over to alternate request routers worked exactly as planned. Redundancy, check. Fail-over, check. What could go wrong? The daisy-chaining (cascading) failure that occurred has a very reasonable cause: not considering the "Factor of Safety."

I was educated as a materials science engineer. Although my degree has limited applicability to my career, there are a few things I learned that I apply to IT systems design. A big one is a structural engineering principle called the Factor of Safety (FoS). In layman's terms, the FoS is simple: it is the ratio of the load a structure can actually bear to the load it is expected to carry. If you're building a bridge to support 100 cars and you construct it to support at least 1,000, that is a FoS of 10.

If your car is driving over a bridge alongside 99 trucks, each carrying a dozen lead pipes, this FoS seems reasonable. You want to be sure you're going to be OK, and a FoS of 1 or 2 doesn't guarantee that. The engineering discipline states that components whose failure could result in substantial financial loss, serious injury, or death can use a safety factor of four or higher -- most often ten, like our bridge.

So what am I getting at? Google didn't have the appropriate FoS applied to their failover/business continuity strategy. I don't know the factor they used, but my guess goes something like this:

We have 10 routers at 75% capacity. Every time a router starts peaking over 80%, it conservatively considers itself overloaded and passes its load off to the remaining nine routers. Those nine now have to carry the work of ten, which puts each of them at roughly 83% (750 / 9), so they quickly peak over 80% too. The daisy chain continues until you can't get your e-mail.
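To make that guess concrete, here is a minimal Python sketch of the daisy chain. Everything in it is an assumption for illustration: the numbers are my hypothetical ones (10 routers, 75% utilization, 80% overload threshold), load is spread evenly, and exactly one router drops out per round. It is not a model of Google's actual request routers.

```python
def simulate_cascade(n_routers=10, utilization=0.75, threshold=0.80,
                     initial_failures=1):
    """Return a list of (active_routers, load_per_router) for each round.

    Assumptions (hypothetical): load is spread evenly across the active
    routers, and whenever the per-router load is over the threshold,
    one more router declares itself overloaded and drops out.
    """
    # Total work, measured in units of one router's full capacity.
    total_load = n_routers * utilization
    active = n_routers - initial_failures
    history = []
    while active > 0:
        per_router = total_load / active
        history.append((active, per_router))
        if per_router <= threshold:
            break  # the survivors can absorb the load; cascade stops
        active -= 1  # another router sheds its traffic
    return history


def cascade_is_contained(history, threshold=0.80):
    """True if the last recorded round settled at or under the threshold."""
    return bool(history) and history[-1][1] <= threshold


if __name__ == "__main__":
    for active, load in simulate_cascade():
        print(f"{active} routers active, each at {load:.0%} of capacity")
```

Run with the defaults, the first round already shows nine routers over the 80% line, and every later round is worse than the one before. Drop the starting utilization to 70% and the same single failure is absorbed in one round.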

Both the FoS and the load-balancing configuration they had with their active router hardware were wrong. They may have felt they were adequately prepared by "engaging" more hardware than was needed. In reality, a FoS of at least four would have meant running at least four times the number of routers they had.
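As a rough worked example of what that headroom buys, the sketch below counts how many routers a fleet can lose before the survivors cross the overload threshold. The function name and the figures (7.5 router-units of load, an 80% threshold, fleets of 10 and 40) are my own hypothetical values carried over from the guess above, not anything Google published.

```python
import math


def max_tolerated_failures(n_routers, total_load, threshold):
    """How many routers can drop out before the rest exceed the threshold.

    total_load is in units of one router's full capacity; threshold is
    the fraction of capacity at which a router calls itself overloaded.
    Assumes (hypothetically) that load is spread evenly.
    """
    # Fewest routers that can carry the load without going over threshold.
    needed = math.ceil(total_load / threshold)
    return max(0, n_routers - needed)


if __name__ == "__main__":
    load = 10 * 0.75  # ten routers' worth of traffic at 75% utilization
    for fleet in (10, 40):  # the original fleet, and the same fleet at FoS 4
        spare = max_tolerated_failures(fleet, load, 0.80)
        print(f"{fleet} routers: can lose {spare} before overload")
```

Under these assumptions the 10-router fleet has no room to lose even one router, while the 40-router fleet can lose 30 and still stay under the threshold.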

Business continuity means asking the question "what then?" after the question "what now?" has been answered. It's preparing for the after-shock of the original effect you initially designed to address.

I understand that, after a point, a certain FoS might become unreasonable, but you can still account for it in either your disaster recovery plan or service level agreement. For example, you can state that you can handle a 400% increase in load in your local data centers, after which you will need X amount of downtime to divert traffic to secondary, public cloud-based environments. Or, if (God forbid) a data center is destroyed, that the backup data center may only be able to handle X applications.

So the takeaway is this. What's your FoS? On a bridge, 2 doesn't cut it. And unfortunately, we're all on that bridge. When we're dealing with possibilities of substantial financial loss, serious injury, or death, my engineering discipline tells me we've got to think harder than ... "We have a DR center."

We can't just think about "what if?"; we have to think about "what then?" The answers aren't much different (process-wise or technically) from what you already have, but thinking them through will make this a safer place.


-J
