Monday, February 9, 2009
There was a minor blockage in a kind of water filter, causing moisture to seep into the plant's air system and tripping two valves that stopped the flow of cold water into the plant's steam generator. One thing you do not want in a nuclear reactor is for the steam generators, for any reason, to overheat. Clever engineers work overtime to ensure that (A) it can't happen, and (B) if it does happen, it cannot happen without being noticed. This kind of malfunction (overheating) was anticipated, and Three Mile Island had a backup system for just such an event. For reasons we still don't really know, the valves for the backup system had been closed by someone, and the indicator that should have signaled that the backup valves were closed was hidden by a repair tag hanging from a switch above it. Fortunately, the backup system itself had a backup in the form of a relief valve that should automatically trigger if the heat/pressure gets too high. As luck would have it, the relief valve stuck open instead of closing. And the indicator light that should have shown that the relief valve was in the wrong position malfunctioned at that very moment. (See Charles Perrow's Normal Accidents.)
On March 28, 1979, the world came perilously close to the China Syndrome. There are a number of lessons to be learned from this event, but what I'd like to focus on is that the systems we build today tend to be very complicated, and with that complexity comes the possibility for combinations of failures to create unimaginable disasters. In this context the term 'unimaginable' can take two forms: one describing horror, the other describing incomprehensible unpredictability. Who could have imagined the sequence of events that led to the Three Mile Island accident? Who could have predicted the six components and the ten separate events that combined to cause the Apollo 13 disaster?
"You don't understand," I told my manager, "this just can't happen." I began to quote the logical explanation first used by Mr. Spock in the Star Trek episode "Spectre of the Gun," in which Kirk and company found themselves on an alien world, about to be killed off by Wyatt Earp and Doc Holliday at a recreation of the O.K. Corral. Scotty and Spock had developed a contraption that would render them all unconscious at the critical hour, thus saving our heroes from death by six-shooters. The contraption failed, and the only explanation was that "the physical laws of the known universe were not in play." That was the explanation I offered to my manager for why two aircraft were occupying the same space in my control-tower beta-level software. My code was merely reacting to a previously undiscovered malady in the laws of physics! To answer your question: no, he didn't buy it.
Computer systems are necessarily complex - they are at least as complex as the world they attempt to imitate. Complexity is the enemy of architecture, which attempts to simplify, beautify, and homogenize (think reuse). Since the systems we build are necessarily complex, we must take steps to deal with that complexity. Our requirements and design documents must be easy to understand. Our code should be easy to understand, be self-documenting with simple, descriptive variable names and APIs, and have ample internal commentary. We should pick common architectures, solution sets, and components and then reuse them again and again - even if newer/better/faster stuff is available. Commonality is the friend of the diagnostician, uniqueness the enemy. Did you know that a commercial airline pilot is only allowed to be certified for one type of aircraft at a time? A Boeing 737 pilot cannot also be certified on an Airbus. Why do you suppose that is?
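To make the point about self-documenting code concrete, here is a minimal sketch in Python. The names, limits, and numbers are invented purely for illustration - the idea is only to contrast code that a backup diagnostician must decipher with code that explains itself.

```python
# Version 1: hard to diagnose. What are t, p, and the magic numbers?
def chk(t, p):
    return t > 147 or p > 2100


# Version 2: self-documenting. The names and comments carry the intent,
# so the person covering for you can find the fault without guessing.
MAX_COOLANT_TEMP_C = 147   # hypothetical limit, for illustration only
MAX_PRESSURE_PSI = 2100    # hypothetical limit, for illustration only

def coolant_limits_exceeded(coolant_temp_c: float, pressure_psi: float) -> bool:
    """Return True if either coolant temperature or pressure is out of range."""
    return (coolant_temp_c > MAX_COOLANT_TEMP_C
            or pressure_psi > MAX_PRESSURE_PSI)
```

Both functions compute the same thing; only the second one tells the reader why.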
There are two ways to deal with complexity, and as architects we need to embrace both. First, drive out complexity where you can: design, build, and support simple solutions that meet your needs. Second, where complexity must exist, reduce the apparent complexity through better documentation, increased training, and the use of standards. Bad stuff happens, and in a world that is becoming increasingly complicated, the bad stuff often occurs when unimaginable combinations of bad luck (i.e., "that would never happen") come together. What are the odds that birds would strike both engines of an Airbus A320 with sufficient force to shut them both down?
Take a look at whatever system you are working on today and start asking, "What would happen if..." Then, assuming you were unavailable (vacation, illness, on a plane), how would your backup discover the root cause and fix it?