Imagine if the next system you installed just never failed. Ever. It just worked, again and again and again. Whether you deploy servers, network systems, or application software, isn't this your dream? To be able to deploy solutions that users rely on without fail. It is possible, but not by insanely doing the same things we do today and expecting a different result.
I just finished a book by Gartner Research Analyst Kenneth McGee called Heads Up that makes a strong case that for every disaster, be it a business failure, natural disaster, economic collapse, or even a terrorist attack, there are always warning signs. After the fact we are always able to point out the predictors, the warnings, the telegraphed signals of impending catastrophe. Far too frequently we ignore these signs, or worse yet, we can't pick them out amongst the chatter of static and noise we call data.
The first thing we do after every calamity is to ask how we can prevent a recurrence. McGee cites many instances of disasters after which we have changed our mode of operations, focused on the relevant data, and learned to avoid a repeat. The 1900 hurricane that leveled Galveston, Texas, the stuck relief valve behind the partial meltdown at the Three Mile Island nuclear facility, and the O-rings that failed to seal in cold weather and destroyed the Space Shuttle Challenger all led to investigations and changes to prevent another disaster (OK, so Galveston took a while).
He methodically challenges the assumptions that there is too much data to analyze, that life is just too unpredictable, and that surprises are a natural part of the business world. If you ever get a chance, it is fascinating to read the press releases distributed by the National Oceanic and Atmospheric Administration (NOAA) prior to the arrival of Hurricane Katrina. Even though NOAA offered detailed and urgent warnings, many people claim that they were not warned of the severity of the storm.
Think about the world of Information Technology and how many times we reboot servers or restart applications as a proactive intervention to avoid system crashes, application freezes, and the ire of our user communities. I've even seen cases where a scheduled application restart is described as a fix. Um, er, restarting an application is not an acceptable fix - it is at best a stopgap measure and at worst the deliberate denial of a problem.
The reality is that we should be able to predict the majority of our application failures. But the first step is refusing to accept stupid IT tricks as remedies. The next time you have to fix a bug, take the time to collect data and symptoms about the problem, develop a hypothesis, formulate a response, and test thoroughly. But don't put your fix into production just yet. You need to understand how the bug got past your certification process the first time.
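One concrete way to practice that workflow - capture the symptom, form a hypothesis, verify the fix - is to write a regression test against the exact failing input before the patch ships. Here is a minimal sketch; the function name (parse_quantity) and the whitespace bug are hypothetical, invented purely to illustrate the habit:

```python
def parse_quantity(text: str) -> int:
    """Parse a user-entered quantity string into an int.

    Hypothetical fix: suppose the original version called int(text)
    directly, which raised ValueError on inputs with surrounding
    whitespace such as " 3 " -- the symptom from the bug report.
    """
    return int(text.strip())


def test_whitespace_input():
    # Regression test: the exact failing input, captured before the
    # fix went in, so this bug can never silently return.
    assert parse_quantity(" 3 ") == 3


def test_plain_input():
    # Confirm the fix did not break the already-working case.
    assert parse_quantity("3") == 3
```

The point is not the one-line fix; it is that the failing case is now a permanent part of the test suite, so the next step - asking how the bug escaped certification in the first place - starts from recorded evidence rather than memory.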
Consider the programmers at the Johnson Space Center in Texas who write code for the Space Shuttle. They know how to write code, remarkably well. Here is an excerpt from an article about this team (full copy here):
What makes it remarkable is how well the software works. This software never crashes. It never needs to be re-booted. This software is bug-free. It is perfect, as perfect as human beings have achieved. Consider these stats: the last three versions of the program -- each 420,000 lines long -- had just one error each. The last 11 versions of this software had a total of 17 errors. Commercial programs of equivalent complexity would have 5,000 errors.
How do you get to a point where your code has only one bug per 420,000 lines? Process. When you find a bug, you don't just fix the broken code - you fix the process that led to the broken code. Think about your current situation. You have a process that generated a bug, and another process that allowed the bug to slip into production. Aren't the errors in those two processes more important than the software bug itself? Shouldn't more important bugs be addressed before lesser bugs?
So, stop merely fixing bugs. Instead, fix the process problems that allow bugs to be created. Fix the process errors that allow bugs to slip past QA testing and into production. Before you know it, you will have systems that just never fail.