Friday, August 2, 2013

Uh oh, the system went down: 5 rules for better troubleshooting

Troubleshooting a downed mission-critical system can be terrifying, but a slow, methodical approach can save you time 

If you've been in IT for more than a few minutes, chances are you've seen it happen: A mission-critical production system falls flat on its face, and you have absolutely no idea why or how to even begin to fix it. Moments of true terror punctuating the monotony of too many project meetings, application rollouts, and systems upgrades are really what make IT interesting -- and one reason why it's not for everyone.
The troubleshooting process of seemingly inexplicable failures can be one of the most stressful parts of the job. Unplanned downtime of a mission-critical system can invite the harshest scrutiny from coworkers and management in even the smallest of organizations, and it only gets worse as the size of the enterprise grows and the stakes get higher. That additional pressure often leads even the best engineers to make very dumb mistakes, further compounding the problem and prolonging the downtime.
Staying cool under pressure isn't easy no matter how many times you've been tossed into the fire, but there are five easy rules you can add to your emergency troubleshooting processes to get to a resolution faster, conclusively prove the cause of the outage, and avoid making things worse.
1. Do no harm
When presented with a seemingly incomprehensible problem, a natural first instinct is to dive into the problem headlong and start making a raft of changes to try a quick fix. Although this often works, and works quickly, it is just as likely to make things fabulously worse. Troubleshooting measures like rebooting an unstable system or trying automated database or file system repairs may well fix the problem and return the system to production, but they also might give up your best chance of recovering data, destroy any hope of determining the root cause, and substantially prolong the outage as a result.
Instead, the best first step in an unexplained outage is one that feels the least natural: Take a step back and carefully consider how you will undo everything you're about to try in an attempt to fix the system. That might mean making configuration backups, taking virtual machine or SAN snapshots, making copies of log files that might be lost or overwritten, and copying potentially corrupted data to an unaffected system. Doing this as a first step feels wrong because it takes up valuable time when stress levels are at their highest and because it doesn't directly do anything to solve the problem. But it does accomplish two very important goals.
First, if your troubleshooting does end up making things worse -- such as if that server you decide to reboot simply won't come back up at all -- you'll be that much more prepared to get current data stood up on a new system.
Second, if your first round of emergency button-mashing does somehow solve the problem, you'll have the data you need to reconstruct the problem so that you can try to figure out the cause later. The only thing worse than not being able to figure out how to fix a problem is fixing it without knowing how or why your fix worked -- not only will you not be able to explain the event to the masses hovering outside your cube, but you also won't be able to offer any real guarantee that it won't happen again.
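As a rough illustration of the "copy it before you touch it" step, here is a minimal Python sketch that copies log files and configuration directories into a timestamped folder. The function name preserve_evidence, the outage- folder prefix, and the MISSING.txt file are hypothetical choices made for this example, not part of any particular tool, and the sketch assumes you have read access to the sources and enough free space at the destination.

```python
import shutil
import time
from pathlib import Path


def preserve_evidence(paths, dest_root):
    """Copy files and directories into a new timestamped folder under dest_root.

    Returns the folder that was created. Sources that do not exist are skipped
    and listed in MISSING.txt so the gap is visible later.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"outage-{stamp}"
    # Fail rather than silently mix two snapshots in one folder.
    dest.mkdir(parents=True, exist_ok=False)
    missing = []
    for index, raw in enumerate(paths):
        src = Path(raw)
        # Prefix with an index so two sources with the same name do not collide.
        target = dest / f"{index:02d}-{src.name}"
        if src.is_dir():
            shutil.copytree(src, target)
        elif src.is_file():
            shutil.copy2(src, target)  # copy2 also preserves modification times
        else:
            missing.append(str(src))
    if missing:
        (dest / "MISSING.txt").write_text("\n".join(missing) + "\n")
    return dest
```

Something like preserve_evidence(["/var/log/syslog", "/etc/myapp"], "/mnt/scratch") would be the call to make before the first reboot; both paths are placeholders. Ideally the destination lives on a different system than the one that is failing.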
2. Take notes
The next thing you want to be sure to do is to make a fairly detailed log of what you observed and what troubleshooting steps you tried -- including the time wherever you can. As in the first step, this step seems as if it would sap valuable minutes in an already time-starved situation, but in fact it can save a substantial amount of time in the long run.
First, it prevents you from going in circles and trying the same things over and over -- which happens frequently when stress levels are high. Second, if you have to involve the vendor, you'll have a comprehensive list of what you've already done so that the support folks don't have you do it all over again. Third, if you find yourself pawing through error logs, you'll be able to line up the time stamps of when you tried various fixes to the time stamps in the logs. Without that, you'll often be forced to retry the troubleshooting steps so that you can isolate the log entries they generate -- costing you more time in the end.
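If a text editor feels too slow in the moment, even a tiny script can keep the time stamps honest. The following is a minimal sketch, assuming a plain text file is an acceptable place for the notes; the IncidentLog class and its note and near methods are hypothetical names invented for this example.

```python
from datetime import datetime, timedelta


class IncidentLog:
    """Append-only list of timestamped notes, mirrored to a text file."""

    def __init__(self, path):
        self.path = path
        self.entries = []  # list of (datetime, text) pairs

    def note(self, text, when=None):
        """Record what was observed or tried, stamped with the current time."""
        when = when or datetime.now()
        self.entries.append((when, text))
        # Append immediately so the notes survive if this session dies.
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(f"{when.isoformat(timespec='seconds')}  {text}\n")
        return when

    def near(self, log_time, window_seconds=60):
        """Return notes written within window_seconds of a log time stamp."""
        window = timedelta(seconds=window_seconds)
        return [text for when, text in self.entries
                if abs(when - log_time) <= window]
```

The near method covers the third point above: given the time stamp of a suspicious entry in an error log, it returns whatever you were doing within a minute of it, so you don't have to repeat the step just to see what it logs.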
3. Research carefully
If your back is really up against the wall, you'll inevitably find yourself grasping at straws when researching the problem (in other words, Googling). Unless you have an incredibly specific error on your hands, chances are you'll find several people posting that they've experienced a problem similar to the one you're stuck in.
The most important thing to do here is be very critical when you review those apparently close fits. In many cases, you'll discover that, although the symptom is the same, the circumstances are entirely different. I've seen massive amounts of time wasted in chasing the implementation of a fix for a completely unrelated problem -- a situation that could have been avoided by more careful review of the problem description.
4. Share what you know
If you're working as part of a team attacking the same problem or are staving off an angry mob of users, you'll quickly find that communication is very important -- both to keep users informed of what you're doing (and that you are, in fact, doing something) and to keep team members from stepping on each other's toes in the melee.
In a large team, a good first step is to designate one person whom you'll keep up to date with what you're doing; that person can then communicate with the affected user community and keep all team members apprised of what the others are doing.
5. Be prepared
Although there's really no way to truly prepare yourself for unforeseen problems, there are many steps you can take right now that will save you tons of time should the unexpected happen tomorrow.
For example, if you find yourself troubleshooting networking problems, have a laptop configured with a protocol analyzer, such as Wireshark, plugged into a port on your core switch. If you ever need it to troubleshoot a network problem, you'll be 15 to 20 minutes closer to getting it rolling rather than scrambling to get it set up in the heat of the moment. Having centralized network monitoring and logging tools in place will make it far easier to correlate different application and network events and to narrow down an inexplicable problem to its root cause.
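To make the correlation idea concrete, here is a minimal Python sketch that merges several logs into one time-ordered timeline. It assumes every line starts with an ISO-style time stamp (YYYY-MM-DDTHH:MM:SS) followed by a space and that each log is already in chronological order; real devices and applications log in many other formats, so the parsing would need adjusting. The function names are hypothetical.

```python
import heapq
from datetime import datetime


def parse_line(source, line):
    """Split 'YYYY-MM-DDTHH:MM:SS message' into (timestamp, source, message)."""
    stamp, _, message = line.strip().partition(" ")
    return datetime.fromisoformat(stamp), source, message


def _parsed(source, lines):
    """Yield the parsed entries of one log, tagged with the log's name."""
    for line in lines:
        yield parse_line(source, line)


def merged_timeline(logs):
    """Merge several time-sorted logs into one time-ordered list.

    logs maps a source name to an iterable of lines, each already in
    chronological order within its own source.
    """
    streams = [_parsed(source, lines) for source, lines in logs.items()]
    # heapq.merge interleaves already-sorted inputs without re-sorting them.
    return list(heapq.merge(*streams))
```

Fed a switch log and an application log, the result puts a "link down" event and the "database connection lost" message that followed it next to each other, which is the kind of side-by-side view a centralized logging tool provides out of the box.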
Putting it all together
Troubleshooting in a high-stress environment is both the least fun (in the moment) and most rewarding (afterward) experience in IT. That surge of panic and adrenaline is unlike just about anything else when working in the relative physical safety of a cubicle or data center. However, that stress will also lead you to make stupid mistakes if you don't resist the urge to jump in with both feet and force yourself to approach the problem methodically.
Now you know how to be methodical while the adrenaline is pumping.
