Failure Terminology–Cisco Troubleshooting
While it is often tempting to start putting a new piece of flat-pack furniture together without bothering to read the long instruction sheet, it is almost always a better idea to understand how the parts are labeled and how they all fit together before using hammers and screwdrivers. Troubleshooting is no different; it is deceptively simple on the outside but easily leads to long sessions of wasted time.
This section begins with the labels—which part is what—and then moves into the half-split troubleshooting process.
What Is a Failure?
The first thing network engineers need to do about failures is decide:
• What is a failure?
• Among all possible failures, which one is the most important?
The first question might seem odd. We all know what a failure is when we see it, right? Consider this, however. Which of the following would be considered a failure in a network of 10,000 devices, and which would not?
• None of the 10,000 devices connected to the network can communicate with any other device.
• Only 5,000 of the 10,000 devices connected to the network can communicate with any other device.
• 2,000 of the 10,000 hosts cannot connect to a specific web server.
• 200 of the 10,000 hosts cannot connect to a specific web server.
• 200 of the 10,000 hosts report poor performance when connecting to a specific web server.
Almost anyone looking at this list will classify the first item as a “failure” but not the last item. However, even the last item on the list might count as a failure if the application is critical. Defining what counts as a failure is therefore extremely important. The failures in this list are also a good place to start defining the triage list.
Triage determines how important a failure is and in what order operators should work on failures.
The first failure on this list might require calling out every available network engineer to troubleshoot different network parts, while the last might be something an engineer can look at in a couple of days. Triage plays a vital role in organizing the day-to-day work of a network operations center (NOC).
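As a rough illustration of triage, the following Python sketch orders a few of the failures from the list above by a simple score. The `Failure` class, the `IMPACT_WEIGHT` values, and the scoring formula are all hypothetical choices made for this example; a real NOC would define its own severity rules.

```python
from dataclasses import dataclass

# Hypothetical weights for each kind of impact. These numbers are
# illustrative assumptions, not values taken from the text.
IMPACT_WEIGHT = {
    "no_connectivity": 1.0,
    "service_unreachable": 0.6,
    "degraded": 0.3,
}


@dataclass
class Failure:
    description: str
    affected: int               # devices or hosts affected
    total: int                  # devices or hosts in the network
    impact: str                 # key into IMPACT_WEIGHT
    critical_app: bool = False  # True if a critical application is involved


def triage_score(failure: Failure) -> float:
    """Return a score; a higher score means work on the failure sooner."""
    score = (failure.affected / failure.total) * IMPACT_WEIGHT[failure.impact]
    # Assumption: a critical application raises the priority tenfold.
    return score * 10 if failure.critical_app else score


def triage(failures):
    """Return the failures ordered from most to least urgent."""
    return sorted(failures, key=triage_score, reverse=True)


failures = [
    Failure("200 hosts report poor performance", 200, 10_000, "degraded"),
    Failure("No device can communicate", 10_000, 10_000, "no_connectivity"),
    Failure("2,000 hosts cannot reach a web server", 2_000, 10_000,
            "service_unreachable"),
]

for failure in triage(failures):
    print(f"{triage_score(failure):.3f}  {failure.description}")
```

The total outage sorts first and the performance complaint last, matching the intuition described above. Setting `critical_app=True` on the last item raises its score, which reflects the point that a small problem can still matter a great deal when the application is critical.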
Failure Frequency
Beyond the word failure, there are a few essential terms network engineers should know. Figure 22-1 illustrates these critical terms.
Figure 22-1 Important Failure Terms
• Mean time between failures (MTBF) is the average time between one failure and the next, such as the time between the last and current failures. Engineers often need to consider the MTBF for failures of this kind, for any failure of this severity, and for any failure at all. These different measurements help determine whether an individual component or part of the network is unstable or whether the entire network is unstable.
• Mean time between mistakes (MTBM) is a little tongue-in-cheek, but human mistakes cause most information technology failures. A short MTBM, or a high mistake frequency, might indicate a fragile or overly complex system.
• Dwell time is the amount of time the failure exists before being detected. The dwell time should be determined as part of the failure post-mortem.
• Mean time to repair (MTTR) is how long the service, network, or other resource was unavailable. The MTTR might be measured from when the failure occurs, in which case it includes the dwell time, or it might be measured from when the failure is reported or discovered.
• Mean time to innocence (MTTI) is another tongue-in-cheek term. It describes the time it takes for the network team to prove the problem is not the network.
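The terms above can be made concrete with a little arithmetic. The following Python sketch computes MTBF, dwell time, and both variants of MTTR from a list of incident records. The `Incident` class, its field names, and the sample timestamps are hypothetical, and the sketch assumes MTBF is measured from the start of one failure to the start of the next.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    occurred: datetime  # when the failure actually began
    detected: datetime  # when the failure was discovered or reported
    repaired: datetime  # when service was restored


def mean(deltas):
    """Average a non-empty list of timedelta values."""
    return sum(deltas, timedelta()) / len(deltas)


def mtbf(incidents):
    """Average gap between the starts of consecutive failures (assumption)."""
    if len(incidents) < 2:
        raise ValueError("MTBF needs at least two incidents")
    starts = sorted(i.occurred for i in incidents)
    return mean([later - earlier for earlier, later in zip(starts, starts[1:])])


def dwell_time(incident):
    """Time the failure existed before anyone detected it."""
    return incident.detected - incident.occurred


def mttr(incidents, include_dwell=True):
    """Average repair time, measured from occurrence or from detection."""
    if include_dwell:
        return mean([i.repaired - i.occurred for i in incidents])
    return mean([i.repaired - i.detected for i in incidents])


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 30),
             datetime(2024, 1, 1, 2, 0)),
    Incident(datetime(2024, 1, 11, 0, 0), datetime(2024, 1, 11, 1, 0),
             datetime(2024, 1, 11, 3, 0)),
    Incident(datetime(2024, 1, 31, 0, 0), datetime(2024, 1, 31, 0, 30),
             datetime(2024, 1, 31, 1, 0)),
]

print("MTBF:", mtbf(incidents))
print("Mean dwell time:", mean([dwell_time(i) for i in incidents]))
print("MTTR including dwell time:", mttr(incidents))
print("MTTR from detection:", mttr(incidents, include_dwell=False))
```

The `include_dwell` flag mirrors the two ways of measuring MTTR described above. Filtering `incidents` by failure kind or severity before calling `mtbf` would give the per-kind and per-severity figures.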
Network engineers imported many of these terms from other fields, so you will probably hear them in other contexts. For instance, MTBF is a common system engineering term, and dwell time is a typical security engineering term.