MTTR: The ABCs of Network Visibility
Mean time to repair (MTTR) is a direct measure of troubleshooting time. The lower this number, the better off a business and its IT department are. Higher numbers mean that it takes longer for the network to recover from a problem and that more of IT’s time is spent fighting problems. According to ZK Research, 85% of MTTR is spent just trying to figure out that there is indeed a problem. An Enterprise Management Associates report, Network Management Megatrends 2016, found that most IT teams now spend around 36% of their daily efforts on reactive troubleshooting activities.
What Is MTTR?
There are several definitions of MTTR, depending on whether the “R” stands for repair, recovery, or something else. I use mean time to repair, a definition supported by organizations like ITIL. According to Wikipedia, MTTR is a “basic measure of the maintainability of repairable items. It represents the average time required to repair a failed component or device.”
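As a rough illustration of that definition, MTTR can be computed as the total repair time divided by the number of repairs. The sketch below is a minimal Python example. The `Incident` record and the timestamps are made up for illustration and do not come from any real data set.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One failure, from detection to restored service (hypothetical record)."""
    detected: datetime
    repaired: datetime

    @property
    def repair_time(self) -> timedelta:
        return self.repaired - self.detected


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average repair time: total repair time divided by incident count."""
    if not incidents:
        raise ValueError("MTTR is undefined with no incidents")
    total = sum((i.repair_time for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative data only: three incidents lasting 2 h, 4 h, and 30 min.
incidents = [
    Incident(datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 18, 0)),
    Incident(datetime(2024, 1, 20, 8, 0), datetime(2024, 1, 20, 8, 30)),
]

mttr = mean_time_to_repair(incidents)
print(f"MTTR: {mttr}")  # 6.5 hours over 3 incidents -> 2:10:00
```

Because the average is taken over whole incidents, any time spent waiting for approvals or locating tools counts toward the result just as much as the repair work itself.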
Typical Use Cases
With regard to network monitoring, there are several specific use cases where MTTR can be lowered by deploying the correct network visibility solutions. Here are some specific situations.
- Reduce or eliminate the need for Change Board approvals – Once taps and network packet brokers (NPBs) are inserted into the network, no more network-affecting changes are needed, assuming the deployment was done correctly. Taps are passive devices and will not materially affect network traffic after they are inserted into the network. Security and monitoring tools can then be connected to the NPB at will. This can dramatically speed up troubleshooting because many Change Board approvals can be eliminated. Change Boards typically govern the production network and oversee which changes can and cannot be made to it, because such changes often cause network disruptions and outages. By eliminating these approvals, the IT department can often start troubleshooting activities immediately. There is no need to wait minutes, hours, or days for approval to connect diagnostic equipment to the network because it is already connected and ready to go.
- Reduce or eliminate the need for crash carts – As just mentioned, once taps and NPBs are inserted into the network, no more network-affecting changes are needed, assuming the deployment was done correctly. Security and monitoring tools can then be connected to the NPB at will. This can dramatically speed up troubleshooting because crash carts (special-purpose carts with a collection of triage and troubleshooting tools) are no longer required. The tools are pre-connected to the NPB. This eliminates the time spent locating the cart, moving it to the correct place, inserting it into the network, and configuring the network tools. This can be especially pertinent if troubleshooting needs to be conducted on links and equipment in remote locations. MTTR reductions of up to 80% are possible simply due to the elimination of Change Board approvals and crash carts.
- Use floating data filters – Specific NPB filters for troubleshooting can be pre-staged and connected to standby troubleshooting tools (e.g., analyzers, Wireshark). This can dramatically cut data collection times as the troubleshooting filter simply needs to be connected to an incoming network port to the NPB. This can be done remotely using a drag-and-drop interface on the NPB. Once the connection is made, the tool can start capturing critical data in less than 1 minute to reduce troubleshooting time and costs.
- Deploy NPBs that support adaptive monitoring – Adaptive monitoring is the ability of the NPB to respond to network commands and make configuration changes. This automation capability improves monitoring response times because the NPB can respond to network incidents with actions in near real-time. Commands can be received over a REST interface from network management systems (NMS), orchestration systems, SIEMs, etc. Faster responses to problems result in a shorter mean time to diagnosis and a correspondingly faster MTTR.
- Integrate NPBs with SIEMs deployed in the architecture – This is a specific use case of general automation. NPBs can be integrated with SIEMs to automate threat detection and mitigation. This allows the NPB to respond to SIEM REST calls with actions in near real-time. Specific data can be captured and forwarded to security tools for deeper inspection and analysis. The faster response time to problems results in faster incident detection, faster MTTR, and reduced network security risks.
- Implement proactive troubleshooting with application intelligence – Application intelligence uses application-related data to provide additional context about network traffic. For instance, user geolocation, device type, browser type, Border Gateway Protocol autonomous system (BGP AS) information, and application traffic change information can be used to help pinpoint problems. Looking at this information in conjunction with trouble incident reports can often shorten troubleshooting time. For instance, is the problem affecting all devices or operating systems or just specific ones? Are the incidents being reported from a specific geographic area? Is the incident related to a specific carrier or Internet service provider? These data points can be very useful in diagnosing problems.
- Eliminate the use of network switch SPAN ports – SPAN ports can drop data without any notification of the data loss. This includes corrupted data (like malformed packets, frame errors, etc.) which can be useful for troubleshooting. In addition, when a switch is overloaded, which is often exactly when there is a network switch problem, SPAN port data can be dropped without notification because this port has a lower priority than traffic ports. So, the critical troubleshooting data you want may not be forwarded to diagnostic tools and you will never know, unless you see data gaps or attach a network analyzer to the network switch to validate SPAN port output. A tap eliminates this issue because all data is forwarded on to the NPB. At that point, you choose what data is removed or not removed.
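To make the adaptive monitoring and SIEM cases above more concrete, here is a minimal sketch of the kind of REST request an NMS or SIEM might send to an NPB to connect a pre-staged filter to a network port. Everything specific in it is an assumption for illustration: the host `npb.example.net`, the URL path, the JSON field names, the port labels, and the bearer token are all hypothetical and do not describe any vendor's actual API. The request is only built and printed, not sent.

```python
import json
from urllib.request import Request

# Hypothetical NPB address; a real device documents its own base URL.
NPB_BASE_URL = "https://npb.example.net/api"


def build_connect_filter_request(filter_name: str, network_port: str,
                                 tool_port: str, token: str) -> Request:
    """Build (but do not send) a request that connects a pre-staged filter.

    The URL layout and JSON fields are invented for this sketch.
    """
    body = json.dumps({
        "network_port": network_port,  # port receiving tapped traffic
        "tool_port": tool_port,        # port where the analyzer is attached
    }).encode("utf-8")
    return Request(
        url=f"{NPB_BASE_URL}/filters/{filter_name}/connect",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Example: an alert on port P07 triggers capture to a standby analyzer.
req = build_connect_filter_request(
    filter_name="troubleshoot-http",
    network_port="P07",
    tool_port="T02",
    token="example-token",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen`, with the real endpoint, payload, and authentication taken from the NPB's own documentation.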
The following are some things to keep in mind about visibility solutions:
- Ease of use – You want the configuration of the troubleshooting data filters to be quick and simple. Time is clearly a primary factor here, so make sure that the NPB has a GUI and the ability to create both pre-staged (floating) filters and a filter library.
- Application intelligence – Make sure that the NPB supports application intelligence and application filtering. Application filtering and monitoring of data can dramatically speed up troubleshooting.
- Use taps instead of SPAN ports – This simple step removes a lot of error resolution time and speeds up data acquisition. Taps provide a complete copy of the data, so there are no concerns as to whether a SPAN port (or network switch) removed any of it.
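The application intelligence questions raised earlier (is the problem limited to one device type, one region, or one carrier?) can be sketched as a simple grouping exercise. The Python example below is a minimal illustration under stated assumptions: the report fields, their values, and the 80% threshold are hypothetical, and the AS numbers are taken from the range reserved for documentation.

```python
from collections import Counter

# Hypothetical trouble reports enriched with application-intelligence fields.
reports = [
    {"device": "iPhone", "browser": "Safari", "region": "EMEA", "bgp_as": 64500},
    {"device": "Android", "browser": "Chrome", "region": "EMEA", "bgp_as": 64500},
    {"device": "Windows", "browser": "Edge", "region": "APAC", "bgp_as": 64500},
    {"device": "iPhone", "browser": "Chrome", "region": "AMER", "bgp_as": 64500},
    {"device": "macOS", "browser": "Safari", "region": "EMEA", "bgp_as": 64511},
]


def dominant_values(reports, threshold=0.8):
    """Return {field: (value, share)} for each field where a single value
    accounts for at least `threshold` of all reports."""
    findings = {}
    if not reports:
        return findings
    for field in reports[0]:
        counts = Counter(r[field] for r in reports)
        value, count = counts.most_common(1)[0]
        share = count / len(reports)
        if share >= threshold:
            findings[field] = (value, share)
    return findings


print(dominant_values(reports))  # {'bgp_as': (64500, 0.8)}
```

In this made-up data set the incidents are spread across devices, browsers, and regions, but four out of five share one autonomous system, which would point the investigation toward a specific carrier rather than a client-side problem.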
More Information on MTTR and Network Visibility
Further information about MTTR for visibility solutions can be found in A Logistics Firm Saves Money and Speeds up Mean Time to Repair 80%. More information about Ixia network performance, network security, and network visibility solutions, and how they can help generate the insight needed for your business, is available on the Ixia website.