How to stop your network being grounded by device failures
Flight operations at Southwest Airlines have returned to normal after a router failure caused the cancellation of approximately 2,300 flights over a four-day period. The airline said that a network router failure took down several of its computer systems on the afternoon of Wednesday 20th July, with the outage continuing for around 12 hours when backup systems didn’t work as expected. This, combined with the problems occurring in the middle of the summer travel season, triggered further problems: rebooking of stranded customers was made more difficult because planes were already nearly full, and airline crews were unable to get to their flights after being stranded in other cities.
This incident shows that even a relatively routine problem with a network device can ripple across a network, causing widespread disruption with a very real impact on services, customers, and reputation. Needless to say, the social media fallout from this incident wasn’t pretty.
How can you mitigate the risk of something similar happening in your business? There are two key strategies that should be embedded into your networking and IT security practices:
- Remove single points of failure
The crucial point here is to move away from serial inline deployments, in which traffic is passed from one network or security appliance to the next. If any device fails, the entire traffic path fails and a network outage is the result. This means that relatively small device problems can have a massive impact on business operations, and when a device does fail, the failure can be difficult to identify and isolate too.
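To put a rough number on the serial inline problem: if every device in the chain must be up for traffic to flow, and failures are assumed to be independent, the availability of the path is the product of the individual availabilities. The sketch below uses made-up figures (five devices at 99.9% each) purely for illustration; they are not drawn from the Southwest incident or from any vendor.

```python
def serial_availability(device_availabilities):
    """Availability of an inline chain in which every device must be up.

    Assumes device failures are independent of one another.
    """
    result = 1.0
    for availability in device_availabilities:
        result *= availability
    return result


# Hypothetical chain: five inline appliances, each available 99.9% of the time.
chain = [0.999] * 5

path_availability = serial_availability(chain)
hours_down_per_year = (1 - path_availability) * 24 * 365
single_device_hours_down = (1 - 0.999) * 24 * 365

print(f"Path availability: {path_availability:.5f}")
print(f"Expected downtime: {hours_down_per_year:.1f} hours/year "
      f"(vs {single_device_hours_down:.1f} for one device)")
```

Under these assumptions the chain is down roughly five times as long as any single device in it, which is why each extra inline appliance quietly erodes the availability of the whole path.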
The simple alternative is to use modular bypass switches in front of each firewall and other security appliances. These switches continually monitor all inline devices, ensuring that they are ready to receive traffic. If a device goes down, the switch steers traffic around it until it is back online.
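The monitoring behaviour described above can be sketched as a small simulation. This is not any vendor's implementation: the class names, the `max_missed` threshold and the in-memory "heartbeat" are all hypothetical stand-ins for what a real bypass switch does in hardware by sending heartbeat packets through the inline device.

```python
from dataclasses import dataclass


@dataclass
class InlineDevice:
    """Hypothetical stand-in for a firewall or other inline appliance."""
    name: str
    healthy: bool = True

    def answers_heartbeat(self) -> bool:
        # A real switch would send a heartbeat packet through the device
        # and wait for it to come back; here we just read a flag.
        return self.healthy


class BypassSwitch:
    """Steers traffic around a device that stops answering heartbeats."""

    def __init__(self, device: InlineDevice, max_missed: int = 3):
        self.device = device
        self.max_missed = max_missed  # assumed threshold, for illustration
        self.missed = 0
        self.bypassing = False

    def poll(self) -> None:
        if self.device.answers_heartbeat():
            # Device is back (or still) online: send traffic through it.
            self.missed = 0
            self.bypassing = False
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.bypassing = True

    def path(self) -> str:
        return "bypass" if self.bypassing else self.device.name


firewall = InlineDevice("firewall")
switch = BypassSwitch(firewall)

firewall.healthy = False          # simulate a device failure
for _ in range(3):
    switch.poll()
path_during_failure = switch.path()

firewall.healthy = True           # device recovers
switch.poll()
path_after_recovery = switch.path()
```

Requiring several missed heartbeats before bypassing is a design choice in this sketch: it avoids flapping on a single dropped heartbeat, at the cost of a short delay before traffic is steered around a failed device.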
The potential security risk of traffic bypassing a crucial security device is avoided by pairing the bypass switches with network packet brokers (NPBs). This introduces the added ability to see and inspect inside network packets, and route them only to the appliances that are appropriate for that type of traffic. For example, only SSL-encrypted traffic is routed through devices designed to decrypt, inspect and re-encrypt it. In turn, unnecessary burdens on security devices are eliminated, and both their performance and their lifespan are easier to manage.
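As a rough illustration of the routing idea, the sketch below maps each traffic type to the chain of appliances that should see it. The protocol labels, tool names and the `tool_chain` function are hypothetical; a real NPB classifies packets in hardware using far richer criteria than a single protocol field.

```python
# Hypothetical routing table: which appliances should see which traffic.
ROUTES = {
    "tls": ["tls_inspector", "firewall"],   # decrypt, inspect, re-encrypt
    "http": ["web_app_firewall", "firewall"],
}
DEFAULT_CHAIN = ["firewall"]


def tool_chain(packet: dict) -> list:
    """Return the ordered list of appliances a packet should pass through."""
    return ROUTES.get(packet.get("protocol"), DEFAULT_CHAIN)


encrypted = tool_chain({"protocol": "tls", "dst_port": 443})
dns_query = tool_chain({"protocol": "dns", "dst_port": 53})
```

In this sketch only the encrypted flow reaches the decryption appliance, while the DNS query goes straight to the default chain, so the more expensive device is not loaded with traffic it cannot usefully inspect.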
Finally, once bypass switches and NPBs are in place, they should be configured for resilience, delivering high availability during normal operations while fully protecting traffic when (and it is when, not if) a device does go down.
- Test networks for robustness
Once you have built a network architecture designed to remove single points of failure and give you comprehensive, real-time visibility into your network traffic and performance, the second principle is straightforward. Your network and devices should be continually tested, not during specific, one-off testing windows, but as part of your standard IT operations. You should test using realistic loads and cyberattack tactics before any application or network changes are deployed, so that key areas of vulnerability are fixed before those changes go live.
The Southwest Airlines incident shows how even a relatively innocuous IT problem can rapidly turn into a major issue. The strategies outlined here, removing single points of failure from networks and testing networks regularly, go a long way to mitigating the risk of a similar event grounding your business.