Want to avoid another Black Friday outage? Let’s enter the metrics.

December 11, 2019 by Amritam Putatunda

As Black Friday and Cyber Monday ring in the holiday shopping season, more and more consumers ditch the long lines and head to their favorite websites to shop. Go eCommerce! However, this is also the time when many websites are fully or partially down for long stretches, causing massive revenue loss. It happens so regularly that it’s surprising more isn’t done about it.

Why does this keep happening?

Most websites are perfectly architected for regular day-to-day operations. The devices in front of the webservers generally run at 50% of capacity or less. Retailers typically overprovision most of this gear so it can handle the sudden spikes that come with a big promotion or the holiday season. However, that still doesn’t prepare them for the traffic loads that arrive on Black Friday and Cyber Monday.

Are there no lessons learned?

A huge amount of scrutiny, head rolling, and revamping happens every time such an incident occurs. During the post-outage analysis, online retailers gather data to understand how long the outages lasted, what traffic volumes they saw, where the network started to struggle, and when the actual outage happened. Based on these learnings, steps are taken; the most common is to overprovision the infrastructure so it can handle the same (or slightly more) volume next time. However, two major data points are missing from this analysis:

  1. They don’t know exactly what the load will be next time.
  2. They can’t test the scale scenario until it happens again (or at least that’s what they think). So planning the next-generation infrastructure upgrade involves plenty of assumptions and guesswork. One common practice is extrapolation: since 100 simultaneous users consume 4% of my memory, I must be able to support 2,500 simultaneous users with the memory I have (the sketch below spells out that arithmetic).
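
To see why that counts as guesswork, here is the arithmetic behind the example written out as a quick sketch (the numbers are the hypothetical ones from point 2, and the linearity it assumes is exactly what tends to break at peak load):

```python
# Sketch of the naive linear extrapolation described above (hypothetical numbers).
measured_users = 100        # concurrent users observed in a small test
measured_memory_pct = 4.0   # percent of memory those users consumed

users_per_pct = measured_users / measured_memory_pct  # 25 users per 1% of memory
naive_capacity = users_per_pct * 100                  # 2,500 users at 100% memory
print(f"Naive estimate: ~{naive_capacity:.0f} simultaneous users")

# Connection tables, caches, and garbage collection rarely scale linearly,
# so this estimate can be badly off at Black Friday volumes.
```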

We need to cut the guesswork and the extrapolation

The first element can’t be quantified exactly, as there is no way to tell precisely how much more traffic will arrive next time. It’s a function of unknowns like how attractive the deals are, how good the economy is, and what the competition is offering. The second element, however, can be quantified much better, because there are methods to accurately test the scalability of applications and websites without relying on assumptions and extrapolation.

Since issues rarely appear in day-to-day operation of the webservers, it’s almost certain that traditional webserver testing has validated features like viewing web pages, logging in and out, and performing eCommerce transactions (which is why operations stay smooth 99% of the year). The outages that happen on these “special” occasions suggest that what online retailers are missing is a simulation that performs subsets of those same operations, only at a scale of millions and with a realistic mix of simulated clients.
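
As a rough illustration of what such a simulation looks like, the sketch below uses the open-source Locust load generator (my choice, not something this article prescribes) to script a mix of browse, login, and checkout actions against hypothetical endpoints, which can then be scaled up to very large client counts:

```python
# Minimal Locust sketch of a mixed shopper workload (endpoints are hypothetical).
# Example run: locust -f shopper.py --users 10000 --spawn-rate 200 --host https://shop.example.com
from locust import HttpUser, task, between


class Shopper(HttpUser):
    wait_time = between(1, 5)  # think time between actions for each simulated client

    @task(5)
    def browse_deals(self):
        self.client.get("/products?category=deals")

    @task(2)
    def view_item(self):
        self.client.get("/products/12345")

    @task(1)
    def login_and_checkout(self):
        self.client.post("/login", json={"user": "shopper", "password": "secret"})
        self.client.post("/cart/checkout", json={"items": [12345]})
```

Commercial traffic generators push the same idea to millions of concurrent clients and a much richer application mix, but the principle is identical: replay the real user journeys, not just synthetic hits on the home page.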

Also, don’t forget the attackers

Periods of high network stress are also great opportunities for attackers to cause even more disruption. That means testing scale alone is not enough; you’ll also need to layer in simulated attacks like distributed denial of service (DDoS) and exploits.
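
Purpose-built tools are needed for realistic volumetric DDoS and exploit traffic, but the idea of blending hostile and legitimate load can be sketched with the same hypothetical Locust setup as above: a second, aggressive user class hammers the login endpoint while the normal shoppers keep browsing, so the WAF, ADC, and authentication backend are exercised under both at once:

```python
# Sketch: abusive traffic mixed in with the legitimate Shopper load (hypothetical endpoint).
# This only approximates an application-layer flood; Layer 2/3 or Layer 4 DDoS and
# exploit traffic need dedicated security-test tools rather than an HTTP load generator.
from locust import HttpUser, task, constant


class LoginFlooder(HttpUser):
    weight = 3               # spawn proportionally more of these than normal shoppers
    wait_time = constant(0)  # no think time: fire requests back to back

    @task
    def hammer_login(self):
        # Repeated failed logins stress the WAF, ADC, and authentication backend together
        self.client.post("/login", json={"user": "bot", "password": "wrong"}, name="login-flood")
```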

The golden metrics: what are the real test requirements?

Based on our experience, the few companies that are doing this right test against a consistent set of metrics every time they do an upgrade. The basic idea is to ensure that individual components like the web application firewall (WAF), application delivery controllers (ADCs), load balancers, webservers, and database are each tested for their maximum scale in terms of CPU and memory, and for security efficacy while handling those loads.
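
On the measurement side, each component under test needs its CPU and memory peaks recorded while the load runs; a minimal sketch of that sampling (using the psutil library, which is my assumption rather than anything this article specifies) might look like this on each host:

```python
# Sketch: sample CPU and memory on a host under test and record the peaks.
# psutil is an assumed choice; any agent exposing the same counters on the
# WAF, ADC, load balancer, webserver, and database hosts would do.
import psutil

peak_cpu = 0.0
peak_mem = 0.0
for _ in range(60):                        # sample once a second for a minute
    cpu = psutil.cpu_percent(interval=1)   # CPU utilization over the last second
    mem = psutil.virtual_memory().percent  # memory utilization right now
    peak_cpu = max(peak_cpu, cpu)
    peak_mem = max(peak_mem, mem)

print(f"peak CPU: {peak_cpu:.1f}%  peak memory: {peak_mem:.1f}%")
```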

[Table: A sample of the metrics needed to better ensure smooth operations during the highest-peak times like Black Friday and Cyber Monday]

Of course, this is a simplified metrics table; a much more expanded version would be needed depending on the organization’s objectives and the details it wants to validate. For example, DDoS alone can be broken into several categories, like Layer 2/3 DDoS, Layer 4 DDoS, and application DDoS. The point is that if you do not have enough data points to fill every section of the table, you already have a problem, and your action item is to take steps to at least understand the metrics shown here and then expand them to fit the needs unique to your website or application.
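
One way to keep those data points organized (purely illustrative; the field names and example values below are assumptions based on the categories mentioned in this post) is a small per-component record that pairs the scale numbers with the security results:

```python
# Illustrative structure for the metrics table; field names and example values
# are hypothetical and should be expanded to match your own objectives.
from dataclasses import dataclass, field


@dataclass
class ComponentMetrics:
    component: str             # e.g. "WAF", "ADC", "load balancer", "webserver", "database"
    max_concurrent_users: int  # highest simulated-client count sustained
    peak_cpu_pct: float        # CPU utilization at that load
    peak_mem_pct: float        # memory utilization at that load
    ddos_results: dict = field(default_factory=dict)  # e.g. {"Layer 2/3": "pass", "Layer 4": "pass", "application": "fail"}


# Hypothetical example entry
results = [
    ComponentMetrics("WAF", max_concurrent_users=500_000,
                     peak_cpu_pct=82.0, peak_mem_pct=64.0,
                     ddos_results={"application": "pass"}),
]
```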

Use BreakingPoint to find your breaking point 

Ixia offers BreakingPoint, along with the Application and Threat Intelligence subscription service, to generate the client-side scaled traffic, DDoS, exploits, and malware that can help you gather and understand these metrics and capture the revenue you would otherwise lose the next time your website goes down.