Performance Benchmark on Disaggregated Networks
Disaggregated networking, where hardware and software are separated, promises significant economies of scale and an open architecture. But is it too good to be true? What are the pros, cons, and potential trade-offs?
Traditionally, network switch software came bundled with the hardware, quality-tested and supported by a single vendor. Since disaggregated switch hardware and operating software can come from different vendors, organizations implementing disaggregated networking need to run performance benchmarks to ensure everything works together as it should.
In this blog, I will apply benchmark test methodologies to qualify the performance of the IP routing information base (RIB) and forwarding information base (FIB) against an integrated implementation from incumbent vendors.
I asked our system QA group to build an IP CLOS data center network with white box switches. Of course, we ordered a couple of popular network operating systems (NOS) and a top-selling leaf switch from an incumbent network equipment manufacturer (NEM) so we could benchmark the differences.
We designed the network and test topology as follows:
- 4 spine switches (white box + NOS)
- 1 leaf switch (either a white box + NOS or a brand switch) with 4 x 100G Ethernet links to the spine switches
- An IxNetwork chassis to emulate a leaf switch L2 (with 4 x 100G ports connected to the spines), a rack of simulated servers R2 behind L2, and a rack of servers R1 behind L1, connected to L1 at 100G
- Each spine and leaf was configured with EBGP for IP routing and equal-cost multipath (ECMP) across 4 paths
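With ECMP in place, each leaf spreads flows across the four spines using a hash of the packet's flow tuple. The actual hash is ASIC- and vendor-specific; the sketch below (using MD5 purely as a stand-in hash, with hypothetical addresses) only illustrates the property that matters for this test: packets of one flow always take the same path, while different flows spread across all four paths.

```python
import hashlib

def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: int = 6, num_paths: int = 4) -> int:
    """Pick an ECMP next hop by hashing the flow 5-tuple.

    Illustrative only: real switch ASICs use their own hash functions
    and seed values, not MD5.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# The same flow always maps to the same spine...
path = ecmp_path("10.0.1.2", "10.0.2.2", 49152, 80)
assert path == ecmp_path("10.0.1.2", "10.0.2.2", 49152, 80)
# ...and the result is always one of the 4 spine paths.
assert 0 <= path < 4
```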
I would typically run through RFC 2544/2889 tests to qualify switch fabric performance, but decided to stress IP CLOS RIB/FIB performance first. I picked the test methodology defined in IETF RFC 7747, section 5.1, RIB-IN convergence. This test case is designed to characterize how quickly BGP routers install routes in the RIB and push them down to the forwarding fabric's FIB.
The test starts by pumping traffic (from P1 to L1) toward a set of BGP routes advertised by L2 (emulated by IxNetwork). The time at which the BGP routes were advertised was recorded, as was the time at which traffic converged at the destination ports, measured by the IxNetwork test ports.
With these two timestamps, we can understand how efficiently an IP CLOS network converges on new routes and how accurately the switch fabric forwards traffic to the destination with no packet loss.
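The metric itself is just the difference between those two timestamps. A minimal sketch of the computation, with hypothetical field names and illustrative values (not measured results):

```python
from dataclasses import dataclass

@dataclass
class ConvergenceResult:
    """Two timestamps captured by the tester, in seconds."""
    t_advertised: float  # tester begins advertising the BGP route set
    t_converged: float   # destination ports see the full expected traffic rate

    @property
    def rib_in_convergence(self) -> float:
        """RIB-IN convergence time in the spirit of RFC 7747 section 5.1:
        elapsed time from route advertisement to traffic convergence."""
        return self.t_converged - self.t_advertised

# Illustrative numbers only, not benchmark results.
result = ConvergenceResult(t_advertised=12.000, t_converged=13.350)
print(f"RIB-IN convergence: {result.rib_in_convergence:.3f} s")
```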
Since this is a CLOS topology (spine and leaf) with ECMP, we need to account for the source traffic (100G from P1) being load-balanced across 4 different paths. Successful convergence should be confirmed by the total throughput across all ECMP ports, with no packets dropped.
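In other words, the pass/fail check sums the receive rate over all ECMP member ports and compares it against the offered rate. A sketch of that check, assuming a hypothetical per-path rate counter from the tester:

```python
def verify_ecmp_convergence(tx_pps: float, rx_pps_per_path: list[float],
                            tolerance: float = 0.001) -> bool:
    """Declare convergence only when the summed receive rate across all
    ECMP member ports matches the offered rate within tolerance, i.e.
    no packets are being dropped anywhere in the fabric."""
    total_rx = sum(rx_pps_per_path)
    return abs(tx_pps - total_rx) <= tx_pps * tolerance

# 100G of offered load split (not necessarily evenly) across 4 spine paths.
assert verify_ecmp_convergence(
    148_809_523,  # ~100G of 64-byte frames, in packets per second
    [37_500_000, 37_000_000, 37_309_523, 37_000_000],
)
```

Note that ECMP hashing rarely splits flows perfectly evenly, which is why the check compares the total rather than each path individually.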
Because this was a performance benchmark test, we now needed to define a comparable baseline:
- We tested with 5,000 and 10,000 IPv4 routes with /31 mask.
- Each test ran 3 iterations to obtain an average and to ensure the SUT (system under test) could recover reliably after each iteration.
- Lastly, we swapped out L1 with a much more expensive switch from a brand vendor and repeated all tests. I know we should have replaced all spine switches as well, but we had to keep within budget.
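The per-iteration numbers can be reduced to a comparable baseline with a simple summary. A sketch, where the stability threshold and the sample values are hypothetical, not our measured data:

```python
from statistics import mean, stdev

def summarize_iterations(samples: list[float]) -> dict:
    """Average convergence times over repeated iterations, and flag runs
    whose spread suggests the SUT did not recover cleanly between them."""
    return {
        "avg_s": mean(samples),
        "stdev_s": stdev(samples) if len(samples) > 1 else 0.0,
        # Hypothetical criterion: spread within 20% of the mean counts as stable.
        "stable": (max(samples) - min(samples)) / mean(samples) < 0.2,
    }

# e.g. three iterations of one test case (illustrative numbers only)
print(summarize_iterations([1.31, 1.35, 1.29]))
```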
Conclusion for RIB-IN Convergence Test
The brand switch with integrated OS and hardware showed higher RIB/FIB performance, even though this benchmark didn't swap out all the spine switches. I expected the results to conclusively favor the brand switch. Now, the question for network designers is: can you live with slower route convergence to gain flexibility in feature add-ons/updates and freedom from vendor lock-in? We also need to consider how network scale correlates with CAPEX savings. Another aspect is that you will need new skillsets to implement and maintain disaggregated networks. We may not have all the answers from just one benchmark test, but it's a good start to get the discussion going. Please stay tuned for my upcoming blogs on fabric forwarding performance testing (RFC 2544/2889) and BGP failover testing (RFC 7747 section 5.2).