Virtual Private Cloud: Are You Getting Your Money’s Worth?
If you are running a virtual private cloud (VPC) within a public cloud like Amazon Web Services (AWS) or Azure, there are certain things you know you will get. If you order a specific machine, for instance, it is almost certain you will get a machine with the compute resources advertised by the provider.
The more uncertain factor is the network. Will you get enough bandwidth, low enough latency, and predictable network performance? It is not a simple question, because there are many paths within a VPC, and between the VPC and remote elements, where network performance might be an issue.
Poor or unpredictable network performance in your data center can hurt user experience, breach service level agreements (SLAs), cause connections to time out, and lead to failures in interconnected systems. It is exactly the same in your virtual private cloud, just harder to measure and optimize, and with additional “network hops” to account for, such as the hop users make from a local network to the public cloud.
Cloud providers like Amazon, Azure, and Rackspace do not provide guarantees as to network performance. It is up to you, the cloud user, to make sure you are getting your money’s worth. If there is a performance issue, you need to do something about it: optimize on your own, work it out with your cloud provider, or, in extreme cases, even move off the public cloud.
What Affects Network Performance in Your VPC?
There are a few important factors that can dramatically affect network performance:
- Physical location of your machine instances in the data center
- Resources shared by other cloud users—noisy neighbors
- The network interface that sends traffic from your machine instance to the network
- Connections to cloud services outside your VPC network, such as Amazon S3
- Connection to external systems outside your public cloud
- User connections from your on-premises network to the virtual private cloud
Below we explain each of these factors in more detail. Our discussion is partly based on an enlightening talk at AWS Re:Invent (SDD419) by Becky Weiss, Principal Software Engineer at Amazon EC2 Networking.1
We chose to focus on Amazon’s Virtual Private Cloud because of its popularity, but the discussion also applies to other cloud providers. For a comparison of VPC features between Amazon and Microsoft Azure, refer to this blog post from 8K Miles.
It is well known that network performance is affected by the physical location of the computers communicating with each other. The closer the two machines are, and the less physical wiring between them, the better network latency is likely to be. Amazon offers a feature called “Placement Groups.” This allows you to request, when launching machine instances, that they run close to each other in Amazon’s data center and participate in a low-latency, 10 Gbps network. This can significantly improve latency. However, as Amazon’s Becky Weiss says, “your mileage may vary,” and even with this feature turned on, you cannot know in advance how much faster it will be.
*Source: AWS Re:Invent SDD419 Presentation
“Noisy neighbors” are a well-known problem in the public cloud. Within public cloud infrastructure, physical machines are typically used to host several virtual machines (VMs) belonging to different users. These are your “neighbors.” The cloud providers are able to accurately divide compute power between the different VMs, so your neighbors won’t “steal” your central processing unit (CPU) and memory. However, disk I/O and network I/O can be strongly affected.
If your neighbors are “noisy” with respect to disk and network, meaning they make frequent use of these interfaces, your VMs will experience performance issues. It is difficult to tell how many physical Network Interface Cards your physical machine has and how they are divided between the VMs running on the machine.
The noisy neighbor problem is unpredictable, and there is no indication that it is happening other than the symptoms of reduced performance. There are solutions, such as the Dedicated Hosts feature on Amazon, or setting up automation that detects performance issues, drops the affected VM, and moves its workloads to another one, but both of these solutions have a significant cost.
A critical element in network performance on the cloud is the “virtual network interface controller (NIC),” a piece of software that takes network packets transmitted by software in the machine instance and transfers them out to the physical network. The following diagram shows how this works on Amazon EC2.
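The detection half of such automation can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the sample values, and the 2x threshold are hypothetical, and a positive result only signals degraded performance, which is the symptom described above, not proof that a neighbor is the cause.

```python
from statistics import median


def degradation_suspected(baseline_ms, recent_ms, factor=2.0):
    """Return True when the recent median latency exceeds factor x the baseline median.

    baseline_ms: latencies (ms) collected while the VM was known to behave well.
    recent_ms:   latencies (ms) collected in the most recent measurement window.
    factor:      hypothetical threshold; tune it for your own workload.
    """
    if not baseline_ms or not recent_ms:
        raise ValueError("both sample lists must be non-empty")
    return median(recent_ms) > factor * median(baseline_ms)


# Hypothetical sample data, for illustration only.
baseline = [1.0, 1.2, 1.1, 0.9]
recent = [4.8, 5.5, 5.1]
print(degradation_suspected(baseline, recent))
```

What to do after a positive result (for example, replacing the VM) depends on your provider's tooling and is not shown here.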
*Source: AWS Re:Invent SDD419 Presentation
It requires some CPU cycles to move the packet from the machine instance to the virtualization layer, and from there to the physical NIC, which sends the packet out to the physical network. This takes time, and causes an inherent loss of CPU performance.
Amazon introduced a more efficient model called “Enhanced Networking” in some of their machine instances. Enhanced Networking instances run on machines equipped with Single Root I/O Virtualization (SR-IOV) technology, which connects the machine instance directly to a physical network without going through the virtual NIC. Look at the performance difference:
*Source: AWS Re:Invent SDD419 Presentation
As you go from the 50th percentile of network packets to the 90th, 99th, and higher percentiles, the pink bars, which show “Enhanced Networking,” indicate dramatically better and much more predictable network performance.
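To make the percentile view concrete, here is a minimal sketch that computes percentiles from a list of latency samples using the nearest-rank method. The function name and the sample data are assumptions made for illustration; they are not taken from the AWS presentation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies (ms): most packets are fast, a few are very slow.
latencies = [1.0] * 98 + [50.0] * 2

for pct in (50, 90, 99):
    print(f"p{pct}: {percentile(latencies, pct)} ms")
```

With data shaped like this, the median looks healthy while the 99th percentile exposes the slow tail, which is why the higher percentiles matter more than the average.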
It is not just a matter of the network being “slow.” Virtual NICs can cause jitter or unpredictable network latency, which can have a big impact on unified communications and rich media applications like voice and video. For example, if you plan to pipe Voice over Internet Protocol (VoIP) traffic through a cloud server, you might experience noise and interruptions in calls due to the virtual NIC’s unpredictable performance.
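One way to put a number on jitter is a smoothed estimator in the style of the interarrival jitter calculation that RTP (RFC 3550) uses: each new difference between consecutive delays nudges the running estimate by one sixteenth of the gap. The sketch below assumes you already have a list of per-packet delay measurements in milliseconds; the function name and inputs are hypothetical.

```python
def smoothed_jitter_ms(delays_ms):
    """Running jitter estimate: J = J + (|D| - J) / 16, where D is the
    difference between consecutive delay samples (the RFC 3550 smoothing rule)."""
    jitter = 0.0
    for previous, current in zip(delays_ms, delays_ms[1:]):
        difference = abs(current - previous)
        jitter += (difference - jitter) / 16.0
    return jitter


# Hypothetical delay samples (ms): a steady path versus an erratic one.
steady = [20.0, 20.0, 20.0, 20.0]
erratic = [20.0, 45.0, 18.0, 60.0]
print(smoothed_jitter_ms(steady), smoothed_jitter_ms(erratic))
```

A steady path yields a value near zero even when its absolute delay is high, which is the distinction between "slow" and "unpredictable" drawn above.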
Clearly it is crucial to know whether some sort of “enhanced” networking is offered for your specific cloud provider and machine instances, and how well it is performing; otherwise, you can experience a major loss of performance.
It is a frequent use case to have VPC machine instances connect to other cloud services, such as Amazon S3, Azure Blob Storage, or database services. It is possible to host these services within the VPC, but often users will want to leverage the Platform as a Service features of a cloud like Amazon. For example, you might store big data on Amazon S3 and scale up easily without worrying about disk space, or get highly available MySQL delivered as a service, as in Amazon Relational Database Service.
When the VPC connects to other services on the same cloud, there are two ways the traffic can flow:
- You can set up an “endpoint” on a machine instance inside your private network and use that to directly communicate with cloud services.
- Or, in Amazon, you can go through the Network Address Translation (NAT) machine instance that facilitates communication with the public Internet.
Naturally, the former option will be faster.
Even with a direct connection, there is a big network hop within the cloud provider’s data centers to enable you to pull data from S3 into your virtual private cloud. If you are pulling a lot of data, or need the data very fast, this can become an issue.
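A back-of-the-envelope calculation shows why large transfers make this hop matter. The sketch below assumes decimal units (1 GB = 8 gigabits) and a sustained throughput that you would have to measure yourself; both numbers in the example are hypothetical, and protocol overhead is ignored.

```python
def transfer_time_seconds(size_gb, throughput_gbps):
    """Idealized transfer time: data size in gigabits divided by sustained throughput."""
    if throughput_gbps <= 0:
        raise ValueError("throughput_gbps must be positive")
    return size_gb * 8 / throughput_gbps


# Hypothetical example: pulling 100 GB at a sustained 1 Gbps versus 10 Gbps.
print(transfer_time_seconds(100, 1))
print(transfer_time_seconds(100, 10))
```

Real transfers will take longer than this lower bound, since throughput is rarely sustained at the nominal rate.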
Your VPC’s required network bandwidth will be driven by the number of simultaneous users accessing your VPC and the amount of data they transfer to and from the cloud. If a machine communicates frequently with an external service, for example, a representational state transfer application program interface (REST API), that traffic also needs to be counted in your total bandwidth. You need a clear understanding of the bandwidth each machine instance on your VPC requires and the bandwidth you are actually receiving, which can change from time to time.
It is not only about bandwidth. As we mentioned above, there are several issues on the cloud that can affect latency and jitter. If you are running applications or services that are sensitive to either of these, for example, video, voice, or gaming, you need to watch these indicators as well.
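The bandwidth requirement described above is simple arithmetic, sketched here under stated assumptions. The per-user rate, the API traffic figure, and the headroom margin are all hypothetical parameters introduced for illustration; substitute values measured on your own workload.

```python
def required_bandwidth_mbps(concurrent_users, kbps_per_user, api_kbps=0.0, headroom=0.25):
    """Estimate required bandwidth in Mbps.

    concurrent_users: simultaneous users accessing the VPC.
    kbps_per_user:    average traffic per user, in kilobits per second.
    api_kbps:         additional machine-to-service traffic (e.g. REST API calls).
    headroom:         safety margin as a fraction (0.25 adds 25%); an assumption.
    """
    total_kbps = concurrent_users * kbps_per_user + api_kbps
    return total_kbps / 1000 * (1 + headroom)


# Hypothetical workload: 200 users at 500 kbps each, plus 2,000 kbps of API traffic.
print(required_bandwidth_mbps(200, 500, api_kbps=2000))
```

Comparing this estimate with the throughput you actually measure shows whether an instance is under-provisioned.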
Because so many factors affect VPC network performance, most of which are outside your control, it is important to test it to make sure you are getting your money’s worth for your cloud investment.
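As a first, very rough test, you can time TCP connection setup between two machines using nothing but the Python standard library. This is a minimal sketch, not a substitute for a proper measurement tool: the function name is hypothetical, the host and port are whatever reachable service you choose, and connection time captures only one aspect of latency.

```python
import socket
import time


def tcp_connect_latency_ms(host, port, samples=5, timeout=2.0):
    """Time several TCP connection setups to host:port and return the results in ms."""
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection returns once the TCP handshake has completed.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        results.append(elapsed * 1000.0)
    return results
```

For example, calling `tcp_connect_latency_ms("10.0.0.12", 443)` against a hypothetical instance address returns five timings in milliseconds, which you can collect at different times of day to see how much they vary.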
Ixia is a leader in network testing and visualization, helping some of the world’s largest organizations test their networks and applications in traditional, virtualized, and cloud environments. Based on our experience, we wrote a quick guide that will help you understand what to measure in VPC performance, how to do a basic performance measurement on your own with free open source tools, and how to conduct a comprehensive measurement as part of a continuous testing process using specialized equipment.