Cloud native network visibility (part 2 in a series)
In the previous post, I discussed why you need an IDS and how you can go about deploying one with a lift-and-shift strategy. This second part of the series focuses on an alternative visibility architecture that is cloud native. I will go through the key characteristics that distinguish this architecture from the lift-and-shift approach. You will see how it minimizes the requirements on your network architecture in the cloud.
- You can adopt a cloud native visibility architecture.
- Such an architecture can remove the bulk of the operational pain introduced by NPBs in the lift-and-shift architecture.
- Such an architecture can harness agility and flexibility in the cloud.
In the last post, I described the challenges a lift-and-shift approach faces in harnessing the agility, flexibility and manageability provided by the public cloud.
Specifically, we want to achieve the following:
- Reduce operational toil of the visibility infrastructure in a volatile, changing environment
- Dynamically scale your taps with your application workloads
- Dynamically scale your IDS with incoming monitored workloads
The diagram below shows a cloud native visibility architecture that addresses these requirements in a natural way.
In the reference architecture above, notice that we removed the NPB data-plane layer completely. The benefits that come with this removal are:
- Latency decreases because there is only one tunnel from source to destination as opposed to two.
- No backhauling of data plane traffic, hence reduced transport and operational costs.
- No bottlenecks introduced by the NPBs. This means network engineers do not need to manage NPBs or deal with scaling them in or out dynamically.
Another component of the architecture is a centralized management controller. When it is offered as a managed service in the cloud, you get additional benefits to reduce toil.
- No control-plane infrastructure for the end user to maintain and operate.
- Self-service of visibility policies to the application or security engineers, which reduces the communication and context overhead with network engineers.
Finally, this architecture allows for configuration based on higher-level intent. Each agent has access to metadata about the environment it is in. Metadata can be tags, system information, metrics, or version information about the OS, an application, or a specific library. Each agent reports its metadata to the centralized management service. In the management service, you can group similar workloads together based on the metadata, as opposed to the MAC addresses, IP addresses, or subnets used in traditional configurations. As new instances of workloads or IDS come along, each agent connects to the management service and retrieves the configs for the groups that it belongs to.
To make this clearer, here are a few examples of intent-based configuration leveraging metadata.
- Send traffic from ‘WebServer’ instances to ‘WebServer’ IDS
- Send traffic from ‘Application XYZ’ instances to ‘Application XYZ’ IDS
- Send traffic from instances with ‘Library XYZ at version 123’ to ‘Zero-day XYZ123’ IDS
Intent-based config reduces toil because you can now define configs once, without tight coupling to properties such as MAC or IP addresses, which are volatile when your infrastructure or workloads churn.
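To make the grouping concrete, here is a minimal Python sketch of how a management service might resolve an intent such as “Send traffic from ‘WebServer’ instances to ‘WebServer’ IDS”. It is an illustration only: the `Agent` and `Rule` classes, the `role` and `App` metadata keys, and the agent IDs are names I made up for this example, not the API of any particular product. The one idea taken from the architecture is that membership is decided by reported metadata rather than by MAC or IP addresses.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """One agent as seen by the management service (hypothetical model)."""
    agent_id: str
    metadata: dict  # e.g. tags, OS or library versions reported by the agent


@dataclass
class Rule:
    """An intent: tap every source matching one selector and deliver the
    traffic to the tools matching another selector."""
    name: str
    source_selector: dict
    tool_selector: dict


def matches(metadata, selector):
    """True when every key/value pair in the selector is present in the metadata."""
    return all(metadata.get(key) == value for key, value in selector.items())


def resolve(rule, agents):
    """Return (source agent IDs, tool agent IDs) that the rule currently covers."""
    sources = [a.agent_id for a in agents if matches(a.metadata, rule.source_selector)]
    tools = [a.agent_id for a in agents if matches(a.metadata, rule.tool_selector)]
    return sources, tools


# Hypothetical inventory built from what each agent reported.
agents = [
    Agent("web-1", {"role": "source", "App": "WebServer"}),
    Agent("db-1", {"role": "source", "App": "Database"}),
    Agent("ids-1", {"role": "ids", "App": "WebServer"}),
]

web_rule = Rule(
    name="WebServer instances -> WebServer IDS",
    source_selector={"role": "source", "App": "WebServer"},
    tool_selector={"role": "ids", "App": "WebServer"},
)

# A new instance only has to report its metadata to be covered by the rule.
agents.append(Agent("web-2", {"role": "source", "App": "WebServer"}))
sources, tools = resolve(web_rule, agents)
```

Because the selectors never mention an address, the rule is written once and keeps working as instances come and go.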
With the centralized management service and the intent- and group-based config described previously, harnessing the elasticity of your taps and IDS sensors becomes easier because new instances are categorized automatically. Each workload instance’s agent can autonomously determine which tool instance’s agent it should send the monitored traffic to.
Let’s look at how scaling works using the “Send traffic from ‘WebServer’ instances to ‘WebServer’ IDS” example from the last section. In the diagram below, I’ve added a few more annotations showing what is happening in the control-plane config paths.
- The application scales by the native means of the cloud it resides in (for example, autoscaling on AWS) and launches Virtual Instance 2. Because the instance carries the property ‘App=WebServer’, the centralized management service automatically recognizes it as a new member of the ‘WebServer’ instances group and provides it with the configuration for the transport and its intended destination, the ‘WebServer’ IDS.
- Once the agent in Virtual Instance 2 receives the config, it knows to route the tapped traffic to the ‘WebServer’ IDS, which happens to be the same instance that Virtual Instance 1 is sending its tapped traffic to.
Now assume the CPU utilization on your ‘WebServer’ IDS has reached a critical threshold. The IDS will also scale by the native means of its environment. As the second instance of the ‘WebServer’ IDS spins up, it similarly learns its configs from the centralized service. With the newly added IDS instance in place, the source ‘WebServer’ instances automatically load balance their traffic, each sending to a corresponding ‘WebServer’ IDS instance.
As the load from the source instances decreases, your ‘WebServer’ IDS’s autoscaling will detect that and scale in to match the load. Your source instances will likewise react automatically to that change and route all traffic to the lone remaining instance of your ‘WebServer’ IDS.
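I have not said how a source agent chooses among several IDS instances, and the actual mechanism will depend on the product you use. One technique that fits the "each agent decides autonomously" model is rendezvous (highest-random-weight) hashing, sketched below in Python as an assumption on my part. Every source agent scores each IDS instance in its group and sends to the highest scorer, so no extra coordination is needed beyond knowing the current group members. The function names and instance IDs are hypothetical.

```python
import hashlib


def _score(source_id, tool_id):
    """Deterministic pseudo-random weight for a (source, tool) pair."""
    digest = hashlib.sha256(f"{source_id}|{tool_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_tool(source_id, tool_ids):
    """Pick the IDS instance a given source should send its tapped traffic to."""
    if not tool_ids:
        raise ValueError("no IDS instance available in the group")
    return max(tool_ids, key=lambda tool_id: _score(source_id, tool_id))


def assignments(source_ids, tool_ids):
    """Map every source to its chosen IDS instance for the current group."""
    return {source_id: pick_tool(source_id, tool_ids) for source_id in source_ids}


sources = ["web-1", "web-2", "web-3", "web-4"]

before = assignments(sources, ["ids-1"])               # a single IDS instance
scaled_out = assignments(sources, ["ids-1", "ids-2"])  # IDS group scaled out
scaled_in = assignments(sources, ["ids-1"])            # IDS group scaled back in
```

With one IDS instance every source maps to it. After scale-out, only the sources that now score `ids-2` higher move over, and after scale-in they all return to the lone instance. A simple hash modulo the instance count would also spread the load, but it tends to reshuffle many more sources whenever the group size changes.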
To recap, both the lift-and-shift and this software-defined visibility architecture require an agent to reside with the workload you want to monitor in the cloud. How do you deliver that agent? A reasonable approach is to deliver it in the form of a Docker image. Using a container as the lowest common denominator provides the following benefits:
- You have flexibility in agent deployment: the agent runs anywhere a host has the Docker engine, and it can also work with container orchestrators such as Kubernetes.
- You can decouple the agents’ resource requirements, such as how much CPU or memory to allocate, from those of the application and IDS workloads.
- You can hide the complexity of how the tunnels and virtual interfaces are configured.
- More fundamentally, you start thinking about composing your workloads in a loosely coupled fashion. Packet tapping becomes a sidecar to your application.
- You can update the agent easily by pinning it to a new container image.
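As a rough illustration of the resource decoupling and image pinning described in the list above, the Python sketch below builds a `docker run` command for an agent container. The `docker run` flags used (`--detach`, `--name`, `--network`, `--cpus`, `--memory`, `--cap-add`) are standard Docker CLI options, but the image name, the container name, and the choice of host networking and `NET_ADMIN` are assumptions for this example; a real agent will document what it actually needs.

```python
import shlex


def agent_run_command(image, name="visibility-agent", cpus=0.5,
                      memory="256m", capabilities=()):
    """Build the argv for running a (hypothetical) tap agent container.

    CPU and memory limits apply to the agent container only, so they are
    sized independently of the application or IDS it sits next to.
    """
    argv = [
        "docker", "run", "--detach",
        "--name", name,
        "--network", "host",   # assumption: the agent taps host interfaces
        "--cpus", str(cpus),
        "--memory", memory,
    ]
    for capability in capabilities:
        argv += ["--cap-add", capability]
    argv.append(image)         # pinned tag: change it to roll out an update
    return argv


# Hypothetical image reference; substitute the one your vendor publishes.
cmd = agent_run_command(
    "registry.example.com/visibility-agent:1.2.3",
    capabilities=("NET_ADMIN",),
)
print(shlex.join(cmd))
```

Updating the agent then amounts to calling the same function with a newer tag and replacing the running container, without touching the application image.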
Now that I have described an architecture that deals with the challenges in the monitored workload and NPB layers of the lift-and-shift architecture, I will dig deeper into the challenges in the tools layer. The question I want to throw out there is this: if most of the tools today are delivered as a full stack (i.e. having a sensor network, middleware and UI), is there a way to re-architect the tools to give you more flexibility in composing your own IDS solution? I will explore that in the next post with an example using CloudLens, Snort and ELK.