Building a data lake? Tips for advanced data analysis.
Data can now be collected from nearly every corner of an organization’s operation and mined to develop business insight, create new services, or reduce operating expenses. One of the key technology enablers for the next generation of advanced analytics is the data lake. A network visibility solution can help your data lake work more efficiently.
Value of a data lake
Faced with massive volumes and heterogeneous types of data, organizations are finding that in order to deliver insights in a timely and cost-efficient manner, they need a data storage and analytics solution that offers more agility and flexibility than traditional data management solutions. The data lake has emerged as a solution for securely storing raw data, both structured and unstructured, in one central repository. Retailers, healthcare providers, and providers of sales and marketing services are actively benefitting from next-generation data analysis. When designed well, a data lake serves as a consolidation point for data from multiple sources and makes data available for faster and more accurate analytic processing.
Over the last five years, data repositories have shifted toward cloud-based architectures to reduce costs and achieve flexibility. On-premises data lakes can be prohibitively expensive to scale, requiring more and more specialized platforms to get the required performance. Plus, many scale-up technologies for on-premises solutions become less reliable and more error-prone at tens or hundreds of terabytes, a scale that is increasingly common. Moving data to the cloud also makes it easier to combine company data with other available, industry-specific data sources.
A user example: Fraud detection
A key use case for data lakes is to support advanced security solutions used to identify behavioral anomalies and other indicators of compromise. One example comes from Blackhawk Network, a provider of gift and payment card offerings. Blackhawk had been maintaining a data warehouse on-premises for five years to track their business. They wanted to use that history to help them predict card fraud and reduce losses. They engaged Cloudwick, a pioneer in next generation big data and data science, to help them implement a cloud-based data lake, which serves as the data set for their advanced fraud detection solution. With the new solution, Blackhawk has replaced its complex, ineffective rule-based framework with a more efficient algorithm that detects fraud more quickly. Their scalable and flexible cloud-based data pipelines are extremely cost-effective. For Blackhawk, identifying and investigating fraud with advanced analytics translates into huge savings.
A network visibility platform can fill a data lake
It’s easy to see how a data lake enables analytics, but the devil is in the details. The key challenges are filling the data lake with usable data and enabling efficient access to that data. A network visibility solution—which harvests and processes raw network packets—can be integrated with the data lake to do just that. Network packet data has long been the key source of truth for preventing, detecting, and remediating security attacks. Leveraging this data has become more difficult because of the exploding volume and velocity of network traffic, the complexity of hybrid environments, and the proliferation of siloed security solutions and forensic data sets. A network visibility solution using network taps and a powerful ‘network packet broker’ (NPB) is able to load large volumes of packet and flow data into the data lake, without limit and at low cost.
To ensure the data lake includes all relevant data, a visibility solution begins by tapping and aggregating traffic from every segment of a hybrid IT environment. Ixia’s network visibility solution includes virtual and cloud taps to make sure traffic moving between virtual machines and traffic flowing on cloud infrastructure is also captured and available for analysis. Raw traffic is then processed to eliminate duplications, strip away unnecessary headers, and filter data to create the subset needed for analysis. Ixia’s Vision ONE NPB is able to decrypt SSL traffic; filter traffic based on details such as application type, user location, or device; and generate NetFlow data when required. Finally, a dedicated and powerful packet broker delivers filtered results at high speed, without packet loss, to any and all inspection, forensics, and analysis solutions.
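The deduplicate-then-filter steps described above can be illustrated with a minimal Python sketch. It is only a conceptual model, not how Vision ONE or any packet broker is implemented: the `PacketRecord` type, its fields, and the function names are hypothetical, and a real broker works on raw packets at line rate, typically within a short time window.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketRecord:
    # Hypothetical, heavily simplified record; real packets carry many more fields.
    src: str
    dst: str
    app: str
    payload: bytes


def deduplicate(packets):
    """Keep the first copy of each packet, e.g. when two taps saw the same traffic."""
    seen = set()
    unique = []
    for pkt in packets:
        digest = hashlib.sha256(
            b"|".join([pkt.src.encode(), pkt.dst.encode(), pkt.payload])
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(pkt)
    return unique


def filter_by_app(packets, wanted_apps):
    """Keep only packets whose application type is needed for analysis."""
    return [pkt for pkt in packets if pkt.app in wanted_apps]


def prepare(packets, wanted_apps):
    """Deduplicate first, then filter, so the lake only receives the useful subset."""
    return filter_by_app(deduplicate(packets), wanted_apps)
```

Calling `prepare(packets, {"http", "dns"})` on a capture that contains repeated HTTP packets and some SSH traffic would return one copy of each HTTP and DNS packet and nothing else.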
Tips for optimizing your data lake
In addition to using a network visibility solution, you should consider a few additional tips to maximize the value of your data lake.
1. Plan for data diversity
Ensure your data lake is ready to accept data generated from any source, in any format, and in any environment (on-premises or in the cloud). Applications migrate, new services are adopted, and technologies shift; make sure your data lake is designed to accommodate new data sources and new analysis tools, as your needs evolve.
The market intelligence firm Transforming Data with Intelligence (TDWI) states, “the data lake is all about free-form data aggregation…because discovery-oriented exploration and analytics need large samples of data, lightly restructured (if at all) and aggregated from numerous sources.” Network packet data is a particularly rich source of information for cybersecurity analysis since it cannot be modified by hackers. Packets are evidence of actual interactions and an excellent source when looking for anomalous behavior.
2. Keep an eye on performance
You will need to exercise restraint as you build and manage your data lake. Early hype led many users to load large volumes of data into the lake and then let users fend for themselves. ‘Data dumping’ leads to redundant data (which skews outcomes), nonauditable data (which no one trusts), and poor query performance (which undermines the goal). Plus, collecting data you won’t use is expensive and inefficient.
Proper data preparation can accelerate the speed of queries. Many data sources contain a significant volume of white noise—data that is of no value to security solutions or investigations. Data filtering can be applied at the collection point, within the visibility platform, to remove white noise and reduce the volume of data that must be forwarded to forensics and other analysis solutions. Eliminating data that does not need to be analyzed also reduces the cost of processing.
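As a rough illustration of filtering at the collection point, the Python sketch below drops records that match a noise rule and tracks how much volume is saved. The record layout and the `NOISE_PROTOCOLS` set are assumptions made for the example; which traffic counts as white noise depends entirely on your own environment and analysis goals.

```python
# Hypothetical noise rule: protocols assumed to have no investigative value here.
NOISE_PROTOCOLS = {"arp", "ntp", "mdns"}


def drop_noise(records, noise_protocols=NOISE_PROTOCOLS):
    """Return (kept_records, bytes_in, bytes_out) after removing noise records."""
    kept = []
    bytes_in = 0
    bytes_out = 0
    for rec in records:
        bytes_in += rec["size"]
        if rec["protocol"].lower() in noise_protocols:
            continue  # never forwarded, so it costs nothing downstream
        kept.append(rec)
        bytes_out += rec["size"]
    return kept, bytes_in, bytes_out


sample = [
    {"protocol": "HTTP", "size": 1200},
    {"protocol": "ARP", "size": 60},
    {"protocol": "NTP", "size": 90},
    {"protocol": "DNS", "size": 150},
]
kept, bytes_in, bytes_out = drop_noise(sample)
reduction = 1 - bytes_out / bytes_in
print(f"forwarded {len(kept)} of {len(sample)} records, volume reduced by {reduction:.0%}")
```

Tracking bytes in and bytes out makes the saving measurable, which helps when deciding whether a filter rule is worth keeping.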
3. Ensure strong governance
Organizations need solid governance as they build out new intelligence and analytics platforms. Along with protecting sensitive data, governance rules and policies are expanding to ensure appropriate data stewardship, data curation, and quality control. Any tasks that can be automated will make it easier for organizations to scale governance as their analysis systems grow in size and complexity.
The visibility platform can play a role here by removing or obfuscating sensitive data before it is made available to security analysts or sent to intelligence solutions. This practice will help maintain privacy and satisfy new legislation designed to support “the right to be forgotten.”
A data lake can introduce powerful new insights to strengthen your cybersecurity. Plan your platform to support a wide diversity of data sources, prepare data to be processed with maximum efficiency, and ensure strong governance over each step of the data analysis process.
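One way to picture obfuscation before data reaches analysts is the Python sketch below, which replaces a user identifier with a keyed hash and redacts digit runs that look like payment card numbers. The field names, the `SECRET_KEY` value, and the card pattern are all hypothetical, and the approach is a generic masking technique rather than a description of any particular visibility product.

```python
import hashlib
import hmac
import re

# Hypothetical key; in practice it would come from a managed secrets store.
SECRET_KEY = b"replace-with-a-managed-secret"

# Rough stand-in for payment card numbers: any run of 13 to 16 digits.
CARD_PATTERN = re.compile(r"\b\d{13,16}\b")


def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable keyed hash so records can still be correlated."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]


def obfuscate(record: dict) -> dict:
    """Return a copy of the record with sensitive fields masked; the input is not modified."""
    cleaned = dict(record)
    if "user_id" in cleaned:
        cleaned["user_id"] = pseudonymize(cleaned["user_id"])
    if "message" in cleaned:
        cleaned["message"] = CARD_PATTERN.sub("[REDACTED]", cleaned["message"])
    return cleaned
```

Because the same input always maps to the same pseudonym, analysts can still follow one user's activity across records without seeing the real identifier.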
Learn more about achieving security in the cloud at: Ensure Cloud Security.
 Cloudwick website: Case Study: Blackhawk Network, accessed online Nov. 15, 2018.
 TDWI: Data Lakes: Purposes, Practices, Patterns, and Platforms, March 29, 2017, accessed at tdwi.org.