It’s time for security teams to embrace security data lakes

The average corporate security organization spends $18 million annually but is largely ineffective at preventing breaches, IP theft and data loss. Why? The fragmented approach we’re currently using in the security operations center (SOC) does not work.

Here’s a quick refresher on security operations and how we got where we are today: A decade ago, we protected our applications and websites by monitoring event logs — digital records of every activity that occurred in our cyber environment, ranging from logins to emails to configuration changes. Logs were audited, flags were raised, suspicious activities were investigated, and data was stored for compliance purposes.

As malicious actors and adversaries became more active, and their tactics, techniques and procedures (TTPs, in security parlance) grew more sophisticated, simple logging evolved into an approach called “security information and event management” (SIEM), which uses software to provide real-time analysis of security alerts generated by applications and network hardware. SIEM software uses rule-driven correlation and analytics to turn raw event data into potentially valuable intelligence.
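
To make "rule-driven correlation" concrete, here is a minimal Python sketch of one such rule: raise an alert when a user accumulates several failed logins inside a short sliding window. The event fields (`time`, `user`, `action`), the threshold and the window length are assumptions chosen for illustration, not the behavior of any particular SIEM product.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def correlate_failed_logins(events, threshold=3, window=timedelta(minutes=5)):
    """Return one alert each time a user reaches `threshold` failures within `window`."""
    recent = defaultdict(deque)  # user -> timestamps of recent failed logins
    alerts = []
    for event in sorted(events, key=lambda e: e["time"]):
        if event["action"] != "login_failed":
            continue
        times = recent[event["user"]]
        times.append(event["time"])
        # Drop failures that have fallen out of the sliding window.
        while times and event["time"] - times[0] > window:
            times.popleft()
        if len(times) >= threshold:
            alerts.append(
                {"user": event["user"], "time": event["time"], "count": len(times)}
            )
            times.clear()  # start counting afresh after alerting
    return alerts


# Hypothetical sample events, for illustration only.
t0 = datetime(2021, 1, 1, 9, 0, 0)
sample_events = [
    {"time": t0, "user": "alice", "action": "login_failed"},
    {"time": t0 + timedelta(minutes=1), "user": "alice", "action": "login_failed"},
    {"time": t0 + timedelta(minutes=2), "user": "bob", "action": "login_ok"},
    {"time": t0 + timedelta(minutes=3), "user": "alice", "action": "login_failed"},
    {"time": t0 + timedelta(minutes=30), "user": "bob", "action": "login_failed"},
]
alerts = correlate_failed_logins(sample_events)
print(alerts)
```

With this sample data the rule fires once, for the three failures against the hypothetical user "alice". A production SIEM layers many such rules over far larger event streams.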

SIEM was no magic bullet (it is challenging to implement and to get working properly), but the ability to find the so-called “needle in the haystack” and identify attacks in progress was a huge step forward.

Today, SIEMs still exist, and the market is largely led by Splunk and IBM QRadar. Of course, the technology has advanced significantly because new use cases emerge constantly. Many companies have finally moved into cloud-native deployments and are leveraging machine learning and sophisticated behavioral analytics. However, new enterprise SIEM deployments are fewer, costs are greater, and — most importantly — the overall needs of the CISO and the hard-working team in the SOC have changed.

New security demands are asking too much of SIEM

First, data has exploded and SIEM is too narrowly focused. The mere collection of security events is no longer sufficient because the aperture on this dataset is too narrow. While there is likely a massive amount of event data to capture and process from your own systems, you are missing out on vast amounts of additional information such as open-source intelligence (OSINT), consumable external threat feeds, malware and IP reputation databases, and reports on dark web activity. There are endless sources of intelligence, far too many for the dated architecture of a SIEM.

Additionally, costs have exploded alongside data. Data explosion + hardware + license costs = spiraling total cost of ownership. With so much infrastructure, both physical and virtual, the amount of information being captured has exploded. Machine-generated data has grown 50x, while the average security budget grows 14% year on year.

The cost of storing all of this information makes the SIEM cost-prohibitive. The average cost of a SIEM has skyrocketed to close to $1 million annually, and that covers only license and hardware costs. The economics force teams in the SOC to capture and/or retain less information in an attempt to keep costs in check, which reduces the effectiveness of the SIEM even further. I recently spoke with a SOC team that wanted to query large datasets in search of evidence of fraud, but doing so in Splunk was cost-prohibitive and a slow, arduous process, leading the team to explore alternatives.

The shortcomings of the SIEM approach today are dangerous and terrifying. A recent Ponemon Institute survey of almost 600 IT security leaders found that, despite spending an average of $18.4 million annually and using an average of 47 products, a whopping 53% “did not know if their products were even working.” It’s clearly time for change.

Security data lakes are the next step in the security architecture evolution

SIEM solutions typically use data stored in data warehouses. A data warehouse comprises “silos” of structured, filtered data that has already been processed for a specific purpose. The process of filtering, modeling, segregating and transferring data from original sources into these compartmentalized storage units is time-consuming and expensive, and it ultimately limits, often severely, the amount of data actually being used for security analytics.

In contrast, the security data lake approach involves centralizing all of your critical threat and event data — no matter the source or format — in a large, central repository with simple access. The security-driven data stored in a data lake can be in its native format, structured or unstructured, and therefore dimensional, dynamic and heterogeneous, which gives data lakes their distinction and advantage over data warehouses.

With a data lake, you can stream all of your security data — including log files, feeds, tables, text files, system logs and more. No data is turned away and everything will be retained. The data lake automates the processing of the data when loaded (known as parsing), making it even easier for the security team to focus on the most critical elements of their job — preventing or stopping an attack.
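
As a rough illustration of parsing on load, the Python sketch below accepts lines in three different shapes (JSON, key=value pairs and free text) and stores each one with its original text intact, so nothing is turned away. The format-detection rules, the field names and the in-memory `lake` list are assumptions made for this example; a real data lake would use far more robust parsers and durable storage.

```python
import json


def parse_record(raw, source):
    """Turn one raw line into a record, keeping the original text alongside."""
    raw = raw.strip()
    if raw.startswith("{"):
        fields = json.loads(raw)
        fmt = "json"
    elif "=" in raw:
        # Whitespace-separated key=value pairs; tokens without "=" are skipped.
        fields = dict(pair.split("=", 1) for pair in raw.split() if "=" in pair)
        fmt = "kv"
    else:
        # Unstructured text is still retained, just without extracted fields.
        fields = {"message": raw}
        fmt = "text"
    return {"source": source, "format": fmt, "raw": raw, "fields": fields}


# Hypothetical inputs from three different sources.
lake = [
    parse_record('{"user": "alice", "action": "login"}', "app"),
    parse_record("src=10.0.0.5 dst=203.0.113.7 action=deny", "firewall"),
    parse_record("kernel: disk quota exceeded", "syslog"),
]
for record in lake:
    print(record["source"], record["format"], record["fields"])
```

Keeping the `raw` text next to the parsed fields is a deliberate choice in this sketch: if a parser later improves, the original data can be reprocessed without going back to the source.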

The data lake approach can be made accessible to a security team at a low cost and is a major evolution in flexibility from data warehouse solutions, which are limited and much less effective in delivering the agility and performance users need.

A security data lake can revolutionize your SOC

So, let’s cut to the chase. If you are building a security data lake, your security team will be able to focus on more strategic activities:

  • Proactive threat hunting: Sophisticated adversaries know how to hide and evade detection from off-the-shelf security solutions. Highly skilled security teams will follow a trigger — which can be a suspicious IP or an event — and find and remediate the attacker before damage occurs. The experience of the threat-hunting team is the most critical element for success; however, they are highly reliant on vast amounts of threat intelligence data so they can cross-reference what they are observing internally with the latest threat intelligence to correlate and detect a real attack.
  • Data-driven investigations: Whenever suspicious activity is detected, analysts begin an investigation. To be effective, this must be an expeditious process. With an industry average of 47 security products in use at the typical organization, it is difficult to gain access to all of the relevant data. However, with a security data lake, you stream all of your reconnaissance into your data lake and eliminate the time-consuming work of collecting logs. The value of the process lies in comparing newly observed behavior with historical trends, sometimes against datasets spanning 10 years. This would be cost-prohibitive in a traditional SIEM.
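
The cross-referencing step in threat hunting can be sketched in a few lines of Python: internal events are checked against a list of networks taken from a threat intelligence feed. The event fields, the feed contents and the function name are hypothetical, and the addresses come from ranges reserved for documentation. A real hunt would draw on far richer intelligence than a list of IP ranges.

```python
import ipaddress


def match_threat_intel(events, bad_networks):
    """Return the events whose remote IP falls inside a known-bad network."""
    networks = [ipaddress.ip_network(n) for n in bad_networks]
    hits = []
    for event in events:
        addr = ipaddress.ip_address(event["remote_ip"])
        for net in networks:
            if addr in net:
                # Record which feed entry matched, for the analyst's benefit.
                hits.append({**event, "matched": str(net)})
                break
    return hits


# Hypothetical feed entries and internal events.
threat_feed = ["203.0.113.0/24", "198.51.100.7/32"]
internal_events = [
    {"host": "web-01", "remote_ip": "203.0.113.9"},
    {"host": "web-02", "remote_ip": "192.0.2.10"},
    {"host": "db-01", "remote_ip": "198.51.100.7"},
]
hits = match_threat_intel(internal_events, threat_feed)
for hit in hits:
    print(hit["host"], hit["remote_ip"], "matched", hit["matched"])
```

The same join becomes more valuable when both sides live in one repository, because years of historical events can be rechecked whenever the feed changes.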

What software taps the power of the security data lake?

If you are planning on deploying a security data lake, you should know that you’ll need some help, as no pure plug-and-play solutions exist yet. Until then, here are three cutting-edge companies you should know about. (I am not an employee of any of these companies, but I am familiar with them and believe that each will change our industry in a meaningful way and can transform your own security data lake initiative.)

Team Cymru

Team Cymru is the most powerful security company you have yet to hear of. It has assembled a global network of sensors that “listen” to IP-based traffic on the internet as it passes through ISPs and can “see” — and therefore know — more than anyone in a typical SOC.

It built the company by selling this data to large public security companies such as CrowdStrike, FireEye, Microsoft and now Palo Alto Networks, through its recent $800 million acquisition of Expanse. In addition, cutting-edge SOC teams at JPMC and Walmart are embracing what I espouse in this very column and leveraging Cymru’s telemetry data feed. Now you can get access to this same data. You will want its 50-plus data types and 10-plus years of intelligence inside of your data lake to help your team better identify adversaries and bad actors based on certain traits such as IP or other signatures.

Varada.io

The entire value of a security data lake is easy, rapid and unfettered access to vast amounts of information. It eliminates the need to move and duplicate data and offers the agility and flexibility users demand. As data lakes grow, queries become slower and require extensive data ops to meet business requirements. Cloud storage may be cheap, but compute becomes expensive quickly as query engines are most often based on full scans.

Varada solved this problem by autonomously indexing all critical data in any dimension. Accelerated data is kept closer to the SOC — on SSD volumes — in its granular form so that data consumers can leverage the ultimate flexibility in running any query whenever they need. The benefit is a query response time up to 100x faster at a much cheaper rate by avoiding time-consuming full scans. This enables the search for attack indicators, post-incident investigation, integrity monitoring and threat-hunting. In short, Varada can help your team gain access to the data they need, get consistent and interactive performance, and stop worrying about managing usage costs or dealing with data ops.
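
To show why indexing avoids full scans, here is a generic Python sketch that compares the two lookups on a small in-memory dataset. It is a plain inverted index written for illustration and makes no claim about how Varada implements its indexing; the row layout and field names are assumptions.

```python
from collections import defaultdict


def full_scan(rows, field, value):
    """Check every row: the work grows with the size of the dataset."""
    return [i for i, row in enumerate(rows) if row.get(field) == value]


def build_index(rows, field):
    """Build a value -> row positions map once, so later lookups skip the scan."""
    index = defaultdict(list)
    for i, row in enumerate(rows):
        index[row.get(field)].append(i)
    return index


# Hypothetical flow records: 1,000 rows spread over 50 source addresses.
rows = [{"src_ip": f"10.0.0.{i % 50}", "bytes": i} for i in range(1000)]

index = build_index(rows, "src_ip")
scan_hits = full_scan(rows, "src_ip", "10.0.0.7")
index_hits = index.get("10.0.0.7", [])
print(len(scan_hits), len(index_hits))
```

Both lookups return the same rows, but the indexed lookup touches only the matching entries, while the scan reads all 1,000. The trade-off is the extra storage and the up-front cost of building and maintaining the index.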

Panther Labs

Snowflake is a wildly popular data platform primarily focused on midmarket to enterprise departmental use. It is not a SIEM and has no security capabilities. Experienced security engineers from AWS and Airbnb then created Panther Labs, a modern, cloud-first security platform for easily streaming all security data into a single data lake, making detection easy and fast, which is critical for incident response and investigations.

The company recently connected Panther with Snowflake and is able to join data between the two platforms to make Snowflake a “next-generation SIEM” or evolve Snowflake into a security data lake. It is still a new solution, but I have already seen large Splunk customers switch to Panther. It’s a cool idea with a lot of promise for the future of the SOC.

Security teams have almost universally recognized that they are losing against the bad guys. The reduced reliance on the SIEM is well underway, along with many other changes. The SIEM is not going away overnight, but its role is changing rapidly, and it has a new partner in the SOC — the security data lake.

While not a simple “off the shelf” approach, the security data lake centralizes all of your critical threat and event data in a large central repository with simple access. It can still leverage an existing SIEM, but the market is already bringing data-lake-native solutions that are much more flexible and efficient. The security data lake is an exciting step you should be considering.

Dan Shoenbaum has had advising relationships with Varada and Panther Labs.