Why Your Security Data Lake Project Will SUCCEED!

Omer Singer
7 min read · Oct 3, 2022

The incredible pace of innovation in cloud data platform technology has brought security data lakes back into the spotlight. Security teams are increasingly moving away from standalone SIEMs towards using the same platform as the rest of the business for cost-effective storage and analytics at scale.

But is generic big data technology actually useful for most security teams? Not everyone is convinced.

Anton Chuvakin, who defined much of how the world thinks about SIEM during his time at Gartner, tweeted a reminder that he is really not a fan of the security data lake model.

What has been his take on security data lakes? In 2017, Chuvakin wrote a post titled “Why Your Security Data Lake Will Fail!”

As you can guess from the title, he warned back then that:

…for some reason organizations think that they can build A SECURITY DATA LAKE and/or their own CUSTOM BIG DATA SECURITY ANALYTICS tools. Let me tell you what will happen — it will FAIL.

And back to tweets from last week:

So either something has changed to make security data lakes feasible or people (including myself) are being naive and a bit silly. Which one is it? Let’s first recap what security data lakes are and why Anton was right about them in 2017.

Security Data Lakes In The Past

Ten years ago, security data lakes were intended to serve as an alternate home for threat detection and response. SIEM tools were then, as now, the main place where events from across systems and tools were combined for the SOC. It’s interesting to look back a decade and see how many of the SIEM leaders in 2013 are still popular, or at least prevalent.

Security data lake proponents in the time of Anton’s original posts presented an alternative approach that sparked the imagination of many in the upper echelon of security operations centers. Could “big data” be applied to cybersecurity? In those days, that meant Hadoop.

With the promise of cheap storage and open source licensing, Hadoop would break the shackles of costly ingest and limited retention. Threats would be detected quickly with “a scalable advanced security analytics framework built with the Hadoop Community” (Apache Metron). There would bloom an “ecosystem of ML-based applications that can run simultaneously on a single, shared, enriched data set to provide organizations with maximum analytic flexibility” (Apache Spot).

Open source security data lake project circa 2015

Like the Summer of Love in ‘67, the hopeful optimism of the security data lake proponents in 2016 was soon shattered.

Hadoop vendor Cloudera promoting security data lakes in 2016

It turned out that security teams didn’t have time for a science project like Apache Spot or Metron. There was too much overhead in setting up and maintaining these systems, and not enough useful content being released. The Hadoop data lake was causing enough headaches for the core business units that depended on it, and security had alternatives available in purpose-built SIEM and log management systems.

As Anton presciently wrote in 2017:

To conclude, successful custom big data security analytics efforts remain rare outliers, like a flying car. My 2012 post was full of hope — and sadly it didn’t work out. At this point, it is very clear to me that DIY or open source is NOT the way to go for security analytics. Sure, we will continue watching both Spot and Metron, but frankly at this point I am a skeptic.

By the following year, both projects were dead.

RIP Apache Spot

Cloud data platforms change the game

With the failure of Hadoop and its open source security analytics platforms, skeptics can be excused for writing off the whole concept of using general purpose big data technology for threat detection and response. As one tweeter pointed out recently, “it has never had anything at all to do with the technology — the tech was basically the same. It had to do with everything else. Everything in this 2017 post is still valid today on the cloud.”

In that blog post, security data lakes were described as impractical and doomed to failure. The reasons ranged from collecting the data (“SIEM vendors spent 10+ years debugging their collectors for a reason”), to getting security value from the data (“somebody hired a big data company to build a security data lake; they build all the plumbing and said “ah, security use cases? you do it!” and left”), to even just keeping the lights on (“data went in — plonk!- and now nobody knows how to get it out to do analysis”).

To understand what’s changed, we need to understand what came after the plug was pulled on Hadoop: reliable, petabyte-scale data storage and compute delivered as a service. Snowflake, Redshift and BigQuery are leaders in the space.

Where every Hadoop deployment was a custom project with stability and performance varying widely by deployment, the modern cloud data platform ensures a consistent experience for Goldman Sachs, DoorDash, and you if you register with an email address at signup.snowflake.com.

This change means that software vendors can trust your data platform to deliver as good an experience as whatever they would otherwise have run under the hood of their application. While past SIEM vendors needed to build custom databases and manage them on your behalf, now they can simply plug into your data lake. This is referred to as the connected application, or warehouse-native, deployment model.

Source: “Connected Apps: The Missing Layer in the Modern Data Stack” by Arunim Samat

Cybersecurity has been among the first fields to embrace the connected application model, not least because standalone SIEMs are so challenged by today’s environment that they’ve created immense motivation for an alternative that works.

Security Data Lakes Now

The rapid adoption of cloud data platforms across enterprises of all sizes set the foundation for security data lakes to succeed this time around. Like Hadoop in the past, Snowflake and others offer cheap storage and support for the kind of semi-structured data that security teams need to analyze at scale. Unlike Hadoop, however, modern cloud data platforms don’t require a team of database administrators to build and maintain.
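What “semi-structured data at scale” means in practice is schema-on-read: events land in the lake as raw JSON and get parsed at query time, with no schema enforced at ingest. As an illustrative sketch of that pattern, here is plain Python over JSON standing in for the VARIANT-style querying a platform like Snowflake offers; the event shapes are hypothetical, loosely modeled on cloud audit logs:

```python
import json
from collections import Counter

# Hypothetical raw events as they might land in a lake: same feed,
# different shapes, no schema enforced at ingest (schema-on-read).
raw_events = [
    '{"eventName": "ConsoleLogin", "responseElements": {"ConsoleLogin": "Failure"}, "sourceIPAddress": "203.0.113.7"}',
    '{"eventName": "ConsoleLogin", "responseElements": {"ConsoleLogin": "Success"}, "sourceIPAddress": "198.51.100.2"}',
    '{"eventName": "AssumeRole", "sourceIPAddress": "203.0.113.7"}',
    '{"eventName": "ConsoleLogin", "responseElements": {"ConsoleLogin": "Failure"}, "sourceIPAddress": "203.0.113.7"}',
]

def failed_logins_by_ip(events):
    """Count failed console logins per source IP, tolerating missing fields."""
    counts = Counter()
    for line in events:
        event = json.loads(line)
        outcome = event.get("responseElements", {}).get("ConsoleLogin")
        if event.get("eventName") == "ConsoleLogin" and outcome == "Failure":
            counts[event.get("sourceIPAddress", "unknown")] += 1
    return counts

print(failed_logins_by_ip(raw_events))  # Counter({'203.0.113.7': 2})
```

The point of the sketch is the shape of the work, not the tooling: because the events keep their native structure, new fields and new log sources don’t require re-architecting the pipeline, which is exactly where rigid SIEM schemas struggle.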

More importantly, the “SaaS-ification” of big data has enabled new SOC platforms to launch as connected applications. All of the connectors, content and interfaces that standalone SIEM solutions provided are now delivered by this new generation of security products on top of the cloud data platform. Having a cost-effective and scalable security data lake no longer means having to DIY the security layer.

SOC platform running on a Snowflake security data lake

This is what’s changed and what the skeptics don’t yet understand. Big data delivered as a service has been operationalized for security teams by a growing number of highly effective security products.

Solutions like Panther, Securonix, Anvilogic and Hunters are able to serve as the SOC platform that turns any Snowflake deployment into a security data lake. Unlike Spot and Metron, these solutions have collectively raised over a billion dollars and are valued by private markets in the billions. One reason why they are being chosen over legacy incumbents in the SIEM market is that these providers don’t have to spend precious cycles developing and maintaining their data backend.
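A common pattern among these products is detection-as-code: each detection is a small function evaluated per event against the lake, version-controlled and unit-tested like any other software. The sketch below illustrates that style; it is not any vendor’s actual API, and the event shape and field names are assumptions for illustration:

```python
# Detection-as-code in the style popularized by these SOC platforms:
# each rule is a small function of one event, easy to review and test.

def rule(event: dict) -> bool:
    """Fire on console logins that occur without multi-factor authentication."""
    return (
        event.get("eventName") == "ConsoleLogin"
        and event.get("additionalEventData", {}).get("MFAUsed") == "No"
    )

def title(event: dict) -> str:
    """Human-readable alert title built from fields of the matching event."""
    return f"Console login without MFA from {event.get('sourceIPAddress', 'unknown')}"

# Hypothetical event, loosely modeled on a cloud audit log entry:
sample = {
    "eventName": "ConsoleLogin",
    "additionalEventData": {"MFAUsed": "No"},
    "sourceIPAddress": "203.0.113.7",
}

if rule(sample):
    print(title(sample))  # Console login without MFA from 203.0.113.7
```

Because the data backend is the customer’s own cloud data platform, the vendor’s engineering effort goes into rules and content like this rather than into storage plumbing.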

Compare that to IBM QRadar, for example, which relies on its own database called “Ariel” to store and analyze event data. There is no chance that the developers of Ariel are able to keep up with the engineers building Snowflake, Databricks or BigQuery. None. For example, documentation for QRadar warns that “If you don’t configure retention buckets for the tenant, the data is automatically placed in the default retention bucket for the tenant. The default retention period is 30 days, unless you configure a tenant-specific retention bucket.” (source) Modern cloud data platforms have for years eliminated this kind of limitation by separating storage from compute and leveraging cloud-native storage.

Your Security Data Lake Will Succeed

As a result of the democratization of big data platforms, and the best-of-breed SOC platforms that plug into any Snowflake deployment, there is a growing wave of security data lake success stories. Like never before, security programs of all sizes are successfully centralizing their data in a security data lake.

From the hottest startups:

To massive federal agencies:

The security data lake model has gone from being a doomed science project to the most straightforward way to achieve detection and response coverage at scale.

All of the necessary building blocks for security data lakes today are delivered as a service. They’re supported by well-funded engineering and research teams across leading vendors with diverse approaches to threat detection and response. And unlike in the standalone SIEM model, the modern security data lake enables the security team to own their data in their company’s existing cloud data platform. That also means that future security initiatives can use all of the data science and self-service dashboard capabilities of the modern data stack. And that’s why this time around, security data lake projects are succeeding — and so will yours.


Omer Singer

I believe that better data is the key to better security. These are personal posts that don’t represent Snowflake.