Security Data Lake Implementation

Omer Singer
Aug 4, 2020 · 6 min read

Security data lake projects are taking flight but they’re a strange bird. The folks typically responsible for creating and managing the data lake are on the data analytics team but they’re new to concepts like incident response and the abomination that is the Windows Event Log. The security team, meanwhile, is new to concepts like ETL/ELT and materialized views.

This post provides an overview of an implementation strategy that is working at Snowflake customers where data and security teams combine forces to roll out a security data lake.

Like Building a Cathedral

Creating a unified, scalable and cost-effective datastore for cybersecurity is an extensive undertaking that’s more like building a cathedral than a bridge. In Building a Cathedral, author Nicolas Kemper describes the average construction time of a Gothic cathedral as 250 to 300 years, with some still in progress, like Gaudí’s Sagrada Família, which has been under construction since 1882.

Thankfully, cathedrals are not binary. A bridge that is 99% complete is useful to no one, but a single room can already serve as a cathedral. Salisbury Cathedral took a full century to build, yet services commenced almost immediately in a temporary wooden chapel.

Built between 1220 and 1320, Salisbury Cathedral is still active and home to the Magna Carta

Remember Salisbury Cathedral as you embark on your security data lake project. The following phases repeat in a cycle, with expected output from each phase and a security win at the end of each full cycle. Over time, your security data lake will deliver visibility, automation and savings not possible with any other architecture.

Phase 1: Scope

Start out by defining the scope for this cycle of your security data lake initiative. Some good questions to ask as you’re scoping:

Who is this for?

Are the users going to be Tier I SOC analysts used to “googling” their logs for domain names and IP addresses? Veteran threat hunters scripting their investigations in Jupyter notebooks? Corporate leadership looking for quarterly CIS KPI performance? Each type of user is going to have different expectations and you can’t satisfy everyone at once.

What does success look like?

Having identified your users, talk to them about the pain points and limitations of the current system, most likely a stand-alone SIEM solution and cold storage buckets held together with duct tape. Dig into their challenges and have them sign off on a Definition of Done that includes measurable criteria covering ingestion volume, retention period and query performance.

Which data sources to collect?

Collecting network firewall logs is a different process than bringing in Okta authentication events. CMDB asset inventory records from ServiceNow are valuable but require a collection strategy of their own. Define the data sources that should be in scope based on the requirements of your users. I repeat: talk to your users.

Output: Requirements doc including stakeholders list, success criteria and required data sources.

Phase 2: Collect

Having determined which sources are needed for your security data lake, design and implement a data collection architecture. Consider creating a spreadsheet with categories of data sources (e.g. SaaS apps, cloud infra, security tools), individual sources (e.g. Okta, Azure, Crowdstrike) and integration method (e.g. API, Snowpipe, data sharing).

Loading data from sources

At this point you may find that some sources will be easy to bring in while others require much more work. This is where you may want to reconnect with your target users and see whether the scope should be adjusted to drop a particularly tricky source from this iteration.

Fortunately, a solution ecosystem exists to ease your data collection. Here are some examples of what you can use to collect data into your data platform:

  • Built-in: Snowflake comes with Snowpipe to automatically stream data from cloud storage (see the sketch after this list).
  • Data sharing: The Security section of the Snowflake Data Marketplace includes providers such as Obsidian Security that cover a wide range of SaaS sources including O365 and Okta.
  • Partner Connect: Snowflake Partner Connect includes security vendors such as Hunters that load dozens of security sources directly into your Snowflake.
  • Third party: Cutting-edge security solutions from Panther Labs and Elysium Analytics come with log collection to your Snowflake as part of their managed or self-hosted solution.
  • Open source: SnowAlert and Logstash have parsers and API integrations for some security-specific sources.
  • ETL: Fivetran and Matillion help customers collect certain security-related events from sources such as AWS and Jira to their Snowflake.
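
To make the built-in option concrete, here is a minimal Snowpipe sketch. Everything in it is hypothetical: the bucket path, stage, pipe and table names are placeholders, the stage would need a storage integration or credentials in practice, and AUTO_INGEST relies on cloud-side event notifications being configured separately.

```sql
-- Minimal Snowpipe sketch (all names hypothetical). Assumes CloudTrail JSON
-- files are delivered to the referenced bucket path; a storage integration or
-- credentials and cloud-side event notifications are set up separately.
CREATE TABLE IF NOT EXISTS cloudtrail_raw (
  loaded_at TIMESTAMP_LTZ,  -- load timestamp
  raw       VARIANT         -- full event as collected
);

CREATE STAGE IF NOT EXISTS cloudtrail_stage
  URL = 's3://example-security-logs/cloudtrail/'
  FILE_FORMAT = (TYPE = JSON);

-- The pipe continuously copies newly arrived files from the stage into the raw table.
CREATE PIPE IF NOT EXISTS cloudtrail_pipe AUTO_INGEST = TRUE AS
  COPY INTO cloudtrail_raw (loaded_at, raw)
  FROM (SELECT CURRENT_TIMESTAMP(), $1 FROM @cloudtrail_stage);
```

The same pattern, a stage plus a pipe feeding a raw landing table, repeats for each cloud-storage source.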

Review your list of in-scope sources with potential integration partners and update the planning document to reflect how each source will be ingested.

Preparing data for analysis

Spend some time considering your normalization strategy. Unlike traditional SIEM solutions, data lakes provide “schema on read” capabilities that reduce the upfront parsing burden. You don’t need to get your collected log schemas “just right” to match a fixed index, and you won’t lose data if the upstream format changes. But a few up-front decisions will still make analytics easier for your end users.

Borrowing from general data lake strategy (diagram shown below), we initially load the data in its raw state. This landing table may have just two columns: a TIMESTAMP-type column with the load timestamp for easy retrieval, and a VARIANT-type column with the JSON blob of the full event as collected from the source.
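
In Snowflake SQL, such a landing table can be as small as the sketch below (table and column names are hypothetical):

```sql
-- Raw landing table: one row per event, nothing parsed up front.
CREATE TABLE IF NOT EXISTS firewall_raw (
  loaded_at TIMESTAMP_LTZ,  -- load timestamp, handy for retrieval and freshness checks
  raw       VARIANT         -- full event as collected from the source, e.g. a JSON blob
);
```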

Within the data platform, standard practice is to automate transformations that normalize, enrich and clean up the data. For example, you may have some firewalls in scope where the target TCP port field is called “dst_port” while others use “dest-port”. Keeping both names would make life harder for analysts, so choose a standard column name and enforce it in the tables where analysts will spend most of their time.
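
One lightweight way to enforce that standard is a view (or a scheduled transformation) over the raw table. The sketch below assumes the hypothetical firewall_raw table above and two firewall vendors that disagree on the port field name:

```sql
-- Normalized view over raw firewall events (field names are illustrative).
-- COALESCE picks whichever port key the source actually used.
CREATE OR REPLACE VIEW firewall_events AS
SELECT
  loaded_at,
  raw:src_ip::STRING                               AS src_ip,
  COALESCE(raw:dst_port, raw:"dest-port")::INTEGER AS dst_port,
  raw                                              AS raw_event
FROM firewall_raw;
```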

Keep your Definition of Done (DoD) in mind here and don’t try to pull every possible key into its own standard column. Raw events can always be queried directly as long as they’re in valid JSON or XML formats.
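
For example, a key that you chose not to promote to its own column can still be pulled straight out of the VARIANT column on demand (again using the hypothetical firewall_raw table):

```sql
-- Ad hoc query against an unpromoted key, straight from the raw JSON.
SELECT COUNT(*) AS udp_events_last_week
FROM firewall_raw
WHERE raw:protocol::STRING = 'udp'
  AND loaded_at >= DATEADD(day, -7, CURRENT_TIMESTAMP());
```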

Common data lake practices can be applied to security data lakes

Output: Snowflake tables and views with normalized datasets covering the in-scope data sources.

Phase 3: Analyze

The final phase of the cycle can begin once security data is flowing. But don’t bring your users in right away. The implementation team should write the first set of analytics as part of the setup process, for example to confirm visibility coverage and latency.

Quality control

Remind yourself that you’re building a cathedral and create a structure that will stand the test of time. That means automatically detecting issues such as collection outages at the source, upstream schema changes and collection agents missing from monitored systems.

Just as writing regression tests is standard practice for developers, your security data lake should automatically call out data problems. These problems can be urgent (e.g. no CloudTrail received for over two hours) or chronic (e.g. 18% of known servers have never shipped logs). Either way, identifying them will improve the quality of your security data lake and serve as a dry run for investigations. See “Are all of our servers running security agents?” for more guidance.
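
As an illustration of the urgent case, a scheduled check along these lines (using the hypothetical cloudtrail_raw landing table from the Snowpipe sketch earlier) can flag a stalled CloudTrail feed:

```sql
-- Freshness check: returns a row only when no CloudTrail events have been
-- loaded for two hours or more, so a scheduled alerting job can act on any result.
SELECT
  MAX(loaded_at)                                      AS last_event_loaded,
  DATEDIFF(hour, MAX(loaded_at), CURRENT_TIMESTAMP()) AS hours_silent
FROM cloudtrail_raw
HAVING DATEDIFF(hour, MAX(loaded_at), CURRENT_TIMESTAMP()) >= 2;
```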

Another important quality element is query performance. You should have a sense of the question types in scope for the DoD, for example “I need to know all of the workstations that have communicated with this domain over the past two years.”

Write and run relevant SQL queries that cover your end users’ expected question types, and include edge cases where possible. Don’t be afraid to run these by your users to confirm the results and the execution time. If either is not acceptable, apply the architecture change or pre-processing needed to meet requirements.
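
Continuing the example question above, one such test query might look like this sketch, which assumes a hypothetical normalized proxy_events table with src_host, domain and event_time columns:

```sql
-- DoD-style test query: every workstation that communicated with a given
-- domain over the past two years (table and column names are hypothetical).
SELECT DISTINCT src_host
FROM proxy_events
WHERE domain = 'suspicious-domain.example'
  AND event_time >= DATEADD(year, -2, CURRENT_TIMESTAMP());
```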

Showtime

Congratulations, you’re ready to bring the users back into the room. Or the Zoom, depending on the pandemic situation at the time of reading.

Your users were identified during the scoping phase. Provide them with sample queries and watch as they run them. Have them adjust queries to cover questions that you haven’t tested (e.g. “I need to know all of the domains that these workstations have communicated with over the past two years”). Do the query results and performance meet requirements?
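
The adjusted question in that example is essentially the earlier test query turned around, still against the hypothetical proxy_events table:

```sql
-- Flipped version of the earlier sketch: all domains that a set of
-- workstations reached over the past two years.
SELECT DISTINCT domain
FROM proxy_events
WHERE src_host IN ('wkstn-042', 'wkstn-107')
  AND event_time >= DATEADD(year, -2, CURRENT_TIMESTAMP());
```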

Beyond a starter pack of prepared queries, consider how you can further empower your users with dashboards and metrics. For example, share with them how they can check visibility metrics using dashboards. In a production security data lake, self-service BI dashboards meet most query requirements for business users. Finally, the definition of BI “business users” can expand to include security analysts, GRC auditors and the CISO herself.


Omer Singer

I believe that better data is the key to better security. These are personal posts that don’t represent Snowflake.