Marc Benioff could sense the tension when he walked into a Salesforce meeting in the summer of 2000, wearing a button on his shirt with the word software crossed out in a red circle. As he describes in an excerpt from “Behind the Cloud”,
The End of Software mission and the NO SOFTWARE logo effectively conveyed how we were different. I put the logo on all our communications materials and policed it to make sure no one removed it. (They did so anyway.) I wore a NO SOFTWARE button every day and asked our employees to as well. (They did so, somewhat reluctantly.)
Benioff’s vision of an end to software didn’t mean that companies would dump their PCs and return to typewriters. No Software meant No Legacy Software: an end to lengthy installations and upgrades, systems to maintain, and large upfront contracts.
The result was a revolution in enterprise software. SaaS became the dominant computing model and happy customers drove Salesforce to the pinnacle of the tech industry.
Unfortunately, the success of SaaS resulted in a painful side effect for security teams. While on-prem systems could ship logs to a central database for monitoring and analytics, SaaS solutions generally don’t push log data to their customers. Customers are on the hook to scrape APIs instead.
The model for customers to get their data (including logs and asset details) from SaaS vendors is dominated by a client-server architecture based on REST APIs: the customer's code acts as a client, pulling data from the vendor's servers one request at a time.
While APIs are convenient for retrieving small datasets like user lists, this “middleman” approach hits a wall when large datasets need to stay synchronized. A typical API enforces a strict limit on how much data can be returned per request. And while some APIs support query logic, queries are confined to the dataset held by that vendor, so combining multiple datasets is only possible after the data has been copied from the vendor to the customer.
For example, one firewall vendor's API specification for downloading event logs caps each response at 5,000 events. These devices are designed to handle at least 10,000 events per second in a production environment, so the clear expectation is that customers will retrieve only a limited subset of activity logs. Complicating things further, there are 11 types of logs supported by this API, and the customer needs to specify which they're interested in retrieving.
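To make the scraping burden concrete, here is a minimal sketch of the cursor-based pagination loop a customer ends up writing against such an API. Everything here is hypothetical: `fetch_page` stands in for the real HTTP call, and the 5,000-event page cap mirrors the limit described above.

```python
# Sketch of paging through a hypothetical log API with a 5,000-event
# cap per response. fetch_page simulates the vendor's endpoint.

PAGE_LIMIT = 5_000

def fetch_page(log_type, cursor):
    """Stand-in for an HTTP GET; returns (events, next_cursor)."""
    total = 12_345  # pretend the vendor holds 12,345 events of this type
    start = cursor or 0
    end = min(start + PAGE_LIMIT, total)
    events = [{"id": i, "type": log_type} for i in range(start, end)]
    next_cursor = end if end < total else None
    return events, next_cursor

def download_all(log_type):
    """Page through the API until the cursor is exhausted."""
    events, cursor = [], None
    while True:
        page, cursor = fetch_page(log_type, cursor)
        events.extend(page)
        if cursor is None:
            return events

logs = download_all("traffic")
print(len(logs))  # 12,345 events, pulled 5,000 at a time over 3 requests
```

Note that this loop handles just one of the 11 log types; a real pipeline multiplies this by every log type, plus retry, rate-limit, and checkpoint logic.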
Some security vendors have built data replication tools to bypass the limitations of APIs. These tools require a complicated pipeline managed by the customer, while introducing other issues like data latency that may be measured in days.
For example, one Endpoint Detection & Response (EDR) vendor writes the following in their documentation:
[Vendor]’s data feed includes only raw events, which describe individual actions taking place on your hosts. This feed is most useful for customers who want to archive their data for longer than their [vendor] retention period, often as part of a compliance strategy for regulations like HIPAA or GDPR. We don’t perform any analysis of raw events; we leave that up to you.
Every 5 days, [vendor] puts a new batch of [vendor] data into a data directory in your S3 bucket. Each time you get an SQS notification about a new batch of data, you should consume the notification and copy the data from S3 to another location for processing.
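The consumer side of that pipeline might look like the sketch below: parse a batch notification, derive the S3 object keys, and hand them off for copying. The message shape here is an assumption for illustration; the vendor's actual notification schema will differ.

```python
import json

# Hypothetical batch notification payload; the real schema is
# vendor-specific and will not match this exactly.
notification = json.dumps({
    "bucket": "customer-export-bucket",
    "pathPrefix": "data/2024-06-01/",
    "files": ["events_0000.gz", "events_0001.gz"],
})

def keys_from_notification(body):
    """Extract the S3 object keys to copy out for processing."""
    msg = json.loads(body)
    prefix = msg["pathPrefix"]
    return [prefix + name for name in msg["files"]]

for key in keys_from_notification(notification):
    print(key)
```

In a real pipeline each key would then feed an S3 copy operation before the batch expires, so the customer owns yet another always-on service just to receive their own data.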
Clearly, this delayed data is not intended to support meaningful analytics. But why should customers not have easy, real-time access to their own data?
Data Sharing as an Alternative
Cloud security vendor Valtix wants to make its customers aware of cloud vulnerabilities and application attacks targeting those vulnerabilities. To do so, Valtix needs to stream and analyze terabytes of network flow logs from the customer’s cloud infrastructure provider.
Valtix looked for a way to make this data available to its customers. These are datasets that can serve use cases including threat detection, hunting, investigation, and cloud compliance. For example, a customer's threat hunting team might suspect that a certain EC2 server in their AWS environment has been compromised. Using data that Valtix collected and enriched, the customer's investigation team could check which systems and applications were accessed from the compromised server.
Valtix chose Snowflake Data Sharing as the way to make cloud and application security logs available to its customers. For customers that use Snowflake, the Valtix Data Lake provides a secure, read-only view of threat and flow data, as if the data were already in the customer's own data lake. Snowflake's multi-tenant architecture means no copying is needed to make this happen: data sharing replaces the need to copy data over an API.
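To show how little machinery this involves compared to an API pipeline, here is a sketch of Snowflake's sharing SQL. The database, share, and account names are illustrative, not Valtix's actual objects:

```sql
-- Provider (vendor) side: create a share and grant read access.
CREATE SHARE security_logs_share;
GRANT USAGE ON DATABASE security_logs TO SHARE security_logs_share;
GRANT USAGE ON SCHEMA security_logs.public TO SHARE security_logs_share;
GRANT SELECT ON TABLE security_logs.public.flow_logs TO SHARE security_logs_share;
ALTER SHARE security_logs_share ADD ACCOUNTS = customer_account;

-- Consumer (customer) side: mount the share as a read-only database.
CREATE DATABASE vendor_logs FROM SHARE vendor_account.security_logs_share;
SELECT COUNT(*) FROM vendor_logs.public.flow_logs;
```

No data moves in either direction; the consumer's queries run against the provider's tables in place, which is why the latency and rate-limit problems of the API model simply don't arise.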
Snowflake’s data sharing was originally targeted at large enterprises that had internal silos affecting their ability to get value from data. The solution brief from 2017 is an interesting read from a historical perspective, and while the core technology is similar, the scope has expanded dramatically. There’s now a Data Exchange that’s quickly expanding to connect customers and vendors across the major cloud providers and regions.
The security data lake of the future spans vendors and customers. It's the next step in the SaaS revolution, with data sharing breaking down silos and improving threat detection, governance, and metrics. It's a future that's bright, with better security and fewer headaches... and that's why I'm looking forward to the End of APIs.