Dec 12th, 2019 [EN] [Elasticsearch] Data Transforms: More Than Meets the Eye

With great power comes great responsibility

You're running at scale, with petabytes of proxy logs, desktop event streams, and endpoint security alerts at your fingertips. The thing is, human beings don't think about log lines and event fields - we think in terms of users, sessions, and vulnerabilities. But repeatedly querying many indexes across multiple petabytes seems expensive.

There's another option: data transforms, introduced in version 7.3 of the Elastic stack. Data transforms are a way to create summary indexes from existing data, either one time or on an ongoing basis.

For example, if I run a corporate intranet and collect user session logs containing a user agent, IP address, and many more bits of information, I may want to figure out whether a user has logged in from a new location or with a new device.

joseph "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" 192.0.2.42
joseph "Mozilla/5.0 (iPhone; CPU iPhone OS 11_4 like Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) CriOS/67.0.3396.69 Mobile/15F79 Safari/604.1" 192.0.2.42
bobert "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" 192.0.2.74
joseph "Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 635) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537" 203.0.113.64

Here, Joe and Bob have logged in from the corporate network IP range with their corporate devices, and all looks good, until "Joe" logs in with a non-standard device from an IP range we've never seen before! This is easy for us to see, but without querying our large collection of access logs every few seconds, how can we detect this automatically and efficiently? One way is a transform that pivots the access logs by user, previewed here:

POST _transform/_preview
{
  "source": {
    "index": "intranet_access_logs"
  },
  "dest": {
    "index": "intranet_access_logs_by_user"
  },
  "pivot": {
    "group_by": {
      "user": { "terms": { "field": "user" }}
    },
    "aggregations": {
      "total_unique_devices": { "cardinality": { "field": "user_agent.device.name" }}
    }
  }
}
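
The preview request only returns a sample of what the transform would produce; nothing is written to the destination index. To get a summary that updates on an ongoing basis, the transform has to be created and then started. A minimal sketch, assuming the logs carry an @timestamp field that the transform can use to pick up new documents (the transform ID, the timestamp field name, and the one-minute frequency and delay are illustrative choices, not taken from the setup above):

```
# Assumption: "intranet_access_logs_by_user" is a hypothetical transform ID,
# and "@timestamp" is assumed to be the time field of the source index.
PUT _transform/intranet_access_logs_by_user
{
  "source": {
    "index": "intranet_access_logs"
  },
  "dest": {
    "index": "intranet_access_logs_by_user"
  },
  "frequency": "1m",
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "user": { "terms": { "field": "user" }}
    },
    "aggregations": {
      "total_unique_devices": { "cardinality": { "field": "user_agent.device.name" }}
    }
  }
}

# Creating a transform does not start it; this call does.
POST _transform/intranet_access_logs_by_user/_start
```

Leaving out the "sync" section would make this a one-time (batch) transform instead of a continuous one.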

Or create it via the graphical interface in Kibana.

Here, we've created a simple derivative index ("Entity Centric" is the fancy name), keeping just a running count of the unique values of the user agent's device name field for each user. This secondary index remains quite small and fast to query, and as long as Joe continues to use the same devices that he has always used, all is well. However, if the count goes up because a new device has appeared, a watcher could, for example, send a Slack message to the SecOps team channel, and the team can immediately investigate in their SIEM.
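
Once the transform is running, checking on users means searching the small destination index rather than the raw logs. A sketch of such a search, which could also serve as the input of a watcher, is shown here. The threshold of 2 devices is an arbitrary value for illustration; a real check would more likely compare against each user's own baseline:

```
# Assumption: the threshold of 2 is made up for this example.
GET intranet_access_logs_by_user/_search
{
  "query": {
    "range": {
      "total_unique_devices": { "gt": 2 }
    }
  }
}
```

Each hit is one document per user, carrying the "user" and "total_unique_devices" fields defined in the pivot.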

For extra credit, can you extend this to use Elastic Machine Learning to detect when the IP in a given log line is anomalous, using another transform field? Or detect a new, non-whitelisted IP range using a data enrichment processor?

Hopefully, this quick tour of a simple data transform has gotten your imagination going on how you could use this amazing capability to summarize data quickly and automatically. Want to learn more? Sign up to watch our webinar on data transforms!
