Tips to create a Machine Learning job

Hello guys,

I'm trying to create an ML job that detects anomalies using an index pattern containing DNS traffic logs (Cisco Umbrella). These logs include a category for each DNS resolution, so I can tell whether a request went to a Malware or otherwise suspicious site.
I'd like to deploy some ML jobs to identify that kind of "anomaly", meaning the events categorized as Malware, for example. Or simply to see what ML can tell me about anomalies in general, not specifically about Malware.

The thing is, I don't fully understand how ML jobs work, even after reading the documentation. In the log, Malware events are the least frequent, so they should be easy to find using ML.

What kind of ML job do you recommend I create? Single metric, multi-metric, population...?
What information can I get?

The structure of the parsed log is as follows:

{
  "_index": "myindex",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "_score": null,
  "_source": {
    "path": "/...",
    "Timestamp": "2020-03-03 22:49:27",
    "file": {
      "name": "2421459_5c27c809c3f9c2a4d9aaba22472f976d5a5813b7-dnslogs-2020-03-03-2020-03-03-22-40-0076.csv.gz"
    },
    "source": {
      "ip": "10.10.10.10"
    },
    "event": {
      "module": "DNS",
      "action": "Blocked"
    },
    "dns": {
      "type": "DNSLog",
      "repose_code": "NOERROR",
      "answers": {
        "type": "Malware"
      },
      "op_code": "1 (A)",
      "question": {
        "name": "mail.look251.com."
      }
    },
    "host": "localhost.localdomain",
    "message": "{\"sourceFile\":\"2421459_5c27c809c3f9c2a4d9aaba22472f976d5a5813b7-dnslogs-2020-03-03-2020-03-03-22-40-0076.csv.gz\",\"EventType\":\"DNSLog\",\"Timestamp\":\"2020-03-03 22:49:27\",\"MostGranularIdentity\":\"DNS\",\"Identities\":\"DN\",\"InternalIp\":\"10.10.10.10\",\"ExternalIp\":\"10.10.10.10\",\"Action\":\"Blocked\",\"QueryType\":\"1 (A)\",\"ResponseCode\":\"NOERROR\",\"Domain\":\"mail.look251.com.\",\"Categories\":\"Malware\"}\r",
    "@timestamp": "2020-03-03T21:49:27.000Z",
    "@version": "1"
  },
  "fields": {
    "@timestamp": [
      "2020-03-03T21:49:27.000Z"
    ]
  },
  "highlight": {
    "dns.answers.type": [
      "@kibana-highlighted-field@Malware@/kibana-highlighted-field@"
    ],
    "message": [
      ",\"Categories\":\"@kibana-highlighted-field@Malware@/kibana-highlighted-field@\"}"
    ]
  },
  "sort": [
    1583272167000
  ]
}

The data is parsed according to ECS, including exactly the fields I want to use (DNS category, event action, etc.).

Any ideas?
Thank you very much!

Regards,

You're going to most likely use the count function to track the occurrence rate of these types of messages over time.

You should probably pre-filter the types of messages you want to track (i.e. dns.answers.type:"Malware" or dns.answers.type:"whatever else" or ... ) and save that filter as a Saved Search (in Kibana). Then, use that saved search as the basis for your ML job (instead of every document in the index).
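As a rough illustration of that pre-filter, here is a minimal sketch of an equivalent Elasticsearch query DSL clause built as a plain Python dict. It assumes dns.answers.type is mapped as a keyword field, and the "Phishing" category and the build_category_filter helper are hypothetical names used only for this example.

```python
import json


def build_category_filter(categories):
    """Return a query DSL clause matching documents in ANY of the given categories."""
    # A "terms" query matches when the field equals any one of the listed values,
    # so a document only needs to belong to one of the categories of interest.
    return {
        "bool": {
            "filter": [
                {"terms": {"dns.answers.type": list(categories)}}
            ]
        }
    }


# "Malware" comes from the sample document; "Phishing" is a made-up second category.
category_filter = build_category_filter(["Malware", "Phishing"])
print(json.dumps(category_filter, indent=2))
```

The same clause could serve as the query behind a Saved Search or as the query of a datafeed, so the job only ever sees the categories you care about.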

Probably a multi-metric job - again, using count ("Count(Event rate)") as the thing you track, and choose dns.answers.type as the "split field".
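For reference, a job like that might look roughly like the following when expressed as anomaly detection job and datafeed configurations. This is only a sketch written as Python dicts, not output from the wizard: the job ID, the 15m bucket span and the influencer list are assumptions, and it assumes the wizard's split field corresponds to partition_field_name on the detector.

```python
import json

# Hypothetical job ID, used only for illustration.
JOB_ID = "dns-category-rate"

job_config = {
    "description": "Event rate per DNS category (illustrative sketch)",
    "analysis_config": {
        # Assumed bucket span; tune it to how quickly you need to detect changes.
        "bucket_span": "15m",
        "detectors": [
            {
                # "count" models the number of documents per bucket.
                "function": "count",
                # One independent model per value of the split field.
                "partition_field_name": "dns.answers.type",
            }
        ],
        # Assumed influencers, taken from fields in the sample document.
        "influencers": ["dns.answers.type", "source.ip"],
    },
    "data_description": {"time_field": "@timestamp"},
}

datafeed_config = {
    "job_id": JOB_ID,
    # Index name taken from the sample document.
    "indices": ["myindex"],
    # Replace with the pre-filter query if you only want certain categories.
    "query": {"match_all": {}},
}

print(json.dumps({"job": job_config, "datafeed": datafeed_config}, indent=2))
```

Splitting on dns.answers.type means a jump in Malware events is judged against the normal Malware rate, not against the much larger overall DNS volume.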

There are more advanced ML techniques that can be employed on DNS data (including DNS Tunnelling/Exfiltration detection, etc.). Look at the sample jobs within the SIEM app and other examples on this forum (like this one: Security Analytics Recipes - DNS Data Exfiltration)

@richcollier thank you very much for the help and the tip. I'm going to try it your way.

So once the ML job is created, will it work on the full data, including real-time incoming logs? Or does this approach only work for this exact case, detecting anomalies in that one set of data?

Because if I'm pre-filtering the data, ML is not actually finding the anomalies in a data set; I'm giving it a filtered set in which the anomalies are already "found", which keeps ML from finding them itself... I wonder.

ML processes data in chronological order and it only "looks at" the data once. So, if you have ML learn on some historical data, but then ask it to run on-going (in "real-time") then it never looks back at the old raw data ever again.

ML finds deviations in the value or rate of things over time. If you normally get X occurrences of Y per hour/day/whatever, and now you're getting 10X - that's anomalous.
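To make the "10X" idea concrete, here is a deliberately simplified toy in Python. It is not how Elastic ML models data (the real models are considerably more sophisticated); it only compares the latest bucket count against the mean and standard deviation of earlier counts. The function name, the threshold of 3 and the sample counts are all made up for illustration.

```python
from statistics import mean, pstdev


def is_rate_anomalous(history, current, threshold=3.0):
    """Flag `current` if it sits more than `threshold` standard deviations
    away from the mean of the previously observed per-bucket counts."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > threshold


# Made-up hourly counts of some event type ("X occurrences of Y per hour").
history = [10, 12, 9, 11, 10, 8, 11, 10]

print(is_rate_anomalous(history, 11))   # close to the usual rate
print(is_rate_anomalous(history, 100))  # roughly ten times the usual rate
```

The point is that the rate is judged against what has been normal for that data, not against a fixed number you chose in advance.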

If you want a detection/alert for every time you see X, then you don't even need ML. You just need a standard threshold-based alert.
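By contrast, an "alert every time you see X" rule needs no model at all. A minimal Python sketch of that logic, using the field layout from the sample document above (the function name and the second event are hypothetical):

```python
def malware_events(events):
    """Return every event whose DNS category is exactly "Malware"."""
    return [
        event
        for event in events
        if event.get("dns", {}).get("answers", {}).get("type") == "Malware"
    ]


events = [
    # Shaped like the _source of the sample document in the question.
    {"dns": {"answers": {"type": "Malware"}, "question": {"name": "mail.look251.com."}}},
    # Made-up benign event for contrast.
    {"dns": {"answers": {"type": "Search Engines"}, "question": {"name": "example.com."}}},
]

for hit in malware_events(events):
    print("ALERT:", hit["dns"]["question"]["name"])
```

Every match raises an alert, whether it is the first Malware event of the day or the thousandth.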

Thank you very much for the explanation! All clear.

Regards,

@richcollier I was thinking about it...
Just to make sure I understand the concept of Machine Learning for Anomaly Detection in Elastic: if I want to detect Malware that exploits 0-day vulnerabilities or behaves differently from common Malware, is it not possible to train an ML job to detect this kind of Malware in real time?
I mean, if I provide a dataset containing samples of Malware behavior to show the algorithm how to detect it, then maybe when I start a datafeed on real-time data it can detect the anomalies.
Is that possible?

Anyway, what's the best approach to using ML in Elastic? Could you give me an example of a PoC?
Thank you very much for your support.

Kind regards,

The notion of "training on malware behavior" is a Supervised Learning approach to ML. Elastic's Anomaly Detection is Unsupervised - meaning that you do not tell it what is good or what is bad - it merely detects changes to data's behavior along a certain dimension. For example, in the case of DNS Exfiltration detection, we don't actually tell Elastic ML that "DNS Exfiltration looks like THIS" - we instead say "If exfiltration via DNS is happening, the way that it will occur is that it will leverage encryption in the subdomain part of the DNS requests. If you see some major behavior change in the amount of data encrypted in the subdomain, detect it and alert me as it might be actual data exfiltration".
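As a rough intuition for "amount of data in the subdomain", here is a toy Python sketch that measures the Shannon entropy of the subdomain part of a query name. It is only an approximation for illustration, not the recipe's implementation (Elastic ML has its own info_content family of detector functions for this). The helper names are hypothetical, the long query name is invented, and the subdomain split naively assumes the registered domain is the last two labels, which is wrong for suffixes like co.uk.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(text).values()
    )


def subdomain_of(fqdn, registered_labels=2):
    """Naively strip the last `registered_labels` labels to get the subdomain."""
    labels = fqdn.rstrip(".").split(".")
    return ".".join(labels[:-registered_labels])


# First name is from the sample document; the second is an invented,
# encoded-looking query of the kind a tunnelling tool might generate.
for name in ["mail.look251.com.", "a9f3k2z8q1x7c5v0b4n6m.look251.com."]:
    sub = subdomain_of(name)
    print(sub, len(sub), round(shannon_entropy(sub), 2))
```

A detector would then watch for a sustained jump in that kind of measure per registered domain, not for any single fixed value.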

Supervised Learning is a different approach. You can do Supervised Learning (and model inference) using Data Frame Analytics.

Also, here's a blog that describes malware detection using Outlier Analysis that you might find interesting!

Again, thank you very much for your explanation. All clear!!!

Regards,