Out of the Box ML jobs for Security

Good Morning,

I am looking for help/advice regarding the out of the box ML jobs, specifically some of the jobs that underlie the SIEM detections. I have an environment with 4500 hosts sending endpoint data and some network data to the stack. The analysts (who are the customer) have access to the Elastic stack and an Endgame appliance. Right now they are using Endgame pretty heavily and use the logs we are ingesting to correlate some of the things they are seeing in Endgame. I am hoping to use ML and the Security app to augment what is already in Endgame.

The ideal situation is that they will alter their workflow to use the Elastic Security app and ML much more than they do now. The hurdle is that Endgame is really good, and I am having a tough time figuring out what I can do in Elastic that is a genuine value add on top of Endgame.

The issue I am running into is that the out of the box security detections in Elastic are somewhat redundant with what is being seen in Endgame. I am trying to figure out how to extract more value from the Security features within Elastic (if possible).

I have spent some time running about 10 of the out of the box ML jobs. These jobs are doing what they are supposed to do from an ML perspective and are identifying anomalies. The issue is that the environment is pretty big, complex, and diverse, so unique things happen frequently.

For instance, rare_process_by_host_windows_ecs. It is detecting a lot of benign processes being run by admin accounts, service accounts, or regular users. I could just tune these accounts out or make a running list of those processes, but doing either of those things conflicts with my idea of ML. If there were a way to pull in the hashes of these files and then throw them against a database of malicious hashes, we would be in business. We are working on that solution, but to be honest we could just run an indicator match query once we ingest the threat intel data without bothering with ML.
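Just to illustrate what I mean, once the intel is in an index the hash check itself could be as simple as a lookup like the one below. The index pattern and the ECS-style field name are guesses at how we will end up ingesting the feed, not anything we have in place yet, and the hash value is a placeholder:

GET threat-intel-*/_search
{
  "query": {
    "term": {
      "threat.indicator.file.hash.sha256": "<sha256 of the flagged process>"
    }
  }
}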

The next example is v2_windows_anomalous_user_name_ecs. Here again there are quite a few account types we can get rid of, but there is still a huge number of anomalies over a 7-day period, something like 400 once I remove all the service accounts and admins (which, again, are pretty high value accounts).

I will try to get to the point. Have you guys had any success with the out of the box jobs, or have you had to build your own jobs? I don't have a problem building my own jobs, and the out of the box jobs have given me a good starting place and a chance to learn a little more about Elastic ML, but I don't want to sink too much time into an endeavor that might not pay off.

Thanks!

Hi Alex - the ML rules for rare process by host and anomalous user name for Windows both have an anomaly score threshold of 50, which can be increased to something more like, say, 75 if that outputs fewer and more interesting anomalies, assuming you are using the ML rules to convert anomalies into alerts.

There are also some tuning approaches that may be useful here. Rare process by host, for Windows, can be applied to critical servers like domain controllers, which often have lower behavioral variance in process activity than the Windows fleet as a whole. The job named Anomalous Process For a Windows Population may give you results more to your liking for hunting anomalous process activity across your fleet.
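For context, a population job models each entity against the rest of the population rather than against its own history. Below is a minimal sketch of that kind of detector; it illustrates the shape, not the exact configuration shipped in the Windows module, and the job name, bucket span, and influencer choices are placeholders:

PUT _ml/anomaly_detectors/example_windows_process_population
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "rare",
        "by_field_name": "process.name",
        "over_field_name": "host.name",
        "detector_description": "process names that are rare across the population of Windows hosts"
      }
    ],
    "influencers": ["host.name", "process.name", "user.name"]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}

The over_field_name is what makes it a population analysis: the rarity of a process is judged against what the whole fleet runs, not just against one host's past behavior.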

If you have a cluster of benign processes in the anomaly results that share a commonality, that can also lead to a simple tuning measure. For example, if a cluster of benign but unusual processes has the same user context, the same code signer, the same parent process, or the same hostname, that commonality could be used as a filter in the datafeed query.
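As a rough sketch, that kind of exclusion lives in the datafeed query. Clauses like the ones below could sit alongside the process-event filter the datafeed already has; the signer and account names are placeholders for whatever commonality you find, and it is best to clone the pre-built job before editing:

{
  "bool": {
    "filter": [
      { "term": { "event.category": "process" } }
    ],
    "must_not": [
      { "term": { "process.code_signature.subject_name": "Some Benign Vendor" } },
      { "terms": { "user.name": ["svc_backup", "svc_patching"] } }
    ]
  }
}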

Turning to the username anomalies, are the service accounts and admins being scored above 50, the default threshold for an alert? I would have thought those accounts would tend to be denser in the log data and less likely to be outliers with high scores. If these users are associated with automation or housekeeping processes that run only occasionally, they could also be excluded using one of the filtering approaches for process events described in the previous paragraph.

Another approach I am exploring is to correlate the ML alerts with additional alerts in order to produce a sort of meta-alert that is more investigation-ready. Alerts can be joined on a common field, like host.name, using a transform, to look for cases where a host has a set of alerts including some combination of machine learning anomaly detection or beaconing classification, query-based alerts, and threat intel matches. Using OSQuery data, it will be possible to apply additional logic to these meta-alerts, such as considering the server role - is it a critical server like a domain controller? - and the privilege level of a user. Stay tuned for more on this.

Hope this helps, let me know. Regards, Craig


Craig,

Thanks so much for this response.

I am going to make the adjustment to the jobs to increase the threshold to critical. I think that is a great place to start to at least minimize the amount of stuff that we are seeing.

I really like the idea of applying the job to only servers like DCs. It seems like the best way to do that with one of the out of the box jobs would be to clone the job and alter the datafeed query to ensure that it is only pulling in data from server types we are interested in. What do you think of that approach?
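Concretely, I am imagining adding a clause along these lines to the cloned datafeed query (the hostnames are placeholders for our actual DC names, or whatever host grouping field we settle on):

{
  "bool": {
    "filter": [
      { "terms": { "host.name": ["dc01.example.local", "dc02.example.local"] } }
    ]
  }
}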

I am going to take a look at getting "anomalous process for a windows population" going in our environment. I think that is a really good suggestion, as the anomaly detections we have been using so far are finding a lot of anomalies because a lot of unique stuff happens here. I am wondering if using the population analysis will return better results. My understanding is basically that instead of looking at differences in a single machine's behavior over time, it will compare a single machine to other machines like it. My only question here is whether it will learn the difference between workstations and servers. Then there is other stuff like geographical location. As all of these jobs are unsupervised, I am guessing that is the idea, but I figured I would ask.

I am interested in the logic of the next paragraph, but unsure of how to pull it off at the moment. When I look at the results I am getting in the Anomaly Explorer, I am only seeing information about the "influencers" related to an anomalous event. It seems to be the same in the results index for each of these jobs. What is the best way to correlate the anomaly with the actual event in its original index that caused it to trigger as an anomaly?

For the username anomalies, we have several different types of admin accounts in the environment, and all of them have anomalies that score up to 98. This makes sense in our environment to a certain extent, because staff turnover is fairly high and there are lots of admins administering different things in different places. Does it seem like using an out of the box job in this kind of environment for this particular use case makes sense?

Could you possibly explain the mechanics of the last paragraph a little more? Where do you start? It sounds like a great idea. I am going to have to learn how to use transforms, though. Do you basically identify a host.name that you want to create a secondary index for, select a few data sources (for instance the .siem, .ml-anomalies, and threat intel indices), check them for the presence of that host.name, and then, if it hits, the secondary index is automatically populated with that data? I understand if that is a little too large of a question to tackle here. I really appreciate the idea though. I definitely have to put transforms on our road map.

Glad this helped. So Windows server roles, like the domain controller role, can be enumerated using this query in the OSQuery integration:

select * from windows_optional_features

The resulting events can be turned into building block alerts for correlations. Doing joins like this requires a transform to do the correlation, and I have just shared an example here: examples/Machine Learning/Transforms at master · elastic/examples · GitHub
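To give a sense of the shape, here is a minimal sketch of a pivot transform that rolls up detection alerts per host. This is not the linked example; the .siem-signals index pattern and signal.* field names are version-dependent, and the transform and destination index names are placeholders:

PUT _transform/host_alert_summary_example
{
  "source": { "index": [".siem-signals-*"] },
  "dest": { "index": "host_alert_summary_example" },
  "frequency": "5m",
  "sync": { "time": { "field": "@timestamp", "delay": "120s" } },
  "pivot": {
    "group_by": {
      "host.name": { "terms": { "field": "host.name" } }
    },
    "aggregations": {
      "alert_count": { "value_count": { "field": "signal.rule.rule_id" } },
      "distinct_rules": { "cardinality": { "field": "signal.rule.rule_id" } },
      "distinct_rule_types": { "cardinality": { "field": "signal.rule.type" } },
      "latest_alert": { "max": { "field": "@timestamp" } }
    }
  }
}

A host whose summary row shows several distinct rule types, for example an ML rule plus a threat match rule plus a query rule, is the kind of meta-alert candidate I described earlier.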

I'll share more about creating these kinds of compound alerts with joins in the near future.

On the username anomalies, it may be possible to identify the admin users using OSQuery, and I'll look into that. Another possibility may be to partition the job so that it behaves more like a "rare username for a host" detection. I'll look into that as well and may add such a job to the Windows module.
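To make that concrete, "rare username for a host" would mean a detector along these lines. This is only a sketch, and if a job like this does get added to the Windows module its actual configuration may differ:

{
  "function": "rare",
  "by_field_name": "user.name",
  "partition_field_name": "host.name",
  "detector_description": "user names that are rare for each individual host"
}

Partitioning by host.name means a username is judged as rare or not relative to the specific host it appeared on, rather than across the whole environment.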
