Dec 4th, 2018: [EN][ML] Rarity Analysis with Machine Learning

machine-learning

(rich collier) #1

It is easy to forget about one key capability of our ML: Rarity Analysis.

Finding items that rarely occur is often very useful. Some example use cases are finding:

  • Rarely occurring log messages
  • Rare running processes for a server
  • Rare connection destinations

ML's rare function is only available in the Advanced Job Wizard, but the configuration is relatively simple. For example, if you have data in the form of:

Then the ML job configuration could simply be:

In other words, "find rare ProgramNames for every host individually". Because every host is modeled separately, a process that is routine on one server could be deemed rare on another where it seldom appears.
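As a rough sketch of the detector described above, the snippet below builds a job body in the shape accepted by the Elasticsearch anomaly detection job API: the `rare` function, split by `ProgramName` and partitioned by `host`. The field names come from this post; the `bucket_span` of 15m, the `@timestamp` time field, and the helper name `build_rare_job` are assumptions made purely for illustration, not the settings of the job shown here.

```python
import json


def build_rare_job(by_field="ProgramName", partition_field="host",
                   bucket_span="15m", time_field="@timestamp"):
    """Return a job body for a 'rare' detector (hypothetical helper).

    bucket_span and time_field are illustrative assumptions; adjust them
    to match your own data.
    """
    return {
        "description": "Rare %s per %s" % (by_field, partition_field),
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [
                {
                    # 'rare' needs a by field: the values whose rarity is judged
                    "function": "rare",
                    "by_field_name": by_field,
                    # one independent model per value of the partition field
                    "partition_field_name": partition_field,
                }
            ],
        },
        "data_description": {"time_field": time_field},
    }


job = build_rare_job()
print(json.dumps(job, indent=2))
```

The partition field is what makes each host its own population, so "rare" is always judged against that host's own history rather than against the whole fleet.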

Once analyzed, we could find a situation like this:

Here, the "ftp" process is witnessed on host=files05-dc1.dc1, which is rare for that server. This is perfect for Security Analytics style use cases where one is looking for nefarious behaviors invoked by malicious insiders or malware.

Keep in mind that the rare function is relative - in other words, it takes into account the frequency of other values of the field. So, for example, in the case of the list:

A,B,A,B,A,C,B,A,B,A,C,A,B,C,A,C,B,A,C,B,A,C,X

Here, X is obviously rare. But if the list were:

A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X

X is not obviously rare because everything is rare (and thus nothing is rare).
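To make the "relative" point concrete, here is a toy sketch that compares how often a value occurs with how often the other values occur. This is not the model ML actually uses; the function name `is_relatively_rare` and the "less than half the median count of the other values" threshold are assumptions chosen only to illustrate the idea with the two lists above.

```python
from collections import Counter
from statistics import median


def is_relatively_rare(values, candidate, ratio=0.5):
    """Toy heuristic: candidate is rare if its count is below `ratio`
    times the median count of all the other distinct values."""
    counts = Counter(values)
    others = [c for v, c in counts.items() if v != candidate]
    if not others:
        return False  # nothing to compare against
    return counts[candidate] < ratio * median(others)


skewed = "A,B,A,B,A,C,B,A,B,A,C,A,B,C,A,C,B,A,C,B,A,C,X".split(",")
uniform = "A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X".split(",")

print(Counter(skewed))
print(is_relatively_rare(skewed, "X"))
print(is_relatively_rare(uniform, "X"))
```

In the first list, X appears once against counts of 9, 7, and 6 for A, B, and C, so it stands out. In the second list every value appears exactly once, so X is no rarer than anything else.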

In v6.5, we introduced a new UI component in the Anomaly Explorer to help visualize things that are rare. Here’s an example of what that looks like:

The way to interpret this is that the blue dots in the bottom half of the UI show occurrence rates of field values over time (which is the horizontal dimension, of course). Those that wind up near the bottom are the rarest ones and the selected anomaly (in this case printdialog.exe) will be shown as an enlarged dot in the bottom half of the view (here colored yellow because of its score).

Happy Detecting!