Machine Learning: Rare function not working as expected

Hi all,

I am experimenting with using ML to identify anomalies for rare children of a parent by host. My data looks as follows:

{
    "@timestamp": "2017-01-01T01:01:01.799Z",
    "combined_name": "parent1_child1",
    "child": "child1",
    "parent": "parent1",
    "host": "Host123"
}

If I set up a detector as follows:

"analysis_config": {
    "bucket_span": "30s",
    "detectors": [
      {
        "detector_description": "rare children",
        "function": "rare",
        "by_field_name": "combined_name.keyword",
        "detector_rules": []
      }
    ],
    "influencers": [
      "host.keyword"
    ]
  }

It fails to detect any anomalies. I tried a different approach:

      {
        "detector_description": "rare children",
        "function": "rare",
        "by_field_name": "child.keyword",
        "over_field_name": "parent.keyword",
        "detector_rules": []
      }
    ],
    "influencers": [
      "host.keyword"
    ]
  }

Any help would be appreciated!

Many thanks,

Matt.

Hi Matt,

Before making a recommendation, I thought I'd give you some general background on the rare functionality which will hopefully be useful when thinking about applying it in other cases as well.

First, all our rare functions are relative, i.e. something is only rare if it stands out as rare with respect to the other values of the by field. If, for example, a data set comprises 100 records, each with a unique value for the field x, say x1, x2, ..., x100, then "rare by x" will produce no anomalies because all the values are equally unusual: there is no reason to highlight a bucket that contains x1 rather than, say, x22. This means that if the by field has a long tail of unique, or very unusual, values, we may report no rare values at all.
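To make the "relative" point concrete, here is a toy sketch in Python. It is not the model the product uses: the `relatively_rare` helper and its `ratio` threshold are made up for illustration, and it simply flags values seen in far fewer buckets than the median value.

```python
from collections import defaultdict
from statistics import median


def relatively_rare(records, ratio=0.1):
    """Toy heuristic (an assumption, not the real model): flag values that
    occur in far fewer buckets than the median value does."""
    buckets_per_value = defaultdict(set)
    for bucket, value in records:
        buckets_per_value[value].add(bucket)
    counts = {v: len(b) for v, b in buckets_per_value.items()}
    typical = median(counts.values())
    return sorted(v for v, c in counts.items() if c < ratio * typical)


# 100 records, each with a unique value x1..x100: all are equally unusual.
all_unique = [(i, f"x{i + 1}") for i in range(100)]

# Three values appear in every bucket and one value appears exactly once.
mostly_common = [(b, v) for b in range(100) for v in ("a", "b", "c")]
mostly_common.append((42, "z"))

print(relatively_rare(all_unique))     # []
print(relatively_rare(mostly_common))  # ['z']
```

In the first data set nothing is flagged even though every value occurs only once, which mirrors the x1, x2, ..., x100 example.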

Second, "rare by x" vs "rare by x over y" have somewhat different meanings:

  • "rare by x" looks for time intervals, i.e. buckets, which are especially unusual in that they contain by field values which don't occur in many buckets. A by field value is rare if it occurs in far fewer buckets than is typical. Buckets which contain multiple rare field values are considered especially rare.
  • "rare by x over y" looks for an interaction of y, i.e. all the records for a given value of y in a time bucket, which is especially unusual. Here, we say a value of x is rare only if it is generated by few unique values of y, i.e. it is fairly specific to just that value of y, again compared to typical. As with "rare by x", an interaction is especially unusual if it comprises multiple rare field values.
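The two notions of rarity can be contrasted with a small Python sketch. It only computes the raw quantities each detector type is described as caring about (buckets per child versus distinct parents per child); the sample records and the function names are invented for illustration, and no scoring is attempted.

```python
from collections import defaultdict


def bucket_counts(records):
    """Number of distinct buckets each child occurs in (the "rare by x" view)."""
    seen = defaultdict(set)
    for bucket, parent, child in records:
        seen[child].add(bucket)
    return {child: len(buckets) for child, buckets in seen.items()}


def parent_counts(records):
    """Number of distinct parents each child occurs for (the "rare by x over y" view)."""
    seen = defaultdict(set)
    for bucket, parent, child in records:
        seen[child].add(parent)
    return {child: len(parents) for child, parents in seen.items()}


# Made-up records of the form (bucket, parent, child).
records = []
for bucket in range(10):
    for parent in ("p1", "p2", "p3"):
        records.append((bucket, parent, "common"))
    records.append((bucket, "p1", "only_p1"))
records.append((7, "p2", "one_off"))

print(bucket_counts(records))  # {'common': 10, 'only_p1': 10, 'one_off': 1}
print(parent_counts(records))  # {'common': 3, 'only_p1': 1, 'one_off': 1}
```

Here `only_p1` turns up in every bucket, so it is not rare in the "by" sense, but it is specific to a single parent, which is what the "over" form is sensitive to.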

Going back to your case: when you are doing "rare by parent.child" you are looking for a rare combination of child and parent, which conceptually can be rare because the child is rare, the parent is rare, or the child is rare for the parent. However, I expect you are running into the first point above: there are many unique or very unusual combinations, so nothing stands out as especially rare. When you are doing "rare by child over parent" you are using that different measure of rarity, i.e. looking for children of very few parents.

To aim at just what you are after, i.e. finding rare children of a parent, the best detector is "rare by child partition=parent". This creates a separate rare analysis for each partition, i.e. it'll tell you about buckets where child processes crop up which haven't occurred in many buckets in the past, but narrowed to only those records associated with a given parent. Even with this detector, the data characteristics might be such that nothing is especially unusual, i.e. there may not be many examples of (parent, child) pairs, or even for a single parent there may be many fairly unique children. In these cases we don't say anything stands out as especially unusual. One other thing to bear in mind is that our UI is geared to show highly unusual events and so may not flag up anything as a significant anomaly. You can also check our results indices, which include all results we think are even somewhat unusual.
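For reference, here is a rough sketch of how "rare by child partition=parent" might be written as a job configuration, built as a Python dict and printed as JSON. It assumes the detector property for the partition is `partition_field_name`, and it reuses the field names, bucket span and influencer from the job in the question; none of these are recommended values.

```python
import json

# Sketch of "rare by child partition=parent". The field names, bucket span
# and influencer are copied from the original job, not tuned suggestions.
analysis_config = {
    "bucket_span": "30s",
    "detectors": [
        {
            "detector_description": "rare children per parent",
            "function": "rare",
            "by_field_name": "child.keyword",
            # Assumed property name for the "partition=parent" part.
            "partition_field_name": "parent.keyword",
        }
    ],
    "influencers": ["host.keyword"],
}

print(json.dumps({"analysis_config": analysis_config}, indent=2))
```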

One last thing to mention in the context of rare analysis is that it can be useful to look for entities that are unusual in multiple ways. To achieve this you can set up multiple detectors, each targeting a different way the data can be rare, and use the entity as an influencer. If that entity labels all the records which are unusual for each detector, we notice this and flag up the entity as especially unusual. Applying this thinking to your case, you might run "rare by child partition=parent" and "rare by child over parent", and we'll highlight interactions where "an unusual child occurs for a parent" AND "that child doesn't occur for many other parents". (Again, caveats regarding data characteristics apply.)
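A sketch of that two-detector setup, again as a Python dict printed as JSON. The `rare_detector` helper is hypothetical, the `over_field_name` and `partition_field_name` property names are assumed, and `host.keyword` is kept as the influencer only because it was the one in the original job; substitute whichever entity you want flagged.

```python
import json


def rare_detector(description, by, over=None, partition=None):
    """Hypothetical helper: build one 'rare' detector, omitting unset fields."""
    detector = {
        "detector_description": description,
        "function": "rare",
        "by_field_name": by,
    }
    if over is not None:
        detector["over_field_name"] = over
    if partition is not None:
        detector["partition_field_name"] = partition
    return detector


analysis_config = {
    "bucket_span": "30s",
    "detectors": [
        # "rare by child partition=parent"
        rare_detector("rare child for a parent", "child.keyword",
                      partition="parent.keyword"),
        # "rare by child over parent"
        rare_detector("child seen for few parents", "child.keyword",
                      over="parent.keyword"),
    ],
    # The influencer should be the entity of interest (assumed to be host here).
    "influencers": ["host.keyword"],
}

print(json.dumps({"analysis_config": analysis_config}, indent=2))
```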

Hi Tom,

Thanks for the quick and detailed response. That is much clearer and I now understand the rare concept.

Presumably the accuracy of rare will improve over time as the dataset increases in size and the model refines?

Regards,

Matt.

Hi Matt,

No problem. Yes, what you typically see is that you get a lot of new field values at startup, then things stabilise and you generally see repeated field values and the occasional new one. In these cases, we do get more confident over time that new things are genuinely anomalous.

That said, we also have an ageing mechanism which slowly forgets old data. Eventually the modelling reaches a steady state, which conceptually you can think of as a very long sliding window (albeit with more recent data carrying more weight). The parameter for this isn't currently exposed, but more fine-grained control of aspects of the modelling is something we are thinking about exposing in a future release.
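One way to picture "a very long sliding window with more recent data carrying more weight" is exponential down-weighting by age. The sketch below is only that picture: the exponential form, the `half_life` parameter and its value are assumptions made for illustration, not the actual ageing mechanism or its setting.

```python
def decayed_weight(age_in_buckets, half_life=1000.0):
    """Weight of an observation that is `age_in_buckets` old, assuming
    (for illustration only) that weight halves every `half_life` buckets."""
    return 0.5 ** (age_in_buckets / half_life)


def decayed_count(ages, half_life=1000.0):
    """Effective number of observations once old ones are down-weighted."""
    return sum(decayed_weight(age, half_life) for age in ages)


# Three sightings of the same value: now, 1000 and 2000 buckets ago.
print(decayed_count([0, 1000, 2000]))  # 1.75
```

Under this picture, a value that was common long ago but has not been seen recently gradually counts for less, so it can eventually look rare again.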

Regards,
Tom
