Machine Learning: detecting the largest change of a value

I'm experimenting with Machine Learning in our evaluation and trying to find the population members with the largest absolute change in the value of a field over a time period; that is, the largest max(field) - min(field) within the period.

Looking at the detectors available, I can't seem to find one that does this. The closest I've found is high_varp, but I don't think that's quite what we're after?

Is there something I am missing here?

Thanks!

You cannot combine detectors like this. However, you can create a scripted field that calculates the difference on the fly (name that field delta, for example) and then run ML (e.g. max(delta)) on that new field.

Example: ML Job on Scripted field
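For illustration, a rough sketch of that idea as a script_fields entry on a datafeed. The index pattern, field names (value_a, value_b), and job name (delta_job) are placeholders, not from your setup, and on older versions the endpoint is _xpack/ml/datafeeds/... rather than _ml/datafeeds/...:

PUT _ml/datafeeds/datafeed-delta_job
{
  "job_id": "delta_job",
  "indices": ["my-index-*"],
  "script_fields": {
    "delta": {
      "script": {
        "lang": "painless",
        "source": "doc['value_a'].value - doc['value_b'].value"
      }
    }
  }
}

A detector in the job can then reference "field_name": "delta", e.g. max(delta).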

I'm not hugely familiar with scripted fields, but I'm confused about how a scripted field would help me here. Doesn't a scripted field only add a value to each individual document? I'm not sure how adding fields to the documents helps us here.

Unless it's possible to add scripts at other levels in the process?

To clarify, this is the max difference within one field across documents (see example below), not the max difference between two separate fields in a document.

user  widgets
====  =======
1     202
1     354
1     355
1     1023
2     525643
2     525645

user 1 "widgets" delta = 1023-202 = 821
user 2 "widgets" delta = 525645-525643 = 2

Again, am I missing something?

OK, understood - yes, scripted fields only manipulate fields within a single document. Thanks for providing an example. Since the max delta you're after is computed per user across documents, I believe you're going to have to delve into the world of pipeline aggregations.

Specifically, I think you're going to need to pipeline the following:

  1. a date_histogram (link) aggregation to bucket the data in intervals of time
  2. a terms (link) aggregation to separate the data per user
  3. a max (link) aggregation to find the max number of widgets per user (in a bucket)
  4. a min (link) aggregation to find the min number of widgets per user (in a bucket)
  5. a bucket_script (link) aggregation to find the difference between max and min, per user (in a bucket)

An example search aggregation (using a play dataset that can be found here):

POST farequote-*/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": {
          "max": {
            "field": "@timestamp"
          }
        },
        "airlines": {
          "terms": {
            "field": "airline",
            "size": 200,
            "order": {
              "_count": "desc"
            }
          },
          "aggregations": {
            "max": {
              "max": {
                "field": "responsetime"
              }
            },
            "min": {
              "min": {
                "field": "responsetime"
              }
            },
            "max_delta": {
              "bucket_script": {
                "buckets_path": {
                  "maxval": "max",
                  "minval": "min"
                },
                "script": "params.maxval - params.minval"
              }
            }
          }
        }
      }
    }
  }
}

The output looks like:

{
  "took" : 61,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 86274,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "buckets" : {
      "buckets" : [
        {
          "key_as_string" : "2017-02-07T00:00:00.000Z",
          "key" : 1486425600000,
          "doc_count" : 17211,
          "@timestamp" : {
            "value" : 1.486511998E12,
            "value_as_string" : "2017-02-07T23:59:58.000Z"
          },
          "airlines" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "AWE",
                "doc_count" : 1718,
                "min" : {
                  "value" : 16.769500732421875
                },
                "max" : {
                  "value" : 23.477800369262695
                },
                "max_delta" : {
                  "value" : 6.70829963684082
                }
              },
              {
                "key" : "AAL",
                "doc_count" : 1715,
                "min" : {
                  "value" : 22.50950050354004
                },
                "max" : {
                  "value" : 182.12440490722656
                },
                "max_delta" : {
                  "value" : 159.61490440368652
                }
              },
              {
                "key" : "UAL",
                "doc_count" : 1158,
                "min" : {
                  "value" : 6.731100082397461
                },
                "max" : {
                  "value" : 13.200699806213379
                },
                "max_delta" : {
                  "value" : 6.469599723815918
                }
              },
...

Assuming you can do all of the above, you'll then need to adapt the search slightly so that your ML job's datafeed can leverage these aggregations. See (link) and (link).
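To make that concrete, here's a minimal sketch of a population job plus a datafeed embedding the aggregation above, based on the farequote play dataset (adapt the index, field, and job names to your own data; on older versions the endpoints are _xpack/ml/... rather than _ml/...). The essentials are summary_count_field_name set to doc_count on the job, the max aggregation on the time field in the datafeed, and a detector reading the max_delta bucket-script output:

PUT _ml/anomaly_detectors/max_response_delta
{
  "analysis_config": {
    "bucket_span": "1d",
    "summary_count_field_name": "doc_count",
    "detectors": [
      {
        "function": "max",
        "field_name": "max_delta",
        "over_field_name": "airline"
      }
    ],
    "influencers": [ "airline" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}

PUT _ml/datafeeds/datafeed-max_response_delta
{
  "job_id": "max_response_delta",
  "indices": [ "farequote-*" ],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": {
          "max": { "field": "@timestamp" }
        },
        "airline": {
          "terms": { "field": "airline", "size": 200 },
          "aggregations": {
            "max": { "max": { "field": "responsetime" } },
            "min": { "min": { "field": "responsetime" } },
            "max_delta": {
              "bucket_script": {
                "buckets_path": { "maxval": "max", "minval": "min" },
                "script": "params.maxval - params.minval"
              }
            }
          }
        }
      }
    }
  }
}

With aggregated input, the detector's field names come from the aggregation names (hence the terms aggregation is named airline here, so it can serve as the over field), and the bucket_span should line up with the date_histogram interval.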

Wow! Okay thanks for the detailed explanation! Pipeline aggregations have always seemed a bit scary... I'll have a go at this and let you know how I get on.

Thanks once again :slight_smile:

No worries - it might also be worth keeping it simple and just trying high_varp, splitting on user. I would think it comes close to representing what you want.
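For reference, that simpler detector could look like this (using the user/widgets fields from your example; over_field_name makes it a population analysis, per your original question):

{
  "function": "high_varp",
  "field_name": "widgets",
  "over_field_name": "user"
}

high_varp flags population members with unusually high variance, which tends to correlate with a large max-min spread even though it isn't the same measure.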

I've got the pipeline aggregation working in a simple search, but there doesn't seem to be any way to create an ML job with a detector whose field_name refers to what I've just created.

Not sure this is actually possible?

It would seem to me to be a common enough use-case that it ought to have its own detector?

See (link) and (link)

Thanks! That first link really helped. I think my problem is that I've been trying to use Job Management in Kibana to do this rather than going directly to the JSON; there doesn't seem to be a way to define aggregations through the interface.

Although it works for small datasets, sadly I think the cardinality of my user field (approx. 40,000) is now too great to run the terms aggregation across all users - it exceeds the memory limits too quickly, making this approach impractical.

Once again - I think there is a strong need for "spread", "high_spread", and "low_spread" detectors here!

And thanks for your help :slight_smile:
