Ok, understood - yes, scripted fields manipulate fields within a document. Thanks for providing an example - so if what you're after is the max delta (per user), then I believe you're going to have to delve into the world of pipeline aggregations.
Specifically, I think you're going to need to pipeline the following:
- a `date_histogram` (link) aggregation to bucket the data in intervals of time
- a `terms` (link) aggregation to separate the data per user
- a `max` (link) aggregation to find the max number of widgets per user (in a bucket)
- a `min` (link) aggregation to find the min number of widgets per user (in a bucket)
- a `bucket_script` (link) aggregation to find the difference between max and min, per user (in a bucket)
An example search aggregation (using a play dataset that can be found here):
```
POST farequote-*/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": {
          "max": {
            "field": "@timestamp"
          }
        },
        "airlines": {
          "terms": {
            "field": "airline",
            "size": 200,
            "order": {
              "_count": "desc"
            }
          },
          "aggregations": {
            "max": {
              "max": {
                "field": "responsetime"
              }
            },
            "min": {
              "min": {
                "field": "responsetime"
              }
            },
            "max_delta": {
              "bucket_script": {
                "buckets_path": {
                  "maxval": "max",
                  "minval": "min"
                },
                "script": "params.maxval - params.minval"
              }
            }
          }
        }
      }
    }
  }
}
```
The output looks like:
```
{
  "took" : 61,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 86274,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "buckets" : {
      "buckets" : [
        {
          "key_as_string" : "2017-02-07T00:00:00.000Z",
          "key" : 1486425600000,
          "doc_count" : 17211,
          "@timestamp" : {
            "value" : 1.486511998E12,
            "value_as_string" : "2017-02-07T23:59:58.000Z"
          },
          "airlines" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "AWE",
                "doc_count" : 1718,
                "min" : {
                  "value" : 16.769500732421875
                },
                "max" : {
                  "value" : 23.477800369262695
                },
                "max_delta" : {
                  "value" : 6.70829963684082
                }
              },
              {
                "key" : "AAL",
                "doc_count" : 1715,
                "min" : {
                  "value" : 22.50950050354004
                },
                "max" : {
                  "value" : 182.12440490722656
                },
                "max_delta" : {
                  "value" : 159.61490440368652
                }
              },
              {
                "key" : "UAL",
                "doc_count" : 1158,
                "min" : {
                  "value" : 6.731100082397461
                },
                "max" : {
                  "value" : 13.200699806213379
                },
                "max_delta" : {
                  "value" : 6.469599723815918
                }
              },
              ...
```
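Incidentally, if the full response is more than you need, the standard `filter_path` request parameter can trim it down to just the bucket keys and deltas. A minimal usage sketch (same request body as the search above):

```
POST farequote-*/_search?filter_path=aggregations.buckets.buckets.key_as_string,aggregations.buckets.buckets.airlines.buckets.key,aggregations.buckets.buckets.airlines.buckets.max_delta
# ... followed by the same request body as the example above
```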
Assuming you can do all of the above, you'll then need to adapt it a little in order to allow your ML job to leverage these aggregations. See (link) and (link)
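To give a rough idea of what that adaptation could look like, here's a minimal sketch. The endpoint paths are 7.x-style `_ml` ones (older versions used `_xpack/ml`), and the job/datafeed names `max_delta_job` / `datafeed-max_delta_job` are hypothetical. Two things to note: the `max` sub-aggregation on `@timestamp` in the search above is there because the datafeed needs to know each bucket's latest timestamp, and the `terms` aggregation is renamed from `airlines` to `airline` so it matches the detector's `by_field_name`:

```
# Hypothetical job: detect anomalously large per-airline max/min deltas.
# summary_count_field_name must be "doc_count" when the datafeed uses aggregations.
PUT _ml/anomaly_detectors/max_delta_job
{
  "analysis_config": {
    "bucket_span": "1d",
    "summary_count_field_name": "doc_count",
    "detectors": [
      {
        "function": "max",
        "field_name": "max_delta",
        "by_field_name": "airline"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}

# Hypothetical datafeed: same aggregation tree as the search above, with the
# terms aggregation renamed to "airline" to match the detector's by_field_name.
PUT _ml/datafeeds/datafeed-max_delta_job
{
  "job_id": "max_delta_job",
  "indices": [ "farequote-*" ],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": {
          "max": { "field": "@timestamp" }
        },
        "airline": {
          "terms": { "field": "airline", "size": 200 },
          "aggregations": {
            "max": { "max": { "field": "responsetime" } },
            "min": { "min": { "field": "responsetime" } },
            "max_delta": {
              "bucket_script": {
                "buckets_path": { "maxval": "max", "minval": "min" },
                "script": "params.maxval - params.minval"
              }
            }
          }
        }
      }
    }
  }
}
```

The `date_histogram` interval must divide evenly into the job's `bucket_span` (here both are a day), and the leaf aggregation names become the field names the detectors see - hence `max_delta` as the detector's `field_name`.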