ML datafeed bucket aggregation/script challenge

rcowart · February 17, 2022, 10:45pm

I would like to create an anomaly detection job for the ratio of two counts. The logic is as follows:

A = count filtered by condition X
B = count filtered by condition Y
C = A/B

The datafeed would need to produce C as a field to be used in the analysis_config.

I have been trying to find some combination of bucket aggregations, a bucket_script, or scripted fields that would work in a datafeed. The first challenge is whether a single query can produce two bucket aggregation values, each derived from a different filter condition. If I can get those two values, a bucket_script would make the math easy.

Does anyone have any tips on how something like this can be done?

richcollier · February 18, 2022, 12:12pm

Hi Rob - this should help: Analyzing a ratio of documents over time with Anomaly Detection
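For readers who cannot follow the link, here is a minimal sketch of the general pattern, not necessarily the exact configuration from that post: two filter aggregations inside the date histogram supply the counts A and B, and a bucket_script divides them to produce C. The job name ratio_sketch, the index my-index, the field status, its values x and y, and the aggregation names count_a, count_b, and ratio are all hypothetical placeholders. Returning 0 when B is zero is an assumption to adapt to your data.

```console
# Sketch only: job name, index, field, and filter values are placeholders.
PUT _ml/anomaly_detectors/ratio_sketch
{
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [{
      "function": "mean",
      "field_name": "ratio"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "datafeed_config": {
    "indices": ["my-index"],
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "5m",
          "time_zone": "UTC"
        },
        "aggregations": {
          "@timestamp": {
            "max": {"field": "@timestamp"}
          },
          "count_a": {
            "filter": {"term": {"status": "x"}}
          },
          "count_b": {
            "filter": {"term": {"status": "y"}}
          },
          "ratio": {
            "bucket_script": {
              "buckets_path": {
                "a": "count_a>_count",
                "b": "count_b>_count"
              },
              "script": "params.b > 0 ? params.a / params.b : 0"
            }
          }
        }
      }
    }
  }
}
```

The detector reads ratio by name, so the bucket_script aggregation name and the detector's field_name have to match.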

rcowart · February 18, 2022, 12:54pm

@richcollier that was a big help. Thanks!

rcowart · March 6, 2022, 4:48pm

Is there any way to get a field into the aggregation-based datafeed that can be used as a partition_field? It is possible to add levels of aggregation that create the buckets. However, I haven't been able to get the bucket keys exposed as a field that can be referenced in the detector config.

richcollier · March 7, 2022, 5:50pm

Robert, here's an example using the demo dataset farequote:

(Note: I set the terms agg size to 100, but there are only about 19 different airlines.)

PUT _ml/anomaly_detectors/farequote_terms_agg
{
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "partition_field_name": "airline"
    }],
    "influencers": ["airline"],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "datafeed_config": {
    "indices": ["farequote"],
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "5m",
          "time_zone": "UTC"
        },
        "aggregations": {
          "@timestamp": {
            "max": {"field": "@timestamp"}
          },
          "airline": {
            "terms": {
              "field": "airline",
              "size": 100
            },
            "aggregations": {
              "responsetime": {
                "avg": {
                  "field": "responsetime"
                }
              }
            }
          }
        }
      }
    }
  }
}

rcowart · March 7, 2022, 7:47pm

That worked. The location in the hierarchy of aggregations is important. The key here was that it needs to be at the same level as @timestamp, i.e. "inside" the date histogram.
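To illustrate that placement point together with the earlier ratio question, here is a hedged sketch of only the aggregations part of a datafeed. It is not the configuration referred to in this thread and has not been verified against a specific Elasticsearch version. The airline field comes from the farequote example above. The status field, its values x and y, and the names count_a, count_b, and ratio are hypothetical. The terms aggregation sits next to @timestamp inside the date histogram, and the per-airline counts and their ratio sit inside the terms aggregation.

```console
{
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": {
          "max": {"field": "@timestamp"}
        },
        "airline": {
          "terms": {
            "field": "airline",
            "size": 100
          },
          "aggregations": {
            "count_a": {
              "filter": {"term": {"status": "x"}}
            },
            "count_b": {
              "filter": {"term": {"status": "y"}}
            },
            "ratio": {
              "bucket_script": {
                "buckets_path": {
                  "a": "count_a>_count",
                  "b": "count_b>_count"
                },
                "script": "params.b > 0 ? params.a / params.b : 0"
              }
            }
          }
        }
      }
    }
  }
}
```

Under these assumptions, a detector would use "field_name": "ratio" with "partition_field_name": "airline", because the terms aggregation name becomes the field that carries each bucket key.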

rcowart · March 7, 2022, 9:35pm

Here is the final result:
