ML datafeed bucket aggregation/script challenge

I would like to create an anomaly detection job for the ratio of two counts. The logic is as follows...

A = count filtered by condition X
B = count filtered by condition Y
C = A/B

The datafeed would need to produce C as a field to be used in the analysis_config.

I have been trying to figure out some combination of bucket aggregations, bucket scripts, or scripted fields that would work as a datafeed. The first challenge is whether it is even possible to write a query that produces two bucket aggregation values, each derived from a different filter condition. If I can get these two values, a bucket_script would make the math easy enough.

Does anyone have any tips on how something like this can be done?

Hi Rob - this should help: Analyzing a ratio of documents over time with Anomaly Detection
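
For readers who cannot follow the link, here is a minimal sketch of the general shape such a query can take. It is not copied from the linked article. It assumes a hypothetical index `my-index` with a keyword field `status`, where condition X is `status: x` and condition Y is `status: y`; the aggregation names `count_x`, `count_y`, and `ratio` are made up for illustration. Each `filter` aggregation counts the documents matching one condition inside each date histogram bucket, and a `bucket_script` divides the two counts.

```json
GET my-index/_search
{
  "size": 0,
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": {
          "max": {"field": "@timestamp"}
        },
        "count_x": {
          "filter": {"term": {"status": "x"}}
        },
        "count_y": {
          "filter": {"term": {"status": "y"}}
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "a": "count_x>_count",
              "b": "count_y>_count"
            },
            "script": "params.a / params.b"
          }
        }
      }
    }
  }
}
```

Running this as a plain search is a quick way to check that every bucket returns a `ratio` value before moving the same aggregation tree into a datafeed, where a detector could then reference the field named `ratio`. Buckets where the second count is zero may need extra handling in the script.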

@richcollier that was a big help. Thanks!


Is there any way to get a field into the aggregation-based datafeed that can be used as a partition_field? It is possible to add levels of aggregation that create the buckets. However, I haven't been able to get the bucket keys exposed as a field that can be accessed in the detector config.

Robert, here's an example using the demo dataset farequote:

(Note: I used a terms agg of size 100, but there are only about 19 different airlines.)

PUT _ml/anomaly_detectors/farequote_terms_agg
{
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "partition_field_name": "airline"
    }],
    "influencers": ["airline"],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "datafeed_config": {
    "indices": ["farequote"],
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "5m",
          "time_zone": "UTC"
        },
        "aggregations": {
          "@timestamp": {
            "max": {"field": "@timestamp"}
          },
          "airline": {
            "terms": {
              "field": "airline",
              "size": 100
            },
            "aggregations": {
              "responsetime": {
                "avg": {
                  "field": "responsetime"
                }
              }
            }
          }
        }
      }
    }
  }
}

That worked. The location in the hierarchy of aggregations is important. The key was that the terms aggregation needs to be at the same level as @timestamp, i.e. "inside" the date histogram.

Here is the final result:

