Using bucket_selector in anomaly detection datafeed

Is the use of bucket_selector supported in a datafeed for an anomaly detection job?

The problem
If I set up a job whose datafeed contains a bucket_selector, the job processes only a few initial buckets and then effectively stops processing further buckets. If I leave the bucket_selector out of the datafeed, the job processes all the historic data and catches up to real time.
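To see where the stalled datafeed has got to, the datafeed and job stats APIs can be queried (the ids here match the definitions further down; this is just how I inspected the progress, the responses are omitted):

```json
GET _ml/datafeeds/datafeed-my-job-with-bucket-selector/_stats

GET _ml/anomaly_detectors/my-job-with-bucket-selector/_stats
```

The job stats show the latest record timestamp that was actually processed, which stops advancing shortly after the datafeed with the bucket_selector is started.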

In both cases the preview of the datafeed is similar and looks like this:

 [
  {
    "@timestamp": 1697940899098,
    "machine": "826_SPS",
    "RejectRate": 0.023383768913342505,
    "doc_count": 2181
  },
  {
    "@timestamp": 1697940899098,
    "machine": "876_LPS",
    "RejectRate": 0.03416856492027335,
    "doc_count": 439
  }
]
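For reference, the preview above was obtained with the datafeed preview API (using the datafeed id defined below):

```json
GET _ml/datafeeds/datafeed-my-job-with-bucket-selector/_preview
```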

The detector is high_mean(RejectRate), analysed in 15-minute buckets. During some 15-minute buckets the number of documents (items handled per machine) is low, and the average RejectRate is then likely to trigger spurious anomalies. That is why I wanted to filter out buckets having fewer than 100 documents per machine.
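Independently of the ML job, the same aggregation can be run directly against the index to confirm that the bucket_selector keeps only the intended machine buckets (a verification sketch using the index pattern and fields from the datafeed below):

```json
POST my-machine-data-*/_search
{
  "size": 0,
  "aggs": {
    "buckets": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "15m" },
      "aggs": {
        "machine": {
          "terms": { "field": "machine.keyword", "size": 30 },
          "aggs": {
            "RejectRate": { "avg": { "field": "RejectRate" } },
            "machine_filter": {
              "bucket_selector": {
                "buckets_path": { "machine_min_doc_count": "_count" },
                "script": "params.machine_min_doc_count >= 100"
              }
            }
          }
        }
      }
    }
  }
}
```

Run on its own like this, the response contains only machine buckets with at least 100 documents, which is the behaviour I expected the datafeed to have as well.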

Anomaly job

PUT _ml/anomaly_detectors/my-job-with-bucket-selector
{
  "description": "Monitoring the RejectRate for 15-minute buckets per machine. Each 15-minute period is only evaluated if doc_count is >= 100.",
  "analysis_config": {
    "bucket_span": "15m",
    "summary_count_field_name": "doc_count",
    "detectors": [
      {
        "function": "high_mean",
        "field_name": "RejectRate",
        "by_field_name": "machine",
        "detector_description": "High Mean of RejectRate field (with a minimum count of 100 observations per 15 minutes)."
      }
    ],
    "influencers": [ "machine" ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  }
}

Anomaly datafeed

PUT _ml/datafeeds/datafeed-my-job-with-bucket-selector
{
  "job_id": "my-job-with-bucket-selector",
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "exists": {
            "field": "RejectRate"
          }
        }
      ]
    }
  },
  "indices": [
    "my-machine-data-*"
  ],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "15m",
        "time_zone": "UTC"
      },
      "aggs": {
        "@timestamp": {
          "max": {
            "field": "@timestamp"
          }
        },
        "machine": {
          "terms": {
            "field": "machine.keyword",
            "size": 30
          },
          "aggs": {
            "RejectRate": {
              "avg": {
                "field": "RejectRate"
              }
            },
            "machine_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "machine_min_doc_count": "_count"
                },
                "script": "params.machine_min_doc_count >= 100"
              }
            }
          }
        }
      }
    }
  },
  "scroll_size": 1000,
  "delayed_data_check_config": {
    "enabled": true
  },
  "query_delay": "120s",
  "chunking_config": {
    "mode": "manual",
    "time_span": "15m"
  }
}
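The job is then run in the usual way by opening it and starting the datafeed (the start time here is only a placeholder for the beginning of the historic data):

```json
POST _ml/anomaly_detectors/my-job-with-bucket-selector/_open

POST _ml/datafeeds/datafeed-my-job-with-bucket-selector/_start
{
  "start": "2023-01-01T00:00:00Z"
}
```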

When I set up a job without the machine_filter (just by removing that part of the aggregation from the datafeed), I get a well-working job, except that the buckets with very few records are also included and analysed, which is exactly what I would like to avoid.

Any ideas on how to ignore these buckets? How do I adjust the datafeed?

I also tried a composite aggregation instead of the nested aggregation shown above, but the outcome was the same.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.