Using bucket_selector in anomaly detection datafeed

Is the use of bucket_selector supported in a datafeed for an anomaly detection job?

The problem
If I set up a job whose datafeed contains a bucket_selector, the job processes only a few initial buckets and then effectively stops processing further buckets. If I leave the bucket_selector out of the datafeed, the job processes all the historic data and catches up to real time.
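To see where the stalled datafeed has got to, the datafeed and job stats APIs can be queried (the ids here match the definitions further down; this is just how I inspected the progress, the responses are omitted):

```json
GET _ml/datafeeds/datafeed-my-job-with-bucket-selector/_stats

GET _ml/anomaly_detectors/my-job-with-bucket-selector/_stats
```

The job stats show the latest record timestamp that was actually processed, which stops advancing shortly after the datafeed with the bucket_selector is started.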

In both cases the preview of the datafeed is similar and looks like this:

 [
  {
    "@timestamp": 1697940899098,
    "machine": "826_SPS",
    "RejectRate": 0.023383768913342505,
    "doc_count": 2181
  },
  {
    "@timestamp": 1697940899098,
    "machine": "876_LPS",
    "RejectRate": 0.03416856492027335,
    "doc_count": 439
  }
]
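For reference, the preview above was obtained with the datafeed preview API (using the datafeed id defined below):

```json
GET _ml/datafeeds/datafeed-my-job-with-bucket-selector/_preview
```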

The detector is high_mean(RejectRate), analysed in 15-minute buckets. During some 15-minute buckets the number of documents (items handled per machine) is low, and the average RejectRate is then likely to trigger spurious anomalies. That is why I wanted to filter out buckets having fewer than 100 documents per machine.
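Independently of the ML job, the same aggregation can be run directly against the index to confirm that the bucket_selector keeps only the intended machine buckets (a verification sketch using the index pattern and fields from the datafeed below):

```json
POST my-machine-data-*/_search
{
  "size": 0,
  "aggs": {
    "buckets": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "15m" },
      "aggs": {
        "machine": {
          "terms": { "field": "machine.keyword", "size": 30 },
          "aggs": {
            "RejectRate": { "avg": { "field": "RejectRate" } },
            "machine_filter": {
              "bucket_selector": {
                "buckets_path": { "machine_min_doc_count": "_count" },
                "script": "params.machine_min_doc_count >= 100"
              }
            }
          }
        }
      }
    }
  }
}
```

Run on its own like this, the response contains only machine buckets with at least 100 documents, which is the behaviour I expected the datafeed to have as well.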

Anomaly job

PUT _ml/anomaly_detectors/my-job-with-bucket-selector
{
  "description": "Monitoring the RejectRate for 15-minute buckets per machine. Each 15-minute period is only evaluated if doc_count is >= 100.",
  "analysis_config": {
    "bucket_span": "15m",
    "summary_count_field_name": "doc_count",
    "detectors": [
      {
        "function": "high_mean",
        "field_name": "RejectRate",
        "by_field_name": "machine",
        "detector_description": "High Mean of RejectRate field (with a minimum count of 100 observations per 15 minutes)."
      }
    ],
    "influencers": [ "machine" ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  }
}

Anomaly datafeed

PUT _ml/datafeeds/datafeed-my-job-with-bucket-selector
{
  "job_id": "my-job-with-bucket-selector",
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "exists": {
            "field": "RejectRate"
          }
        }
      ]
    }
  },
  "indices": [
    "my-machine-data-*"
  ],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "15m",
        "time_zone": "UTC"
      },
      "aggs": {
        "@timestamp": {
          "max": {
            "field": "@timestamp"
          }
        },
        "machine": {
          "terms": {
            "field": "machine.keyword",
            "size": 30
          },
          "aggs": {
            "RejectRate": {
              "avg": {
                "field": "RejectRate"
              }
            },
            "machine_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "machine_min_doc_count": "_count"
                },
                "script": "params.machine_min_doc_count >= 100"
              }
            }
          }
        }
      }
    }
  },
  "scroll_size": 1000,
  "delayed_data_check_config": {
    "enabled": true
  },
  "query_delay": "120s",
  "chunking_config": {
    "mode": "manual",
    "time_span": "15m"
  }
}
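The job is then run in the usual way by opening it and starting the datafeed (the start time here is only a placeholder for the beginning of the historic data):

```json
POST _ml/anomaly_detectors/my-job-with-bucket-selector/_open

POST _ml/datafeeds/datafeed-my-job-with-bucket-selector/_start
{
  "start": "2023-01-01T00:00:00Z"
}
```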

When I set up a job without the machine_filter (just by removing that part of the aggregation from the datafeed), I get a well-working job, except that the buckets with very few records are also included and analysed, which is exactly what I would like to avoid.

Any ideas on how to ignore these buckets? How do I adjust the datafeed?

I also tried a composite aggregation instead of the nested aggregation shown above, but the outcome was the same.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.