Is the use of bucket_selector supported in a datafeed for an anomaly detection job?
The problem
If I set up a job with a bucket_selector in the datafeed, the job only processes a few initial buckets and then stops processing any more. If I leave the bucket_selector out of the datafeed, the job processes all the historic data and catches up to real time.
In both cases the preview of the datafeed is similar and looks like this:
[
{
"@timestamp": 1697940899098,
"machine": "826_SPS",
"RejectRate": 0.023383768913342505,
"doc_count": 2181
},
{
"@timestamp": 1697940899098,
"machine": "876_LPS",
"RejectRate": 0.03416856492027335,
"doc_count": 439
}
]
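For reference, the preview above comes from the datafeed preview API (the datafeed name is assumed to match the config below):

```json
GET _ml/datafeeds/datafeed-my-job-with-bucket-selector/_preview
```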
The detector is high_mean(RejectRate), which is analysed over 15-minute buckets. During some 15-minute buckets the number of documents (items handled per machine) is low, and in these situations the average RejectRate is likely to cause anomalies. That is why I wanted to filter out buckets having fewer than 100 documents per machine.
anomaly job
PUT _ml/anomaly_detectors/my-job-with-bucket-selector
{
"description": "Monitoring the RejectRate for 15-minute buckets per machine. Each 15-minute period is only evaluated if doc_count is >= 100.",
"analysis_config": {
"bucket_span": "15m",
"summary_count_field_name": "doc_count",
"detectors": [
{
"function": "high_mean",
"field_name": "RejectRate",
"by_field_name": "machine",
"detector_description": "High Mean of RejectRate field (with a minimum count of 100 observations per 15 minutes)."
}
],
"influencers": [ "machine" ]
},
"data_description": {
"time_field": "@timestamp",
"time_format": "epoch_ms"
}
}
anomaly datafeed
PUT _ml/datafeeds/datafeed-my-job-with-bucket-selector
{
"job_id": "my-job-with-bucket-selector",
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": [
{
"exists": {
"field": "RejectRate"
}
}
]
}
},
"indices": [
"my-machine-data-*"
],
"aggregations": {
"buckets": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "15m",
"time_zone": "UTC"
},
"aggs": {
"@timestamp": {
"max": {
"field": "@timestamp"
}
},
"machine": {
"terms": {
"field": "machine.keyword",
"size": 30
},
"aggs": {
"RejectRate": {
"avg": {
"field": "RejectRate"
}
},
"machine_filter": {
"bucket_selector": {
"buckets_path": {
"machine_min_doc_count": "_count"
},
"script": "params.machine_min_doc_count > 100"
}
}
}
}
}
}
},
"scroll_size": 1000,
"delayed_data_check_config": {
"enabled": true
},
"query_delay": "120s",
"chunking_config": {
"mode": "manual",
"time_span": "15m"
}
}
When I set up a job without the machine_filter (just by removing that part of the aggregation in the datafeed), I get a well-working job, except that some buckets with very few records are also included and analyzed, which I would like to avoid.
Any ideas on how to ignore these buckets? How do I adjust the datafeed?
I also tried a composite aggregation instead of the nested aggregation shown above, but the outcome was the same.
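For completeness, the composite variant I tried looked roughly like this (same fields and bucket_selector as above; paging size chosen arbitrarily):

```json
"aggregations": {
  "buckets": {
    "composite": {
      "size": 1000,
      "sources": [
        { "time_bucket": { "date_histogram": { "field": "@timestamp", "fixed_interval": "15m" } } },
        { "machine": { "terms": { "field": "machine.keyword" } } }
      ]
    },
    "aggs": {
      "@timestamp": { "max": { "field": "@timestamp" } },
      "RejectRate": { "avg": { "field": "RejectRate" } },
      "machine_filter": {
        "bucket_selector": {
          "buckets_path": { "machine_min_doc_count": "_count" },
          "script": "params.machine_min_doc_count >= 100"
        }
      }
    }
  }
}
```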