Updating Datafeed didn't affect anomaly detection results

Hi all,

I needed to update the datafeed on one of my anomaly detection jobs. I stopped the datafeed, updated it by adding an appropriate term to the bool query, and started the datafeed again. The problem is that the anomaly detector was acting as if it were still using the previous datafeed. I restarted the anomaly detection job, but that didn't help either. I had to delete both the datafeed and the anomaly detection job and recreate them, and only then did it finally work.

Am I doing something wrong? What's the purpose of updating a datafeed if it doesn't take effect?

Anomaly detector:

PUT _xpack/ml/anomaly_detectors/clicks-by-affiliate-campaign
{
  "job_id": "clicks-by-affiliate-campaign",
  "description": "Clicks by affiliate campaign",
  "analysis_config": {
    "bucket_span": "1d",
    "detectors": [
      {
        "detector_description": "Clicks by affiliate campaign",
        "function": "count",
        "by_field_name": "affiliate_campaign",
        "detector_index": 0
      }
    ],
    "influencers": []
  },
  "data_description": {
    "time_field": "date",
    "time_format": "epoch_ms"
  }
}

Datafeed (the "inbound = true" term was added):

PUT _xpack/ml/datafeeds/clicks-by-affiliate-campaign
{
  "datafeed_id": "clicks-by-affiliate-campaign",
  "job_id": "clicks-by-affiliate-campaign",
  "indices": [
    "hitpath_clicks*"
  ],
  "types": [],
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "from": "now-30d",
              "to": null,
              "include_lower": true,
              "include_upper": true,
              "boost": 1
            }
          }
        },
        {
          "term": {
            "inbound": {
              "value": true,
              "boost": 1
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "affiliate_id": {
              "value": 0,
              "boost": 1
            }
          }
        },
        {
          "term": {
            "campaign_id": {
              "value": 0,
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "script_fields": {
    "affiliate_campaign": {
      "script": {
        "source": "doc['affiliate_id'].value + ' - ' + doc['affiliate_name.keyword'].value + ' | ' + doc['campaign_id'].value + ' - ' + doc['campaign_name.keyword'].value",
        "lang": "painless"
      },
      "ignore_failure": false
    }
  },
  "scroll_size": 1000,
  "chunking_config": {
    "mode": "auto"
  },
  "delayed_data_check_config": {
    "enabled": true
  }
}
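
For reference, a stopped datafeed can be edited in place with the datafeed update endpoint. A minimal sketch against the same 6.x `_xpack` API used above, showing only the changed query (abbreviated to the relevant clauses):

```
POST _xpack/ml/datafeeds/clicks-by-affiliate-campaign/_stop

POST _xpack/ml/datafeeds/clicks-by-affiliate-campaign/_update
{
  "query": {
    "bool": {
      "must": [
        { "range": { "date": { "gte": "now-30d" } } },
        { "term": { "inbound": true } }
      ],
      "must_not": [
        { "term": { "affiliate_id": 0 } },
        { "term": { "campaign_id": 0 } }
      ]
    }
  }
}

POST _xpack/ml/datafeeds/clicks-by-affiliate-campaign/_start
```

The update body is partial: only the fields you send (here, `query`) are replaced, and the change takes effect the next time the datafeed starts.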

This one is going to be hard, perhaps impossible, to assist with unless there's a way for you to prove that it was truly still acting as it did previously.

I tried to reproduce your situation (I'm using v7.6) and could not. I used a sample data set called "farequote", ran a count detector partitioned on airline for just one day, then stopped the job. I edited the datafeed query to be:

{
  "bool": {
    "must": [
      {
        "match_all": {}
      }
    ],
    "filter": [
      {
        "match_phrase": {
          "airline": {
            "query": "AAL"
          }
        }
      }
    ]
  }
}

I then continued on the rest of the data. The end result was that the datafeed only processed data for airline:AAL for the rest of the job's duration (the full dataset is 86,274 records; the filtered run processed 24,224).

I can also see where all of the other airlines threw anomalies when their count suddenly dropped to zero (because they were all getting filtered out):

[screenshot omitted]

(the red circle marks when the filtering was put into place to remove all airlines except AAL)

In other words, it seems to work just fine!

Ohh, so that's the expected behavior? Looks like it was just my lack of understanding of how datafeeds work. The problem was exactly that: the records I wanted to filter out were treated as an anomaly, because their count dropped to 0. I assumed they wouldn't be taken into consideration at all.

Is there any way to change a datafeed and make anomaly detection "forget" previously analyzed data, other than recreating the job?

Indeed yes - that is exactly what is expected to happen. The datafeed merely presents data to the algorithms to be modeled and analyzed.

You have two choices:

  1. Rebuild the job with the proper filtering in place from the start

  2. Just allow time to pass - the models will relatively quickly forget about those entities that no longer show up.
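
For option 1, the rebuild amounts to tearing down both objects and re-issuing the two PUT requests from the original post, with the corrected query in place. A sketch using the job and datafeed ids from above (in 6.x the datafeed must be deleted before the job can be):

```
POST _xpack/ml/datafeeds/clicks-by-affiliate-campaign/_stop

DELETE _xpack/ml/datafeeds/clicks-by-affiliate-campaign

POST _xpack/ml/anomaly_detectors/clicks-by-affiliate-campaign/_close

DELETE _xpack/ml/anomaly_detectors/clicks-by-affiliate-campaign
```

After that, recreate the job and the updated datafeed with PUT requests like the ones earlier in this thread; the new job starts with fresh models that have never seen the filtered-out entities.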


Ok, everything is clear now. Thanks a lot!