Why do I have results without an actual value?

Hello, I'd like to understand why the actual field may not be present in the anomaly results (see the end of this post).

Here is the scenario:
I create a job with a daily datafeed and start it with no end date. For all the historical values I get daily results, but now that I am waiting for real-time data, I can see that real time is not processed the same way:

the data

Date RUNDOWN_TIME
January 9th 2019 11:34:48.000 236
January 9th 2019 09:49:18.000 238
January 9th 2019 08:58:19.000 236
January 9th 2019 00:05:59.000 237
January 8th 2019 14:19:07.000 236
January 8th 2019 04:48:45.000 235
January 7th 2019 21:34:08.000 236
January 7th 2019 16:40:07.000 237
January 7th 2019 12:22:27.000 236
January 7th 2019 00:34:29.000 240
January 6th 2019 14:57:10.000 236
January 6th 2019 12:43:43.000 232
January 6th 2019 11:03:39.000 232
January 6th 2019 00:44:57.000 234
January 5th 2019 17:03:55.000 236
January 5th 2019 16:32:26.000 235
January 5th 2019 09:28:19.000 224
January 5th 2019 07:35:41.000 225
January 4th 2019 23:42:26.000 233
January 4th 2019 19:06:53.000 232
January 4th 2019 12:05:42.000 232
January 4th 2019 07:51:47.000 222
January 4th 2019 07:04:50.000 225
January 3rd 2019 21:50:03.000 237
January 3rd 2019 14:54:42.000 239
January 3rd 2019 13:52:36.000 233
January 3rd 2019 04:55:19.000 235
January 3rd 2019 02:05:06.000 237

My job

{
  "job_id": "",
  "job_type": "anomaly_detector",
  "groups": [
  ],
  "analysis_config": {
    "bucket_span": "1d",
    "detectors": [
      {
        "detector_description": "min(RUNDOWN_TIME)",
        "function": "min",
        "field_name": "RUNDOWN_TIME",
        "partition_field_name": "fwot.keyword",
        "rules": []
      }
    ],
    "influencers": [
      "fwot.keyword"
    ]
  },
  "analysis_limits": {
    "model_memory_limit": "1024mb"
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "model_plot_config": {
    "enabled": true
  },
  "model_snapshot_retention_days": 1,
  "custom_settings": {
    "custom_urls": [
      {
        "url_name": "",
        "url_value": ""
      }
    ]
  },
  "datafeed_config": {
    "indices": [
      "XXX*"
    ],
    "types": [],
    "query": {
      "bool": {
        "must": [
          {
            "range": {
              "RUNDOWN_TIME": {
                "from": 10,
                "to": 1000,
                "include_lower": true,
                "include_upper": false,
                "boost": 1
              }
            }
          },
          {
            "exists": {
              "field": "RUNDOWN_TIME",
              "boost": 1
            }
          }
        ],
        "adjust_pure_negative": true,
        "boost": 1
      }
    },
    "chunking_config": {
      "mode": "auto"
    }
  }
}
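For reference, model_plot documents like the ones below can be pulled straight from the results index with a search along these lines (a sketch; the result_type filter and the sort are assumptions, not the exact query I ran):

GET .ml-anomalies-shared/_search
{
  "query": {
    "term": { "result_type": "model_plot" }
  },
  "sort": [
    { "timestamp": "desc" }
  ]
}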

The results

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 8.064557,
    "hits": [
      {
        "_index": ".ml-anomalies-shared",
        "_type": "doc",
        "_id": "rundown_time__model_plot_1546819200000_86400_0_1913124440_6",
        "_score": 8.064557,
        "_source": {
          "job_id": "rundown_time_v2",
          "result_type": "model_plot",
          "bucket_span": 86400,
          "detector_index": 0,
          "timestamp": 1546819200000,
          "partition_field_name": "fwot.keyword",
          "partition_field_value": "A",
          "model_feature": "'minimum value by person'",
          "model_lower": 217.07847285447784,
          "model_upper": 244.3672249202423,
          "model_median": 230.72284888736007
        }
      },
      {
        "_index": ".ml-anomalies-shared",
        "_type": "doc",
        "_id": "rundown_time__model_plot_1546905600000_86400_0_1913124440_6",
        "_score": 8.064557,
        "_source": {
          "job_id": "rundown_time",
          "result_type": "model_plot",
          "bucket_span": 86400,
          "detector_index": 0,
          "timestamp": 1546905600000,
          "partition_field_name": "fwot.keyword",
          "partition_field_value": "A",
          "model_feature": "'minimum value by person'",
          "model_lower": 216.99082736038537,
          "model_upper": 244.27957942614984,
          "model_median": 230.6352033932676
        }
      },
      {
        "_index": ".ml-anomalies-shared",
        "_type": "doc",
        "_id": "rundown_time__model_plot_1546732800000_86400_0_1913124440_6",
        "_score": 7.965691,
        "_source": {
          "job_id": "rundown_time",
          "result_type": "model_plot",
          "bucket_span": 86400,
          "detector_index": 0,
          "timestamp": 1546732800000,
          "partition_field_name": "fwot.keyword",
          "partition_field_value": "A",
          "model_feature": "'minimum value by person'",
          "model_lower": 217.16578366600157,
          "model_upper": 244.45453573176604,
          "model_median": 230.8101596988838,
          "actual": 232
        }
      }
    ]
  }
}

If your ML job stops seeing data once it transitions from historical to real-time operation, the likely culprit is that your ingest delay is bigger than the query_delay parameter on the ML job's datafeed. Increase the query_delay so there is enough time for the data to be ingested and searchable before the ML job searches for it for analysis.

See: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/ml-put-datafeed.html

If your data comes in once per day, what time of day does it appear in Elasticsearch? Midnight? Two minutes after midnight? Later? This is important to know in order to set the query_delay and frequency parameters correctly. For example, if the data reliably appears in Elasticsearch no earlier than 3AM, then set:

bucket_span : 1d
frequency : 1d
query_delay: 3h

That way, the query is done once per day at 3AM for the previous day's data.
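If you'd rather apply those settings to the existing datafeed than recreate the job, the datafeed update API can do it. A minimal sketch, assuming Elasticsearch 6.5 and a hypothetical datafeed id of datafeed-rundown_time (you may need to stop the datafeed first and restart it afterwards):

POST _xpack/ml/datafeeds/datafeed-rundown_time/_update
{
  "frequency": "1d",
  "query_delay": "3h"
}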

Hello Rich,

The data is coming in in real time (a few seconds of lag), at some point within the day. There may be several documents per day (e.g. 3), and some days there may be no data at all.

I'll try your config, but I don't understand this behavior. Even if I run the datafeed every 2 minutes, the bucket should still span 1 day, right? Of course there is a lot of overlap, but it should still work fine, shouldn't it?

Can you explain the behavior, then?

best

The datafeed requests data in chunks equal to frequency, in chronological order, delayed by query_delay. For simplicity's sake, let's assume frequency=1h and query_delay=5m. Given that:

At 12:05am, the datafeed will request/search data that has a timestamp between 11:00pm and 12:00am
At 1:05am, the datafeed will request/search data that has a timestamp between 12:00am and 1:00am
At 2:05am, the datafeed will request/search data that has a timestamp between 1:00am and 2:00am
...and so on

Now let's say your data gets inserted into Elasticsearch at 2:00am, but the document being inserted has a timestamp of 12:00am. There is no way the ML job will ever see this data, because:

  • When the datafeed asked for data in the 12:00am range (at 12:05am), the data wasn't there yet.
  • When the data did arrive at 2:00am, the datafeed's 2:05am search had already "moved on" to data with timestamps between 1:00am and 2:00am, so it missed the newly inserted document.

In other words, the time at which the data gets inserted and the timestamp of the document in the index both matter a lot. The ML datafeed's configuration needs to accommodate this.
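To make that concrete, the searches the datafeed issues look roughly like this. This is a simplified sketch (the real datafeed adds the job's own query, chunking, and epoch-millisecond bounds on top), reusing the XXX* index pattern and @timestamp field from the job above:

# Issued at 12:05am: covers 11:00pm-12:00am. The document indexed at 2:00am with
# @timestamp = 12:00am is not searchable yet, and no later search revisits this window.
GET XXX*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2019-01-08T23:00:00",
        "lte": "2019-01-09T00:00:00"
      }
    }
  }
}

# Issued at 2:05am: the window has already moved on to 1:00am-2:00am.
GET XXX*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2019-01-09T01:00:00",
        "lte": "2019-01-09T02:00:00"
      }
    }
  }
}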
