Why do I have results without an actual value?

Hello, I'd like to understand why the actual field may not be present in the anomaly results (see the end of this post).

Here is the scenario:
I create a job with a daily datafeed and start it with no end date. For all the historical values I get daily results, but now that I am waiting for real-time data, I can see that real time is not processed the same way:

the data

Date RUNDOWN_TIME
January 9th 2019 11:34:48.000 236
January 9th 2019 09:49:18.000 238
January 9th 2019 08:58:19.000 236
January 9th 2019 00:05:59.000 237
January 8th 2019 14:19:07.000 236
January 8th 2019 04:48:45.000 235
January 7th 2019 21:34:08.000 236
January 7th 2019 16:40:07.000 237
January 7th 2019 12:22:27.000 236
January 7th 2019 00:34:29.000 240
January 6th 2019 14:57:10.000 236
January 6th 2019 12:43:43.000 232
January 6th 2019 11:03:39.000 232
January 6th 2019 00:44:57.000 234
January 5th 2019 17:03:55.000 236
January 5th 2019 16:32:26.000 235
January 5th 2019 09:28:19.000 224
January 5th 2019 07:35:41.000 225
January 4th 2019 23:42:26.000 233
January 4th 2019 19:06:53.000 232
January 4th 2019 12:05:42.000 232
January 4th 2019 07:51:47.000 222
January 4th 2019 07:04:50.000 225
January 3rd 2019 21:50:03.000 237
January 3rd 2019 14:54:42.000 239
January 3rd 2019 13:52:36.000 233
January 3rd 2019 04:55:19.000 235
January 3rd 2019 02:05:06.000 237

My job

{
  "job_id": "",
  "job_type": "anomaly_detector",
  "groups": [
  ],
  "analysis_config": {
    "bucket_span": "1d",
    "detectors": [
      {
        "detector_description": "min(RUNDOWN_TIME)",
        "function": "min",
        "field_name": "RUNDOWN_TIME",
        "partition_field_name": "fwot.keyword",
        "rules": []
      }
    ],
    "influencers": [
      "fwot.keyword"
    ]
  },
  "analysis_limits": {
    "model_memory_limit": "1024mb"
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "model_plot_config": {
    "enabled": true
  },
  "model_snapshot_retention_days": 1,
  "custom_settings": {
    "custom_urls": [
      {
        "url_name": "",
        "url_value": ""
      }
    ]
  },
  "datafeed_config": {
    "indices": [
      "XXX*"
    ],
    "types": [],
    "query": {
      "bool": {
        "must": [
          {
            "range": {
              "RUNDOWN_TIME": {
                "from": 10,
                "to": 1000,
                "include_lower": true,
                "include_upper": false,
                "boost": 1
              }
            }
          },
          {
            "exists": {
              "field": "RUNDOWN_TIME",
              "boost": 1
            }
          }
        ],
        "adjust_pure_negative": true,
        "boost": 1
      }
    },
    "chunking_config": {
      "mode": "auto"
    }
  }
}
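For reference, model_plot documents like the ones below can be pulled straight from the results index with a search along these lines (a sketch; the result_type filter and the sort are assumptions, not the exact query I ran):

GET .ml-anomalies-shared/_search
{
  "query": {
    "term": { "result_type": "model_plot" }
  },
  "sort": [
    { "timestamp": "desc" }
  ]
}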

The results

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 8.064557,
    "hits": [
      {
        "_index": ".ml-anomalies-shared",
        "_type": "doc",
        "_id": "rundown_time__model_plot_1546819200000_86400_0_1913124440_6",
        "_score": 8.064557,
        "_source": {
          "job_id": "rundown_time_v2",
          "result_type": "model_plot",
          "bucket_span": 86400,
          "detector_index": 0,
          "timestamp": 1546819200000,
          "partition_field_name": "fwot.keyword",
          "partition_field_value": "A",
          "model_feature": "'minimum value by person'",
          "model_lower": 217.07847285447784,
          "model_upper": 244.3672249202423,
          "model_median": 230.72284888736007
        }
      },
      {
        "_index": ".ml-anomalies-shared",
        "_type": "doc",
        "_id": "rundown_time__model_plot_1546905600000_86400_0_1913124440_6",
        "_score": 8.064557,
        "_source": {
          "job_id": "rundown_time",
          "result_type": "model_plot",
          "bucket_span": 86400,
          "detector_index": 0,
          "timestamp": 1546905600000,
          "partition_field_name": "fwot.keyword",
          "partition_field_value": "A",
          "model_feature": "'minimum value by person'",
          "model_lower": 216.99082736038537,
          "model_upper": 244.27957942614984,
          "model_median": 230.6352033932676
        }
      },
      {
        "_index": ".ml-anomalies-shared",
        "_type": "doc",
        "_id": "rundown_time__model_plot_1546732800000_86400_0_1913124440_6",
        "_score": 7.965691,
        "_source": {
          "job_id": "rundown_time",
          "result_type": "model_plot",
          "bucket_span": 86400,
          "detector_index": 0,
          "timestamp": 1546732800000,
          "partition_field_name": "fwot.keyword",
          "partition_field_value": "A",
          "model_feature": "'minimum value by person'",
          "model_lower": 217.16578366600157,
          "model_upper": 244.45453573176604,
          "model_median": 230.8101596988838,
          "actual": 232
        }
      }
    ]
  }
}

If your ML job stops seeing data once it transitions from historical to real-time operation, the likely culprit is that your ingest delay is bigger than the query_delay parameter on the ML job's datafeed. Increase the query_delay so there is enough time for the data to be ingested and searchable before the ML job searches for it for analysis.

See: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/ml-put-datafeed.html

If your data comes in once per day, what time of day does it appear in Elasticsearch? Midnight? Two minutes after midnight? Later? This is important to know in order to set the query_delay and frequency parameters correctly. For example, if the data reliably appears in Elasticsearch no earlier than 3AM, then set:

bucket_span : 1d
frequency : 1d
query_delay: 3h

That way, the query is done once per day at 3AM for the previous day's data.
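If you'd rather apply those settings to the existing datafeed than recreate the job, the datafeed update API can do it. A minimal sketch, assuming Elasticsearch 6.5 and a hypothetical datafeed id of datafeed-rundown_time (you may need to stop the datafeed first and restart it afterwards):

POST _xpack/ml/datafeeds/datafeed-rundown_time/_update
{
  "frequency": "1d",
  "query_delay": "3h"
}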

Hello Rich,

The data is coming in in real time (a few seconds of lag), at some point within the day. There may be several documents per day (e.g. 3), and some days there may be no data at all.

I'll try your config, but I don't understand this behavior. Even if I run the datafeed every 2 minutes, the bucket should still span 1 day, right? Of course there is a lot of overlap, but it should still work fine, shouldn't it?

Can you explain the behavior, then?

best

The datafeed requests data in chunks equal to frequency, in chronological order, delayed by query_delay. For simplicity's sake, let's assume frequency=1h and query_delay=5m. Given that:

At 12:05am, the datafeed will request/search data that has a timestamp between 11:00pm and 12:00am
At 1:05am, the datafeed will request/search data that has a timestamp between 12:00am and 1:00am
At 2:05am, the datafeed will request/search data that has a timestamp between 1:00am and 2:00am
...and so on

Now let's say your data gets inserted into Elasticsearch at 2:00am, but the document being inserted has a timestamp of 12:00am. There is no way the ML job will ever see this data, because:

  • When the datafeed asked for data in the 12:00am range (at 12:05am), the data wasn't there yet.
  • When the data did arrive at 2:00am, the datafeed's 2:05am search had already "moved on" to data with timestamps between 1:00am and 2:00am, so it missed the newly inserted document.

In other words, the time at which the data gets inserted and the timestamp of the document in the index both matter a lot. The ML datafeed's configuration needs to accommodate this.
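To make that concrete, the searches the datafeed issues look roughly like this. This is a simplified sketch (the real datafeed adds the job's own query, chunking, and epoch-millisecond bounds on top), reusing the XXX* index pattern and @timestamp field from the job above:

# Issued at 12:05am: covers 11:00pm-12:00am. The document indexed at 2:00am with
# @timestamp = 12:00am is not searchable yet, and no later search revisits this window.
GET XXX*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2019-01-08T23:00:00",
        "lte": "2019-01-09T00:00:00"
      }
    }
  }
}

# Issued at 2:05am: the window has already moved on to 1:00am-2:00am.
GET XXX*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2019-01-09T01:00:00",
        "lte": "2019-01-09T02:00:00"
      }
    }
  }
}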
