Machine Learning datafeed skipping documents that seem to be there

elasticitous · March 4, 2019, 6:08pm

I'm trying to run some machine learning jobs, but I'm having problems with their datafeeds seemingly skipping documents that everything else indicates are there to process when the job is running.

The datafeeds run on indexes that update every 15 minutes. The bucket interval is 1 hour, the frequency is 1 hour (equal or multiple of the bucket interval), and the query_delay is set at 35 minutes (enough time for two ingestion events between job runs). The index refresh is at the default 1 second.

When the datafeed is started for the first time, everything works great. As soon as it hits the 35 minute query delay for the most recent bucket though, the datafeed reports it can't find any indexed documents and reports 99 severity anomalies due to 0 document count.

If the job is left to run, the Machine Learning Job Management row has a warning with "Documents Missing due to Ingest Latency" which also gets annotated onto the Single Metric Viewer visual in multiple places since the "switch over" between historic and live data.

"Datafeed has missed N documents due to ingest latency, latest bucket with missing data is [timestamp]. Consider increasing query_delay"

However, even right at the moment the job runs, I can see in "Discover" that the documents are definitely there. I followed the advice from another thread and created a Watcher that queries the document count of that index every 5 minutes to "prove" they're really there, and it does see them. If I stop the datafeed immediately after the job runs and recreate it, the new datafeed also sees the documents and doesn't report the 0-value anomalies the old one had literally just reported.

Whenever I set this datafeed to "live updates", though, all I get is 0s across the board when it runs, even if I set the query delay to 2h, which should be long, long, long after ingestion has finished.

To make things even stranger, this job worked fine in 6.4 without managing any of these delay settings; I just set the bucket interval, left everything else on default, and it worked.

Why is my datafeed constantly reporting 0 data only when it's doing a live update but no other part of Kibana believes those documents aren't there?

The anomaly:

Discover:

Watcher query:

    "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "click_datetime": {
                  "gte": "now-2h-35m",
                  "lte": "now-1h-35m"
                }
              }
            }
          ]
        }
      }

And results of the watcher:

   "hits": {
      "hits": [],
      "total": 15273,
      "max_score": 0
    },

I've let the job run for a few cycles and here are the results:

The data from that first auto-run always comes back zero. Previously I've turned the job off at this point, but this time I'm going to let it run overnight and see if maybe it only happens on the first auto-run and can be ignored going forward.

The annotation in that screenshot was auto-added and says this. 100% those documents were there more than 2 hours before this job ran.

richcollier · March 5, 2019, 1:49pm

I think there might be a few things going on here.

There is a bug that was introduced in v6.5 (and will be fixed in v6.6.2+) that inadvertently creates an anomaly on an interim (un-finalized, or still-open) bucket. See:

and the corresponding bug:

github.com/elastic/ml-cpp

[ML] Unexpected interim results after advancing time into new empty bucket

opened 04:40PM - 30 Nov 18 UTC

closed 06:21PM - 26 Feb 19 UTC

dimitris-athanasiou

>bug :ml

**How to reproduce**

1. Create a job with simple count detector and a bucket …span of 5m
2. Run some data through the job up to the end of a bucket (using the `end` parameter of the start datafeed API)
3. Open the job again (it should have been auto-closed from step 2)
4. Call the flush API:

```
POST _xpack/anomaly_detectors/{job_id}/flush?advance_time={time}&calc_interim=true
```

where {time} should be a timestamp into the current bucket. E.g., if `end` was `2018-12-01T00:00:00Z`, {time} should be `2018-12-01T00:00:01Z`

**Observed Behaviour**

If you get the anomaly records, you should see a record which is interim and has an actual value of `0.0`. This shouldn't have been created. Interestingly, calling step 4 with {time} being one millisecond forward makes that record disappear. Also, this is broken since version `6.4.`

The auto-annotation for missing data, however, should not stumble onto this bug because it explicitly ignores interim buckets. To validate your datafeed timing (which buckets it's querying and when), you could enable TRACE logging for the datafeed:

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.ml.datafeed": "TRACE"
  }
}

(This is a transient setting that won't survive a cluster restart, and you can always reset it when the experiment is over, for example by setting it to "INFO", or to null to restore the default.)

You can also have your Watch log what it sees as well - then, in the elasticsearch.log file we should have a better understanding of when the datafeed runs and what window of time it queries - while at the same time seeing the output of your watch that is trying to also do the validation.

elasticitous · March 5, 2019, 3:42pm

I'm also getting the issue with the "Interim result" anomaly, but my Watcher isn't flagging those. Even after the bucket has long since closed, I still have the same 0-document anomalies I had when I started the job. Also, stopping the datafeed and restarting it from the beginning of the data doesn't fix the anomalies; only deleting the job and datafeed, recreating them, and then restarting fixes them.

Strangely, I let the job run overnight with a 75 minute query delay and it worked a few times after not working initially. The yellow anomaly is one of the "Interim Results" and it didn't send an Alert about it.

That's great if that's a solution, but it doesn't explain why I need a 75 minute delay when the data is there 15 minutes or less after the bucket hour.

I'll set that datafeed log setting and report back, thanks.

elasticitous · March 6, 2019, 7:22pm

So I tried setting that logging setting and here's what I obtained:

[2019-03-06T10:17:22,078][DEBUG][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_job] Aggregating Data summary response was obtained
[2019-03-06T10:17:22,078][DEBUG][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_job]Chunked search configured: kind = AggregatedDataSummary, dataTimeSpread = 5558399000 ms, chunk span = 3600000000 ms
[2019-03-06T10:17:22,078][TRACE][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_job] advances time to [1546322400000, 1549922400000)
[2019-03-06T10:17:22,078][DEBUG][o.e.x.m.d.e.a.AbstractAggregationDataExtractor] [my_elastic_cluster] [my_job] Executing aggregated search
[2019-03-06T10:17:28,528][DEBUG][o.e.x.m.d.e.a.AbstractAggregationDataExtractor] [my_elastic_cluster] [my_job] Search response was obtained
[2019-03-06T10:17:28,548][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Processed another 336 records

...processing many records...

[2019-03-06T10:17:32,522][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Processed another 279 records
[2019-03-06T10:17:32,522][TRACE][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_job] advances time to [1549922400000, 1551880800000)
[2019-03-06T10:17:32,522][DEBUG][o.e.x.m.d.e.a.AbstractAggregationDataExtractor] [my_elastic_cluster] [my_job] Executing aggregated search
[2019-03-06T10:17:36,929][DEBUG][o.e.x.m.d.e.a.AbstractAggregationDataExtractor] [my_elastic_cluster] [my_job] Search response was obtained
[2019-03-06T10:17:36,930][DEBUG][o.e.x.m.d.e.a.AggregationToJsonProcessor] [my_elastic_cluster] Skipping bucket at [1549918800000], startTime is [1549922400000]

...processing many records....

[2019-03-06T10:17:38,580][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Processed another 336 records
[2019-03-06T10:17:38,693][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Processed another 112 records
[2019-03-06T10:17:38,693][DEBUG][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Complete iterating data extractor [null], [10807], [1551881836274], [true], [false]
[2019-03-06T10:17:38,693][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Sending flush request
[2019-03-06T10:17:39,947][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Sending persist request
[2019-03-06T10:17:39,947][INFO ][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Lookback has finished
[2019-03-06T10:17:39,947][DEBUG][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] Waiting [42.3m] before executing next realtime import for job [my_job]
[2019-03-06T10:17:39,947][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [my_elastic_cluster] [my_job] [autodetect/40821] [CAnomalyJob.cc@1352] Pruning all models
[2019-03-06T10:17:39,947][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [my_elastic_cluster] [my_job] [autodetect/40821] [CAnomalyJob.cc@996] Background persist starting data copy
[2019-03-06T10:17:39,948][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [my_elastic_cluster] [my_job] [autodetect/40821] [CBackgroundPersister.cc@186] Background persist starting background thread
[2019-03-06T10:23:12,975][INFO ][o.e.c.m.MetaDataIndexTemplateService] [my_elastic_cluster] adding template [kibana_index_template:.kibana] for index patterns [.kibana]
[2019-03-06T10:23:13,006][INFO ][o.e.c.m.MetaDataMappingService] [my_elastic_cluster] [.kibana_1/gSRaPMLhTc6K2bvQXVysIQ] update_mapping [doc]
[2019-03-06T10:23:46,412][INFO ][o.e.c.m.MetaDataIndexTemplateService] [my_elastic_cluster] adding template [kibana_index_template:.kibana] for index patterns [.kibana]
[2019-03-06T10:23:46,438][INFO ][o.e.c.m.MetaDataMappingService] [my_elastic_cluster] [.kibana_1/gSRaPMLhTc6K2bvQXVysIQ] update_mapping [doc]
[2019-03-06T11:00:00,101][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Searching data in: [1551881836275, 1551884400000)
[2019-03-06T11:00:00,101][DEBUG][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Complete iterating data extractor [null], [0], [1551884399999], [true], [false]
[2019-03-06T11:00:00,101][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Sending flush request
[2019-03-06T11:00:00,639][DEBUG][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] Waiting [59.9m] before executing next realtime import for job [my_job]

I see searches for data from 8am to 9am represented by:

[2019-03-06T10:17:32,522][TRACE][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_job] advances time to [1549922400000, 1551880800000)

That seems to advance the time to 8am.

[2019-03-06T10:17:38,693][DEBUG][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Complete iterating data extractor [null], [10807], [1551881836274], [true], [false]

That seems to bring it to 8:17am for some reason.

[2019-03-06T11:00:00,101][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_job] Searching data in: [1551881836275, 1551884400000)

That brings it to 9am but there's no "processing" rows logged afterward.
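For anyone following along, the bracketed numbers in the TRACE lines are epoch milliseconds. A minimal sketch for converting them to readable UTC times is below. The function name `to_utc_iso` is made up for illustration. Note that the log timestamps in this thread are in local time, which appears to be six hours behind the UTC "measured at" values the Watcher prints.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_iso(epoch_ms: int) -> str:
    """Convert an epoch-millisecond value from the datafeed log to ISO-8601 UTC."""
    # Integer timedelta arithmetic avoids float rounding of the millisecond part.
    moment = EPOCH + timedelta(milliseconds=epoch_ms)
    return moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")


# Values copied from the log lines quoted above.
for value in (1551880800000, 1551881836274, 1551884400000):
    print(value, "->", to_utc_iso(value))
```

These come out as 14:00:00, 14:17:16.274 and 15:00:00 UTC on 2019-03-06, which is 8:00, 8:17 and 9:00 local time with a six hour offset, matching the reading above.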

And sure enough there's a 0 document count for that bucket:

Here was the output from a Watcher which ran the same query at 11:01. For some reason this Watcher isn't actually writing to the .log file but I guess that's a different problem:

"actions": [
  {
    "id": "my-logging-action",
    "type": "logging",
    "status": "success",
    "logging": {
      "logged_text": "There are 14240 documents in the index for the hour before the last hour (35 min delay) - measured at 2019-03-06T17:01:09.601Z"
    }
  }
]

So both Discover and Watcher confirm documents in the 8am - 9am bucket, but the TRACE output for the ML job seems to stop at 8:17am, restart later, and then skip every document in that bucket and just close itself.

Any idea what's going on?

dmitri · March 7, 2019, 9:29am

In the log entries, after the log level you can see the code class that issues the message, e.g. [o.e.x.m.d.DatafeedJob ].

Could you please post all log entries whose class starts with o.e.x.m.d.? We need to see them all to understand what is going on here.

Also, could you paste the configuration of the job and the datafeed?

(You can obfuscate any field names of your data as needed)

elasticitous · March 7, 2019, 3:54pm

I fixed my Watcher and found out that it is now logging 0 documents right before the ML job fires.

I hadn't realized that the log files are split up by machine, so the Watcher entries and the job entries were in different files.

I'm going to do some investigation into the ingestion pipeline since Watcher and ML are both reporting an ingestion problem now.

richcollier · March 7, 2019, 5:13pm

This is a good discovery - nice detective work. Keep us posted as to what you find!

elasticitous · March 7, 2019, 10:11pm

Unfortunately there was an additional problem with ingestion that was an actual ingestion problem, but after fixing that, the original problem persists. I captured a good example of it here:

Here's the 0 count anomaly triggered on the 2pm to 3pm bucket:

Monitoring shows there were no interruptions to ingestion:

And here are the complete logs including the Watchers from the time I started the datafeed until the time it sent the anomaly alert:

Here's the first part of the logs:

[2019-03-07T14:13:59,352][INFO ][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Datafeed started (from: 2019-03-07T17:59:59.001Z to: real-time) with frequency [3600000ms]
[2019-03-07T14:13:59,352][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Searching data in: [1551981599001, 1551988439352)
[2019-03-07T14:13:59,372][DEBUG][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_ml_job] Aggregating Data summary response was obtained
[2019-03-07T14:13:59,372][DEBUG][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_ml_job]Chunked search configured: kind = AggregatedDataSummary, dataTimeSpread = 3599000 ms, chunk span = 3600000000 ms
[2019-03-07T14:13:59,373][TRACE][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_ml_job] advances time to [1551981600000, 1551985200000)
[2019-03-07T14:13:59,373][DEBUG][o.e.x.m.d.e.a.AbstractAggregationDataExtractor] [my_elastic_cluster] [my_ml_job] Executing aggregated search
[2019-03-07T14:13:59,387][DEBUG][o.e.x.m.d.e.a.AbstractAggregationDataExtractor] [my_elastic_cluster] [my_ml_job] Search response was obtained
[2019-03-07T14:13:59,387][DEBUG][o.e.x.m.d.e.a.AggregationToJsonProcessor] [my_elastic_cluster] Skipping bucket at [1551978000000], startTime is [1551981600000]
[2019-03-07T14:13:59,403][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Processed another 1 records
[2019-03-07T14:13:59,404][DEBUG][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Complete iterating data extractor [null], [1], [1551988439351], [true], [false]
[2019-03-07T14:13:59,404][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Sending flush request
[2019-03-07T14:13:59,474][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Sending persist request
[2019-03-07T14:13:59,474][INFO ][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Lookback has finished
[2019-03-07T14:13:59,475][DEBUG][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] Waiting [1.1h] before executing next realtime import for job [my_ml_job]
[2019-03-07T14:13:59,475][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [my_elastic_cluster] [my_ml_job] [autodetect/28913] [CAnomalyJob.cc@1352] Pruning all models
[2019-03-07T14:13:59,475][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [my_elastic_cluster] [my_ml_job] [autodetect/28913] [CAnomalyJob.cc@996] Background persist starting data copy
[2019-03-07T14:13:59,475][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [my_elastic_cluster] [my_ml_job] [autodetect/28913] [CBackgroundPersister.cc@186] Background persist starting background thread
[2019-03-07T14:32:29,623][INFO ][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] [stop_datafeed (api)] attempt to stop datafeed [my_ml_job] [132]
[2019-03-07T14:32:29,623][INFO ][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] [stop_datafeed (api)] attempt to stop datafeed [my_ml_job] for job [my_ml_job]
[2019-03-07T14:32:29,623][INFO ][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] [stop_datafeed (api)] try lock [5m] to stop datafeed [my_ml_job] for job [my_ml_job]...
[2019-03-07T14:32:29,623][INFO ][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] [stop_datafeed (api)] stopping datafeed [my_ml_job] for job [my_ml_job], acquired [true]...
[2019-03-07T14:32:29,624][INFO ][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] [stop_datafeed (api)] datafeed [my_ml_job] for job [my_ml_job] has been stopped
[2019-03-07T14:32:36,253][INFO ][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Datafeed started (from: 2019-03-07T18:59:59.001Z to: real-time) with frequency [3600000ms]
[2019-03-07T14:32:36,253][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Searching data in: [1551985199001, 1551989556253)
[2019-03-07T14:32:36,271][DEBUG][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_ml_job] Aggregating Data summary response was obtained
[2019-03-07T14:32:36,272][DEBUG][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_ml_job]Chunked search configured: kind = AggregatedDataSummary, dataTimeSpread = 3599000 ms, chunk span = 3600000000 ms

elasticitous · March 7, 2019, 10:12pm

Here's the second half:

[2019-03-07T14:32:36,272][TRACE][o.e.x.m.d.e.c.ChunkedDataExtractor] [my_elastic_cluster] [my_ml_job] advances time to [1551985200000, 1551988800000)
[2019-03-07T14:32:36,272][DEBUG][o.e.x.m.d.e.a.AbstractAggregationDataExtractor] [my_elastic_cluster] [my_ml_job] Executing aggregated search
[2019-03-07T14:32:36,285][DEBUG][o.e.x.m.d.e.a.AbstractAggregationDataExtractor] [my_elastic_cluster] [my_ml_job] Search response was obtained
[2019-03-07T14:32:36,286][DEBUG][o.e.x.m.d.e.a.AggregationToJsonProcessor] [my_elastic_cluster] Skipping bucket at [1551981600000], startTime is [1551985200000]
[2019-03-07T14:32:36,292][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Processed another 1 records
[2019-03-07T14:32:36,292][DEBUG][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Complete iterating data extractor [null], [1], [1551989556252], [true], [false]
[2019-03-07T14:32:36,292][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Sending flush request
[2019-03-07T14:32:36,361][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Sending persist request
[2019-03-07T14:32:36,361][INFO ][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Lookback has finished
[2019-03-07T14:32:36,361][DEBUG][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] Waiting [47.3m] before executing next realtime import for job [my_ml_job]
[2019-03-07T14:32:36,361][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [my_elastic_cluster] [my_ml_job] [autodetect/28913] [CAnomalyJob.cc@1352] Pruning all models
[2019-03-07T14:32:36,361][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [my_elastic_cluster] [my_ml_job] [autodetect/28913] [CAnomalyJob.cc@996] Background persist starting data copy
[2019-03-07T14:32:36,362][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [my_elastic_cluster] [my_ml_job] [autodetect/28913] [CBackgroundPersister.cc@186] Background persist starting background thread
[2019-03-07T15:16:09,639][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [tw-prd-es02.cotterweb.local] There are 16607 documents in the index for the hour before the last hour (20 min delay) - measured at 2019-03-07T21:16:09.626Z
[2019-03-07T15:20:00,101][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Searching data in: [1551989556253, 1551992400000)
[2019-03-07T15:20:00,101][DEBUG][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Complete iterating data extractor [null], [0], [1551992399999], [true], [false]
[2019-03-07T15:20:00,101][TRACE][o.e.x.m.d.DatafeedJob    ] [my_elastic_cluster] [my_ml_job] Sending flush request
[2019-03-07T15:20:00,491][DEBUG][o.e.x.m.d.DatafeedManager] [my_elastic_cluster] Waiting [59.9m] before executing next realtime import for job [my_ml_job]
[2019-03-07T15:20:58,604][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [my_elastic_cluster] Alert for job [my_ml_job] at [2019-03-07T20:00:00.000Z] score [98]
[2019-03-07T15:20:58,911][INFO ][o.e.c.m.MetaDataMappingService] [my_elastic_cluster] [.watcher-history-9-2019.03.07/CxjDUHFcRtWyviX--4oV3Q] update_mapping [doc]
[2019-03-07T15:21:09,349][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [tw-prd-es02.cotterweb.local] There are 16556 documents in the index for the hour before the last hour (20 min delay) - measured at 2019-03-07T21:21:09.337Z
[2019-03-07T15:22:10,627][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [my_elastic_cluster] Alert for job [my_ml_job] at [2019-03-07T20:00:00.000Z] score [98]
[2019-03-07T15:23:22,649][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [my_elastic_cluster] Alert for job [my_ml_job] at [2019-03-07T20:00:00.000Z] score [98]
[2019-03-07T15:24:34,673][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [my_elastic_cluster] Alert for job [my_ml_job] at [2019-03-07T20:00:00.000Z] score [98]
[2019-03-07T15:25:46,726][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [my_elastic_cluster] Alert for job [my_ml_job] at [2019-03-07T20:00:00.000Z] score [98]
[2019-03-07T15:26:09,478][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [tw-prd-es02.cotterweb.local] There are 16489 documents in the index for the hour before the last hour (20 min delay) - measured at 2019-03-07T21:26:09.466Z
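The `Waiting [...]` and `Searching data in` lines above are consistent with a simple scheduling rule, sketched below. This is inferred purely from the log timestamps in this thread, not taken from the Elasticsearch source, so treat the function names, the formula, and the 100 ms offset as assumptions.

```python
HOUR_MS = 3_600_000
MINUTE_MS = 60_000
# Assumption: a small fixed offset, guessed from the ",101" run timestamps in the log.
TASK_DELAY_MS = 100


def next_realtime_run(now_ms: int, frequency_ms: int, query_delay_ms: int) -> int:
    """Guess the next real-time run: next frequency boundary plus the query delay."""
    next_interval = now_ms + frequency_ms
    aligned = next_interval - next_interval % frequency_ms
    return aligned + query_delay_ms % frequency_ms + TASK_DELAY_MS


def realtime_search_end(run_ms: int, frequency_ms: int, query_delay_ms: int) -> int:
    """Guess the exclusive end of the real-time search window for a given run."""
    horizon = run_ms - query_delay_ms
    return horizon - horizon % frequency_ms


# Lookback finished at 14:13:59,475 local time, i.e. 20:13:59.475 UTC.
lookback_done_ms = 1551989639475
run_ms = next_realtime_run(lookback_done_ms, HOUR_MS, 20 * MINUTE_MS)
print("next run (epoch ms):", run_ms)
print("wait in minutes:", round((run_ms - lookback_done_ms) / MINUTE_MS, 1))
print("search end (epoch ms):", realtime_search_end(run_ms, HOUR_MS, 20 * MINUTE_MS))
```

With a 1h frequency and a 20m query delay this gives a run at 21:20:00.100 UTC (15:20 local) and a search end of 1551992400000, which agrees with the `Waiting [1.1h]` line and the `Searching data in: [1551989556253, 1551992400000)` line.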

elasticitous · March 7, 2019, 10:16pm

Here's the job and datafeed config.

Job:

{
  "description": "my_ml_job",
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "detector_description": "MyKPI",
        "function": "count"
      }
    ],
    "summary_count_field_name": "doc_count"
  },
  "model_plot_config": {"enabled": "true"},
  "data_description": {"time_field": "my_timefield"}
}

Datafeed:

{
  "job_id": "my_ml_job",
  "indices": ["my_ml_job*"],
  "frequency": "1h",
  "query_delay": "20m",
  "query": {
    "bool": {
        "must": [
            {"match": {"field1": "value1"}}
        ]
    }
  },
  "aggs": {
    "buckets": {
      "date_histogram": {
        "field": "my_timefield",
        "interval": "1h",
        "time_zone": "UTC"
      },
      "aggs": {
        "my_timefield": {"max": {"field": "my_timefield"}}
      }
    }
  }
}

Here's the Watcher query:

  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "field1": "value1"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "click_datetime": {
              "gte": "now-20m-1h",
              "lte": "now-20m"
            }
          }
        }
      ]
    }
  }
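One caveat with the Watcher query above is that it counts a sliding one-hour window, while the ML job counts hour-aligned buckets. A rough sketch for building a bucket-aligned range filter is below. The helper names are hypothetical and the time field is a parameter because the real field name is obfuscated in this thread. The `format` option of the range query is used so the bounds can be passed as epoch milliseconds.

```python
def last_complete_bucket(now_ms: int, query_delay_ms: int, span_ms: int) -> tuple[int, int]:
    """Return (start, end) of the newest bucket that ended before now - query_delay."""
    horizon = now_ms - query_delay_ms
    end = horizon - horizon % span_ms
    return end - span_ms, end


def bucket_range_filter(time_field: str, start_ms: int, end_ms: int) -> dict:
    """Build a range filter covering [start, end), like a histogram bucket does."""
    return {
        "range": {
            time_field: {
                "gte": start_ms,
                "lt": end_ms,  # exclusive upper bound, unlike the "lte" above
                "format": "epoch_millis",
            }
        }
    }


# Example: evaluated at 21:20:00.101 UTC on 2019-03-07 with a 20m delay and 1h buckets.
start, end = last_complete_bucket(1551993600101, 20 * 60_000, 3_600_000)
print(bucket_range_filter("my_timefield", start, end))
```

For that example the window is 1551988800000 to 1551992400000, i.e. the 20:00 to 21:00 UTC bucket that the alert refers to.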

Also, I've confirmed via 2 internal logs that ingestion is running without error.

I should note that this job ran perfectly for weeks in 6.4 with no such errors.

dmitri · March 8, 2019, 1:51pm

Thank you for providing the details we requested.

I have reproduced the issue. The problem is that, when aggregations are used in the datafeed, the first real-time search after the lookback has completed skips a histogram bucket. Note that subsequent histogram buckets should be retrieved correctly.

I have also verified that this was an issue in 6.4, so I am surprised that you are saying there was no such issue in that version.
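To illustrate the arithmetic with the numbers from the logs above: the lookback stops partway through an hour, so the first real-time search starts mid-bucket, and the histogram bucket that contains that start time has a key earlier than the search start. The sketch below is only an illustration that assumes buckets with a key before the search start are dropped (as the `Skipping bucket at [...], startTime is [...]` log line suggests). It is not the actual datafeed code, and the function names are made up.

```python
BUCKET_MS = 3_600_000  # 1h bucket_span, as in the job config


def bucket_start(ts_ms: int, span_ms: int = BUCKET_MS) -> int:
    """Key of the histogram bucket that contains ts_ms."""
    return ts_ms - ts_ms % span_ms


def buckets_kept(search_start_ms: int, search_end_ms: int, span_ms: int = BUCKET_MS) -> list[int]:
    """Bucket keys overlapping [start, end) that survive a 'key >= start' check."""
    first = bucket_start(search_start_ms, span_ms)
    keys = range(first, search_end_ms, span_ms)
    return [key for key in keys if key >= search_start_ms]


# Lookback chunk from the log: starts exactly on a bucket boundary, bucket is kept.
print(buckets_kept(1551985200000, 1551988800000))
# First real-time search from the log: starts mid-bucket, nothing is kept.
print(buckets_kept(1551989556253, 1551992400000))
```

Under that assumption the 20:00 UTC bucket (key 1551988800000) is never sent to the job, which matches the zero-count anomaly reported at 2019-03-07T20:00:00.000Z.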

I have raised the issue in https://github.com/elastic/elasticsearch/issues/39842. You can track progress there.

Thank you very much for helping us detect this issue.

elasticitous · March 8, 2019, 3:43pm

Thanks! At least I know I'm not crazy.

I never saw the annotations and warnings in 6.4, but now that you mention it, maybe there was a drop to 0 count when I turned it on for the first time. I didn't have Watchers set up at that time.

I thought this was an ongoing problem because there really was an issue in the ingestion pipeline, which the job was correctly detecting and which we have since fixed. So put another one in the win column for this feature.

Thanks for all the help.

Kim-Kruse-Hansen · March 9, 2019, 2:55pm

Hi

Just thought I would join this topic.

I can confirm that I think something changed from 6.4 to 6.5. The number of false alarms from ML went crazy high after 6.5, and it is still the case in 6.6.1. I was thinking of raising a support ticket, but haven't gotten around to it.

But I'm happy to see that 6.6.2 might fix it.

regards
Kim

dmitri · March 12, 2019, 9:45am

@Kim-Kruse-Hansen There are 2 issues discussed in this thread.

The issue that is fixed in 6.6.2 is the one where ML could generate interim anomalies when it shouldn't (https://github.com/elastic/ml-cpp/issues/324).

The other is the one that was raised due to this thread: https://github.com/elastic/elasticsearch/issues/39842. Note the fix for this will not make 6.6.2.

If the issues you experience do not seem related to one of the above, then please raise a support issue.

system · April 9, 2019, 9:45am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
