Gaps in machine learning data when reverting to snapshot with high query delay

Hi,

I have a job running with a fairly high query delay (30 minutes). The bucket span itself is only 15 minutes, although I don't believe that's relevant.

Occasionally, I have to revert the model to a previous state due to one reason or another.

I proceed as follows:

POST _xpack/ml/datafeeds/<datafeedid>/_stop
POST _xpack/ml/anomaly_detectors/<jobid>/_close
GET _xpack/ml/anomaly_detectors/<jobid>/model_snapshots/
POST _xpack/ml/anomaly_detectors/<jobid>/model_snapshots/<modelid>/_revert
{
  "delete_intervening_results": true
}

POST _xpack/ml/anomaly_detectors/<jobid>/_open
POST _xpack/ml/datafeeds/<datafeedid>/_start
{
  "start": "0"
}

Every time I do that, I notice there is no data for the 30 minutes following the time I reverted to. I assume this is due to the way the query delay is handled. I have tried it on a variety of jobs and the problem occurs consistently.
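For reference, this is roughly how I inspect the results around the revert point afterwards; the <jobid> placeholder and the start/end values are illustrative, not the exact request I run:

GET _xpack/ml/anomaly_detectors/<jobid>/results/buckets
{
  "start": "<revert timestamp>",
  "end": "<revert timestamp + 1h>"
}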

The problem also seems to happen when the model is automatically reverted due to a system failure on the node running the job.

Is this due to a bug or am I doing something incorrectly?

Thanks,

Thomas

Hi Thomas,

May I ask what version you are running?

Also, could you provide an example of the reverted model snapshot's latest_record_time_stamp and the actual time when you performed the revert? Finally, it would also help to know the datafeed's frequency.
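If it helps, those values should be visible with requests along these lines (5.x paths; <jobid> and <datafeedid> are placeholders):

GET _xpack/ml/anomaly_detectors/<jobid>/model_snapshots/
GET _xpack/ml/datafeeds/<datafeedid>
GET _xpack/ml/anomaly_detectors/<jobid>/_stats

The model snapshots response contains latest_record_time_stamp, the datafeed configuration contains frequency, and the job stats contain the data_counts timestamps.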

Hi,

I am running version 5.6.3.

Here's a detailed example:

bucket_span: 900s
query_delay: 1800s
frequency: 450s

before revert:

job   "latest_record_timestamp": 1510012781000 => 06/11/2017 23:59:41
job   "last_data_time": 1510015052888 => 07/11/2017 00:37:32

Reverting to:

model id: 1510001553 => 06/11/2017 20:52:33
model "latest_record_time_stamp": 1509998397000 => 06/11/2017 19:59:57
model "latest_result_time_stamp": 1509997500000 => 06/11/2017 19:45:00

after revert:

job  "latest_record_timestamp": 1509998397000 => 06/11/2017 19:59:57
job  "last_data_time": 1510015052888 => 07/11/2017 00:37:32

After restarting the feed (starting from timestamp 0), I have no data on the buckets at 19:45 and 20:00.

Hi Thomas,

I have tried to recreate a similar situation but couldn't reproduce the issue, so I think we'll need to understand more details about this job.

In general, when the model snapshot is reverted and its results deleted, once the job is reopened, time should have rolled back to what the snapshot knows. Then, once the datafeed is started, it should perform a search for data up to now - query_delay. In your example, now should be close to midnight. That should have definitely created results for the buckets in question.
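To illustrate with your numbers, the lookback search the datafeed issues after the restart should effectively cover a range like the one below. This is only a sketch, not the exact request the datafeed builds, and the index and timestamp field names are placeholders:

GET <index>/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": 1509998397000,
        "lt": "now-1800s"
      }
    }
  }
}

Here 1509998397000 is the reverted snapshot's latest_record_time_stamp (19:59:57) and now-1800s is now minus the query_delay, roughly 23:30 in your example, so the 19:45 and 20:00 buckets fall well inside that range.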

In order to understand more about what is happening, could you provide the job stats, as well as the logs from the node that has been running the job?

Another question is whether there is data to be pulled from after 19:59:57. It seems there is some operational reason for performing the model snapshot revert, so it would also be helpful to know whether you are taking any actions against the indices that hold the analysed data.
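For example, a quick check along these lines against the source index would show whether documents exist after that point (index and field names are placeholders):

GET <index>/_count
{
  "query": {
    "range": {
      "timestamp": {
        "gte": 1509998397000
      }
    }
  }
}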

Hi,

Thanks for your answer! I did some more testing on a healthy single-metric count job and observed some curious behavior:

I am using a job with a 1h bucket span. It is healthy, i.e. every bucket contains the full data for that time frame and no error has occurred.

1. When reverting to a snapshot whose latest record is at 15:03, the bucket starting at 15:00 still seems to contain the full data from 15:00 to 16:00, as it did before the revert. That is already odd, since the model shouldn't have known about the data past 15:03.
2. After restarting the job following the revert, the bucket starting at 15:00 now only contains a small amount of data (I assume just the data from 15:00 to 15:03), although the 16:00 bucket contains the full data.
3. I then reverted the job to the same snapshot again. This time the bucket starting at 15:00 only contains the same small amount of data it did after the previous restart.
4. If I restart the job again, then as before the 15:00 bucket doesn't receive any new data and the 16:00 bucket contains the full data as expected.

Here are the relevant logs (note that some dates are CET, i.e. UTC+1, and some are UTC):

[2017-11-08T16:31:42,048][INFO ][o.e.x.m.a.RevertModelSnapshotAction$TransportAction] [<nodeid>] Reverting to snapshot '1510063705'
[2017-11-08T16:31:42,542][INFO ][o.e.x.m.a.RevertModelSnapshotAction$TransportAction] [<nodeid>] Deleting results after 'Tue Nov 07 15:00:00 CET 2017'
[2017-11-08T16:31:50,576][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [<nodeid>] Opening job [<jobid>]
[2017-11-08T16:31:50,600][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [<nodeid>] [<jobid>] Loading model snapshot [1510063705] with latest_record_timestamp [2017-11-07T14:03:33.000Z], job latest_record_timestamp [2017-11-07T14:03:33.000Z]
[2017-11-08T16:31:50,601][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcessFactory] Restoring quantiles for job '<jobid>'
[2017-11-08T16:31:50,899][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [<jobid>] [autodetect/19951] [CAnomalyDetector.cc@974] Processing is already complete to time 1510063200
[2017-11-08T16:31:51,993][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [<nodeid>] Successfully set job state to [opened] for job [<jobid>]
[2017-11-08T16:35:31,856][INFO ][o.e.x.m.d.DatafeedManager] Starting datafeed [datafeed-<jobid>] for job [<jobid>] in [1970-01-01T00:00:00.000Z, forever)
[2017-11-08T16:35:33,777][INFO ][o.e.x.m.d.DatafeedJob    ] [<jobid>] Lookback has finished

What job stats would you be interested in? The data I provided in my previous message already included the timestamps from the job stats before and after the revert.

Hi Thomas,

Could you please paste the job and datafeed configurations? You can find them in the JSON tab in the UI.
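If it is easier, they can also be retrieved via the API with something like the following (the ids are placeholders):

GET _xpack/ml/anomaly_detectors/<jobid>
GET _xpack/ml/datafeeds/datafeed-<jobid>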

In addition, could you explain whether each of those 1-hour buckets contains multiple points in your data, and whether those points are spread across the hour or polled at a specific moment?

Finally, it would be really useful if you could paste the latest bucket available from the get-buckets endpoint, right after you have done a model-snapshot revert.
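A request along these lines should return just that latest bucket (the parameters here are only a suggestion):

GET _xpack/ml/anomaly_detectors/<jobid>/results/buckets
{
  "sort": "timestamp",
  "desc": true,
  "page": { "size": 1 }
}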

Every bucket is displayed as a single point in the Kibana UI, exactly at xx:00:00.

Here's the info you asked for:

Job config:

Detectors
count

Analysis Config
bucket_span: 60m
summary_count_field_name: doc_count

Data Description
time_field: timestamp
time_format: epoch_ms

Datafeed config:

datafeed_id: datafeed-<jobid>
job_id: <jobid>
query_delay: 1800s
frequency: 600s
indices: <index>
types: <types>
query: {"match_all":{"boost":1}}
aggregations: {"buckets":{"date_histogram":{"field":"timestamp","interval":900000,"offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0},"aggregations":{"timestamp":{"max":{"field":"timestamp"}}}}}
scroll_size: 1000
chunking_config: {"mode":"manual","time_span":"600000000ms"}
state: stopped

This is the data for the latest bucket following a revert:

{
  "job_id": "<jobid>",
  "timestamp": 1510063200000,
  "anomaly_score": 0,
  "bucket_span": 3600,
  "initial_anomaly_score": 0,
  "event_count": 159,
  "is_interim": false,
  "bucket_influencers": [],
  "processing_time_ms": 6,
  "result_type": "bucket"
},

Also, since this initial oddity only happens once, I reproduced it on another job (same settings, different index) and tracked the specific bucket surrounding the snapshot time:

Model snapshot data:

"latest_record_time_stamp": 1509464352000,
"latest_result_time_stamp": 1509462000000,

Before revert:

{
  "job_id": "<jobid2>",
  "timestamp": 1509462000000,
  "anomaly_score": 0,
  "bucket_span": 3600,
  "initial_anomaly_score": 0,
  "event_count": 12428,
  "is_interim": false,
  "bucket_influencers": [],
  "processing_time_ms": 2,
  "result_type": "bucket"
},

Right after revert:

{
  "job_id": "<jobid2>",
  "timestamp": 1509462000000,
  "anomaly_score": 0,
  "bucket_span": 3600,
  "initial_anomaly_score": 0,
  "event_count": 12428,
  "is_interim": false,
  "bucket_influencers": [],
  "processing_time_ms": 2,
  "result_type": "bucket"
},

After restarting the datafeed:

{
  "job_id": "<jobid2>",
  "timestamp": 1509462000000,
  "anomaly_score": 0,
  "bucket_span": 3600,
  "initial_anomaly_score": 0,
  "event_count": 7598,
  "is_interim": false,
  "bucket_influencers": [],
  "processing_time_ms": 2,
  "result_type": "bucket"
},

As you can see, the event_count was still at its pre-revert value right after the revert and only decreased after the datafeed was restarted.
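(For reference, a single bucket like the ones above can be fetched with a request along these lines, with start and end set to the bucket's epoch millis and the start of the following bucket; the values below correspond to the 1509462000000 bucket:)

GET _xpack/ml/anomaly_detectors/<jobid2>/results/buckets
{
  "start": "1509462000000",
  "end": "1509465600000"
}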

Hi Thomas,

Thank you very much for that info. It helped immensely. We have indeed discovered an issue that causes the problem you observe.

I have raised an issue internally and we should hopefully have a fix in one of the upcoming releases.

Apologies for any inconvenience and thank you once more for helping us make the software better!

Kind regards,
Dimitris

Hi Dimitris,

Thanks a lot, I will be looking forward to the fix then.

Thomas
