Any ML job I create runs for the first time and then stops and never runs

machine-learning

(Kaushik Sarkar) #1

Symptom -

Any ML job I create runs for the first time and then stops and never runs.

Here is an example Job -

{ "job_id": "tibco-celect", "job_type": "anomaly_detector", "job_version": "6.0.0", "description": "Celect Trends", "create_time": 1528730826156, "finished_time": 1528730828213, "analysis_config": { "bucket_span": "15m", "summary_count_field_name": "doc_count", "detectors": [ { "detector_description": "non_zero_count", "function": "non_zero_count", "detector_rules": [], "detector_index": 0 } ], "influencers": [] }, "data_description": { "time_field": "timestamp", "time_format": "epoch_ms" }, "model_plot_config": { "enabled": true }, "model_snapshot_retention_days": 1, "model_snapshot_id": "1528823700", "results_index_name": "custom-tibco-celect", "data_counts": { "job_id": "tibco-celect", "processed_record_count": 128, "processed_field_count": 128, "input_bytes": 5577, "input_field_count": 128, "invalid_date_count": 0, "missing_field_count": 0, "out_of_order_timestamp_count": 0, "empty_bucket_count": 2, "sparse_bucket_count": 0, "bucket_count": 128, "earliest_record_timestamp": 1528689731828, "latest_record_timestamp": 1528810168631, "last_data_time": 1528810335774, "latest_empty_bucket_timestamp": 1528705800000, "input_record_count": 128 }, "model_size_stats": { "job_id": "tibco-celect", "result_type": "model_size_stats", "model_bytes": 66506, "total_by_field_count": 3, "total_over_field_count": 0, "total_partition_field_count": 2, "bucket_allocation_failures_count": 0, "memory_status": "ok", "log_time": 1528823850000, "timestamp": 1528823700000 }, "datafeed_config": { "datafeed_id": "datafeed-tibco-celect", "job_id": "tibco-celect", "query_delay": "60ms", "frequency": "150s", "indices": [ "tibauditxml-" ], "types": [ "doc" ], "query": { "bool": { "must": [ { "query_string": { "query": "", "fields": [], "type": "best_fields", "default_operator": "or", "max_determinized_states": 10000, "enable_position_increments": true, "fuzziness": "AUTO", "fuzzy_prefix_length": 0, "fuzzy_max_expansions": 50, "phrase_slop": 0, "analyze_wildcard": true, "escape": false, "boost": 1 } }, { "match_phrase": { "EventDescription": { "query": "From Celect: Message sent to sterling successfully", "slop": 0, "boost": 1 } } } ], "adjust_pure_negative": true, "boost": 1 } }, "aggregations": { "buckets": { "date_histogram": { "field": "timestamp", "interval": 900000, "offset": 0, "order": { "_key": "asc" }, "keyed": false, "min_doc_count": 0 }, "aggregations": { "timestamp": { "max": { "field": "timestamp" } } } } }, "scroll_size": 1000, "chunking_config": { "mode": "manual", "time_span": "900000000ms" }, "state": "stopped" }, "state": "opened", "node": { "id": "w-t1K8AzSmm5UBwmQ7FpWg", "name": "w-t1K8A", "ephemeral_id": "4VahUs23RTGO0js5rRT_Gw", "transport_address": "10.75.19.172:9300", "attributes": { "ml.max_open_jobs": "10", "ml.enabled": "true" } }, "open_time": "260859s" }

Job logs -
2018-06-11 11:27:06 w-t1K8A Job created
2018-06-11 11:27:06 w-t1K8A Loading model snapshot [N/A], job latest_record_timestamp [N/A]
2018-06-11 11:27:06 w-t1K8A Opening job on node [{w-t1K8A}{w-t1K8AzSmm5UBwmQ7FpWg}{4VahUs23RTGO0js5rRT_Gw}{10.75.19.172}{10.75.19.172:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-06-11 11:27:07 w-t1K8A Starting datafeed [datafeed-tibco-celect] on node [{w-t1K8A}{w-t1K8AzSmm5UBwmQ7FpWg}{4VahUs23RTGO0js5rRT_Gw}{10.75.19.172}{10.75.19.172:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-06-11 11:27:07 w-t1K8A Datafeed started (from: 2018-06-11T03:26:40.100Z to: 2018-06-11T15:26:40.102Z)
2018-06-11 11:27:08 w-t1K8A Datafeed lookback completed
2018-06-11 11:27:08 w-t1K8A Datafeed stopped
2018-06-11 11:27:08 w-t1K8A Job is closing
2018-06-11 11:27:22 w-t1K8A Loading model snapshot [1528730828] with latest_record_timestamp [2018-06-11T15:14:49.000Z], job latest_record_timestamp [2018-06-11T15:14:49.575Z]
2018-06-11 11:27:22 w-t1K8A Opening job on node [{w-t1K8A}{w-t1K8AzSmm5UBwmQ7FpWg}{4VahUs23RTGO0js5rRT_Gw}{10.75.19.172}{10.75.19.172:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-06-11 11:27:23 w-t1K8A Starting datafeed [datafeed-tibco-celect] on node [{w-t1K8A}{w-t1K8AzSmm5UBwmQ7FpWg}{4VahUs23RTGO0js5rRT_Gw}{10.75.19.172}{10.75.19.172:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-06-11 11:27:23 w-t1K8A Datafeed started (from: 2018-06-11T15:14:49.576Z to: real-time)
2018-06-11 11:36:55 w-t1K8A Datafeed stopped
2018-06-12 09:32:11 w-t1K8A Starting datafeed [datafeed-tibco-celect] on node [{w-t1K8A}{w-t1K8AzSmm5UBwmQ7FpWg}{4VahUs23RTGO0js5rRT_Gw}{10.75.19.172}{10.75.19.172:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-06-12 09:32:11 w-t1K8A Datafeed started (from: 2018-06-11T16:15:00.000Z to: real-time)
2018-06-12 09:32:16 w-t1K8A Datafeed continued in real-time
2018-06-12 09:32:16 w-t1K8A Datafeed lookback completed
2018-06-12 09:33:41 w-t1K8A Datafeed stopped
2018-06-12 09:34:56 w-t1K8A Starting datafeed [datafeed-tibco-celect] on node [{w-t1K8A}{w-t1K8AzSmm5UBwmQ7FpWg}{4VahUs23RTGO0js5rRT_Gw}{10.75.19.172}{10.75.19.172:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-06-12 09:34:56 w-t1K8A Datafeed started in real-time
2018-06-12 10:55:00 w-t1K8A Datafeed has been retrieving no data for a while
2018-06-12 14:34:14 w-t1K8A Datafeed stopped


(rich collier) #2

did you change your query_delay value manually? I see in your job config:

"query_delay":"60ms",

The above value is likely too small. The default is 60s


(rich collier) #3

Spoke to this user in a different communications channel. The root cause of the issue was a subtle bug in the datafeed configuration that was introduced in v6.0 and fixed soon after (in 6.0.2 and from 6.1.0 onwards).

To workaround the problem, simply set datafeed frequency equal to bucket_span for count functions


(Kaushik Sarkar) #4

Yes this work very well. Thank you. Elasticsearch 6.0.0 have this bug.


(Mark Walkom) #5