Machine learning doesn't work but shows no error

machine-learning

#1

Hi, we are running Elastic Stack version 6.1.1 in production and are now testing X-Pack with machine learning.

I have just run a basic job: a count of all data with a bucket span of 15m. I clicked "create job" and ran it. It doesn't show any errors, and after a while it finishes with no results.

But in job management the status is failed.

|2018-05-22 11:15:09|YDSYNs-|Job created|
|2018-05-22 11:15:09|YDSYNs-|Opening job on node [{YDSYNs-}{YDSYNs-ZRtO_PEr1h0Lg8w}{3EMovHDWSDOkoqTYLZ2uZw}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=3874136064, ml.max_open_jobs=20, ml.enabled=true}]|
|2018-05-22 11:15:09|YDSYNs-|Loading model snapshot [N/A], job latest_record_timestamp [N/A]|
|2018-05-22 11:15:10|YDSYNs-|Starting datafeed [datafeed-pokus11] on node [{YDSYNs-}{YDSYNs-ZRtO_PEr1h0Lg8w}{3EMovHDWSDOkoqTYLZ2uZw}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=3874136064, ml.max_open_jobs=20, ml.enabled=true}]|
|2018-05-22 11:15:10|YDSYNs-|Datafeed started (from: 2018-05-21T09:14:10.390Z to: 2018-05-22T09:14:10.393Z) with frequency [900000ms]|
|2018-05-22 11:15:10|YDSYNs-|Datafeed lookback completed|
|2018-05-22 11:15:10|YDSYNs-|Datafeed stopped|

Any idea why?


#2

{"job_id":"pokus11","job_type":"anomaly_detector","job_version":"6.1.1","create_time":1526980508050,"analysis_config":{"bucket_span":"15m","summary_count_field_name":"doc_count","detectors":[{"detector_description":"count","function":"count","detector_rules":[],"detector_index":0}],"influencers":[]},"analysis_limits":{"model_memory_limit":"10mb"},"data_description":{"time_field":"@timestamp","time_format":"epoch_ms"},"model_plot_config":{"enabled":true},"model_snapshot_retention_days":1,"results_index_name":"custom-pokus11","data_counts":{"job_id":"pokus11","processed_record_count":95,"processed_field_count":95,"input_bytes":4359,"input_field_count":95,"invalid_date_count":0,"missing_field_count":0,"out_of_order_timestamp_count":0,"empty_bucket_count":0,"sparse_bucket_count":0,"bucket_count":94,"earliest_record_timestamp":1526894999000,"latest_record_timestamp":1526979598360,"last_data_time":1526980510645,"input_record_count":95},"model_size_stats":{"job_id":"pokus11","result_type":"model_size_stats","model_bytes":0,"total_by_field_count":0,"total_over_field_count":0,"total_partition_field_count":0,"bucket_allocation_failures_count":0,"memory_status":"ok","log_time":1526981923804},"datafeed_config":{"datafeed_id":"datafeed-pokus11","job_id":"pokus11","query_delay":"111752ms","indices":["logstash*"],"types":[],"query":{"match_all":{"boost":1}},"aggregations":{"buckets":{"date_histogram":{"field":"@timestamp","interval":900000,"offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0},"aggregations":{"@timestamp":{"max":{"field":"@timestamp"}}}}},"scroll_size":1000,"chunking_config":{"mode":"manual","time_span":"900000000ms"},"state":"stopped"},"state":"failed"}


#3

Here is the Elasticsearch log:

[2018-05-22T13:17:39,911][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [YDSYNs-] Opening job [pokus3]
[2018-05-22T13:17:39,917][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [YDSYNs-] [pokus3] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2018-05-22T13:17:40,627][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [pokus3] [autodetect/20328] [CResourceMonitor.cc@104] Setting model memory limit to 4096 MB
[2018-05-22T13:17:40,628][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [pokus3] [autodetect/20328] [CLimits.h@134] Using default value (10000) for unspecified setting autoconfig.events
[2018-05-22T13:17:40,629][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [pokus3] [autodetect/20328] [CLimits.h@134] Using default value (3.5) for unspecified setting results.unusualprobability
[2018-05-22T13:17:40,629][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [pokus3] [autodetect/20328] [CResourceMonitor.cc@104] Setting model memory limit to 10 MB
[2018-05-22T13:17:40,709][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [YDSYNs-] Successfully set job state to [opened] for job [pokus3]
[2018-05-22T13:17:40,849][INFO ][o.e.x.m.a.PutDatafeedAction$TransportAction] [YDSYNs-] Created datafeed [datafeed-pokus3]
[2018-05-22T13:17:41,108][INFO ][o.e.x.m.d.DatafeedJob    ] [pokus3] Datafeed started (from: 2018-05-21T11:17:30.469Z to: 2018-05-22T11:17:30.472Z) with frequency [900000ms]
[2018-05-22T13:17:41,205][ERROR][o.e.x.m.j.p.a.NativeAutodetectProcess] [pokus3] autodetect process stopped unexpectedly:.
[2018-05-22T13:17:41,206][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcess] [pokus3] State output finished
[2018-05-22T13:17:41,206][WARN ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [pokus3] some results not processed due to the termination of autodetect
[2018-05-22T13:17:41,208][INFO ][o.e.x.m.d.DatafeedJob    ] [pokus3] Lookback has finished
[2018-05-22T13:17:41,209][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] attempt to stop datafeed [datafeed-pokus3] for job [pokus3]
[2018-05-22T13:17:41,209][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] try lock [20s] to stop datafeed [datafeed-pokus3] for job [pokus3]...
[2018-05-22T13:17:41,209][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] stopping datafeed [datafeed-pokus3] for job [pokus3], acquired [true]...
[2018-05-22T13:17:41,209][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] datafeed [datafeed-pokus3] for job [pokus3] has been stopped
[2018-05-22T13:17:41,251][ERROR][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/18599] [CDetachedProcessSpawner.cc@245] Child process with PID 20328 was terminated by signal 4
[2018-05-22T13:17:41,368][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [YDSYNs-] Successfully set job state to [failed] for job [pokus3]

#4

I have run

  [root@blek elasticsearch]# plugins/x-pack/platform/linux-x86_64/bin/autodetect --version

it returns

  Model State Version 34
  Quantile State Version 3
  autodetect (64 bit): Version 6.1.1 (Build c508cf991ee61c) Copyright (c) 2017 Elasticsearch BV

and

 [root@blek elasticsearch]# plugins/x-pack/platform/linux-x86_64/bin/controller --version

returns

controller (64 bit): Version 6.1.1 (Build c508cf991ee61c) Copyright (c) 2017 Elasticsearch BV

And

[root@blek elasticsearch]# ps -ef |grep controller
elastic+ 18599 18527  0 12:48 ?        00:00:09 /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller

(David Roberts) #5

Signal 4 is SIGILL, which makes me wonder if the CPU in the machine you’re running ML on is too old to support the instruction sets we’re using.
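For anyone hitting a similar "terminated by signal N" line, here is a minimal Python sketch for turning the number into a name. The helper `signal_name_from_log` and its regular expression are illustrative assumptions based only on the log line quoted earlier in this thread; they are not part of Elasticsearch or X-Pack.

```python
import re
import signal

# Matches the controller message seen in the log above,
# e.g. "... was terminated by signal 4". The pattern is an assumption
# based on that single line, not a documented log format.
SIGNAL_RE = re.compile(r"terminated by signal (\d+)")


def signal_name_from_log(line):
    """Return the symbolic name of the signal mentioned in a log line, or None."""
    match = SIGNAL_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        # signal.Signals maps a signal number to its enum member on this platform.
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


line = (
    "[2018-05-22T13:17:41,251][ERROR][o.e.x.m.j.p.l.CppLogMessageHandler] "
    "[controller/18599] [CDetachedProcessSpawner.cc@245] "
    "Child process with PID 20328 was terminated by signal 4"
)
print(signal_name_from_log(line))
```

Signal numbering is platform-specific, so this should be run on the same kind of system that produced the log (Linux here).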

We’re using SSE4.2. There’s a list of the CPUs that support it here: https://en.wikipedia.org/wiki/SSE4#Supporting_CPUs

Is your CPU listed as supporting SSE4.2 (or newer than the ones on the list)?
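If looking the CPU model up in a list is inconvenient, a rough alternative on Linux is to check the feature flags the kernel reports in /proc/cpuinfo, where SSE4.2 normally appears as `sse4_2`. The sketch below assumes an x86 Linux machine; the function names `cpu_flags` and `supports_sse42` are made up for this example.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # On x86 Linux each processor entry has a "flags : ..." line.
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_sse42(cpuinfo_text):
    """True if the text lists the sse4_2 flag (assumes the x86 Linux format)."""
    return "sse4_2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        print("could not read /proc/cpuinfo (probably not Linux)")
    else:
        print("SSE4.2 supported:", supports_sse42(text))
```

If this reports that SSE4.2 is missing on the node running the job, that would be consistent with the SIGILL in the log.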

If not, would you be able to add a newer machine to your cluster to use as an ML node? You could set node.ml to false on all the other nodes to ensure ML jobs only get run on the newer machine.


#6

Ah, yes, that's the problem. In the future we will get a new one. :smiley:

Thank you for your help!

