Machine Learning Real-Time Job stopping after initial run


(Jonathan Spino) #1

Hello,

I am working to create some real-time machine learning jobs and I am having a problem getting the jobs to continue running after their initial analysis. I have looked through some similar postings for troubleshooting but I have not yet found what is causing the issue.

After creation of a job (single or multiple metric) the job runs successfully and results are produced. I have checked the box to continue the job in real-time but after a short period I receive a job message stating "Datafeed has been retrieving no data for a while".

I have checked the processes and it appears that both the autodetect and controller processes are still running but I believe the autodetect process may be stalled. If I run gstack 3 times on the autodetect process the resulting files are identical. The output of S1 is below.

> #0  0x00007fa03e45049d in read () from /lib64/libpthread.so.0
> #1  0x00007fa03a16d89f in boost::iostreams::detail::file_descriptor_impl::read(char*, long) () from /opt/elasticsearch-5.5.1/plugins/x-pack/platform/linux-x86_64/bin/../lib/./l
> ibboost_iostreams-gcc62-mt-1_62.so.1.62.0
> #2  0x00007fa03d6f4919 in boost::iostreams::detail::indirect_streambuf<boost::iostreams::file_descriptor_source, std::char_traits<char>, std::allocator<char>, boost::iostreams::input_seekable>::underflow() () from /opt/elasticsearch-5.5.1/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlCore.so
> #3  0x00007fa03c0b5da5 in std::basic_streambuf<char, std::char_traits<char> >::xsgetn(char*, long) () from /opt/elasticsearch-5.5.1/plugins/x-pack/platform/linux-x86_64/bin/../lib/libstdc++.so.6
> #4  0x00007fa03c09025b in std::istream::read(char*, long) () from /opt/elasticsearch-5.5.1/plugins/x-pack/platform/linux-x86_64/bin/../lib/libstdc++.so.6
> #5  0x00007fa03c3f7052 in ml::api::CLengthEncodedInputParser::refillBuffer() () from /opt/elasticsearch-5.5.1/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
> #6  0x00007fa03c3f7320 in ml::api::CLengthEncodedInputParser::parseStringFromStream(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () from /opt/elasticsearch-5.5.1/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
> #7  0x00007fa03c3f8730 in bool ml::api::CLengthEncodedInputParser::parseRecordFromStream<false, std::vector<boost::reference_wrapper<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<boost::reference_wrapper<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >(std::vector<boost::reference_wrapper<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<boost::reference_wrapper<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) () from /opt/elasticsearch-5.5.1/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
> #8  0x00007fa03c3f76e3 in ml::api::CLengthEncodedInputParser::readStream(boost::function2<bool, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, boost::unordered::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&> const&) () from /opt/elasticsearch-5.5.1/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
> #9  0x00007fa03c3a7751 in ml::api::CCmdSkeleton::ioLoop() () from /opt/elasticsearch-5.5.1/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
> #10 0x000000000040b031 in ?? ()
> #11 0x00007fa03b6d1b35 in __libc_start_main () from /lib64/libc.so.6
> #12 0x0000000000409569 in ?? ()
> #13 0x00007ffeacb6b688 in ?? ()
> #14 0x000000000000001c in ?? ()
> #15 0x0000000000000015 in ?? ()
> #16 0x00007ffeacb6bd3e in ?? ()
> #17 0x00007ffeacb6bd4b in ?? ()
> #18 0x00007ffeacb6bd61 in ?? ()
> #19 0x00007ffeacb6bd85 in ?? ()
> #20 0x00007ffeacb6bd96 in ?? ()
> #21 0x00007ffeacb6bdab in ?? ()
> #22 0x00007ffeacb6bdc3 in ?? ()
> #23 0x00007ffeacb6bdda in ?? ()
> #24 0x00007ffeacb6bdf2 in ?? ()
> #25 0x00007ffeacb6be0e in ?? ()
> #26 0x00007ffeacb6be59 in ?? ()
> #27 0x00007ffeacb6be6c in ?? ()
> #28 0x00007ffeacb6bea3 in ?? ()
> #29 0x00007ffeacb6bed4 in ?? ()
> #30 0x00007ffeacb6bf05 in ?? ()
> #31 0x00007ffeacb6bf13 in ?? ()
> #32 0x00007ffeacb6bf46 in ?? ()
> #33 0x00007ffeacb6bf55 in ?? ()
> #34 0x00007ffeacb6bf8a in ?? ()
> #35 0x00007ffeacb6bf9a in ?? ()
> #36 0x00007ffeacb6bfcf in ?? ()
> #37 0x0000000000000000 in ?? ()

The commands below both return successfully without error.
$ES_HOME/plugins/x-pack/platform/linux-x86_64/bin/controller --version
$ES_HOME/plugins/x-pack/platform/linux-x86_64/bin/autodetect --version

Thank you for taking a look. let me know if there is any more information that I can provide, I am out of characters so I will post a reply with additional information.


(Jonathan Spino) #2

The only other suggestion I have come across is an incorrect timestamp on the log messages. As far as I can tell all of the logs have a current time set, they are imported using a logstash job (JDBC Query) that runs once every 5 minutes. There are no errors or warnings in the ES logs but below is the full logs related to a real-time job (spino)

> [root@mbatles1 elasticsearch-5.5.1]# less logs/mbatles-ml.log | grep spino
> [2017-08-09T19:39:24,676][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [mbatles-ml1] Opening job [spino]
> [2017-08-09T19:39:24,678][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [mbatles-ml1] [spino] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
> [2017-08-09T19:39:24,825][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [mbatles-ml1] Successfully set job state to [opened] for job [spino]
> [2017-08-09T19:39:24,887][INFO ][o.e.x.m.a.PutDatafeedAction$TransportAction] [mbatles-ml1] Created datafeed [datafeed-spino]
> [2017-08-09T19:39:24,955][INFO ][o.e.x.m.d.DatafeedManager] Starting datafeed [datafeed-spino] for job [spino] in [2017-08-07T18:20:00.286Z, 2017-08-09T22:35:03.693Z)
> [2017-08-09T19:39:25,130][INFO ][o.e.x.m.j.p.DataCountsReporter] [mbatles-ml1] [spino] 10000 records written to autodetect; missingFieldCount=10004, invalidDateCount=0, outOfOrderCount=0
> [2017-08-09T19:39:25,230][INFO ][o.e.x.m.j.p.DataCountsReporter] [mbatles-ml1] [spino] 20000 records written to autodetect; missingFieldCount=20006, invalidDateCount=0, outOfOrderCount=0
> ...(redundant data removed)...
> [2017-08-09T19:39:32,208][INFO ][o.e.x.m.j.p.DataCountsReporter] [mbatles-ml1] [spino] 400000 records written to autodetect; missingFieldCount=400530, invalidDateCount=0, outOfOrderCount=0
> [2017-08-09T19:39:34,065][INFO ][o.e.x.m.j.p.DataCountsReporter] [mbatles-ml1] [spino] 500000 records written to autodetect; missingFieldCount=500578, invalidDateCount=0, outOfOrderCount=0
> [2017-08-09T19:39:35,936][INFO ][o.e.x.m.j.p.DataCountsReporter] [mbatles-ml1] [spino] 600000 records written to autodetect; missingFieldCount=600607, invalidDateCount=0, outOfOrderCount=0
> [2017-08-09T19:39:39,231][INFO ][o.e.x.m.j.p.DataCountsReporter] [mbatles-ml1] [spino] 700000 records written to autodetect; missingFieldCount=700938, invalidDateCount=0, outOfOrderCount=0
> [2017-08-09T19:39:41,056][INFO ][o.e.x.m.j.p.DataCountsReporter] [mbatles-ml1] [spino] 800000 records written to autodetect; missingFieldCount=800964, invalidDateCount=0, outOfOrderCount=0
> [2017-08-09T19:39:42,800][INFO ][o.e.x.m.j.p.DataCountsReporter] [mbatles-ml1] [spino] 900000 records written to autodetect; missingFieldCount=901004, invalidDateCount=0, outOfOrderCount=0
> [2017-08-09T19:39:44,918][INFO ][o.e.x.m.j.p.DataCountsReporter] [mbatles-ml1] [spino] 1000000 records written to autodetect; missingFieldCount=1001042, invalidDateCount=0, outOfOrderCount=0
> [2017-08-09T19:39:45,450][INFO ][o.e.x.m.d.DatafeedJob    ] [spino] Lookback has finished
> [2017-08-09T19:39:45,450][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] attempt to stop datafeed [datafeed-spino] for job [spino]
> [2017-08-09T19:39:45,450][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] try lock [20s] to stop datafeed [datafeed-spino] for job [spino]...
> [2017-08-09T19:39:45,450][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] stopping datafeed [datafeed-spino] for job [spino], acquired [true]...
> [2017-08-09T19:39:45,450][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] datafeed [datafeed-spino] for job [spino] has been stopped
> [2017-08-09T19:39:46,053][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [mbatles-ml1] Closing job [spino], because [close job (api)]
> [2017-08-09T19:39:46,055][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [spino] [autodetect/8154] [CCmdSkeleton.cc@70] Handled 1024791 records
> [2017-08-09T19:39:46,055][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [spino] [autodetect/8154] [CAnomalyDetector.cc@1704] Pruning all models
> [2017-08-09T19:39:46,222][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcess] [spino] State output finished
> [2017-08-09T19:39:46,227][INFO ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [spino] 628 buckets parsed from autodetect output
> [2017-08-09T19:39:46,665][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [spino] job closed
> [2017-08-09T19:39:56,922][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [mbatles-ml1] Opening job [spino]
> [2017-08-09T19:39:56,925][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [mbatles-ml1] [spino] Loading model snapshot [1502321986] with latest_record_timestamp [2017-08-09T22:35:03.000Z], job latest_record_timestamp [2017-08-09T22:35:03.692Z]
> [2017-08-09T19:39:56,925][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcessFactory] Restoring quantiles for job 'spino'
> [2017-08-09T19:39:57,052][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [spino] [autodetect/8207] [CAnomalyDetector.cc@981] Processing is already complete to time 1502318100
> [2017-08-09T19:39:57,162][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [mbatles-ml1] Successfully set job state to [opened] for job [spino]
> [2017-08-09T19:39:57,244][INFO ][o.e.x.m.d.DatafeedManager] Starting datafeed [datafeed-spino] for job [spino] in [1970-01-01T00:00:00.000Z, forever)
> [2017-08-09T19:42:30,132][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [spino] [autodetect/8207] [CAnomalyDetector.cc@1704] Pruning all models
> [2017-08-09T19:42:30,135][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [spino] [autodetect/8207] [CAnomalyDetector.cc@1271] Background persist starting data copy
> [2017-08-09T19:42:30,135][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [spino] [autodetect/8207] [CAnomalyDetector.cc@1257] Background persist starting background thread
> [2017-08-09T19:42:31,232][INFO ][o.e.x.m.j.UpdateJobProcessNotifier] [mbatles-ml1] Successfully updated remote job [spino]

(Dimitris Athanasiou) #3

Hi Jonathan,

The datafeed has 2 parameters that determine when it is querying for data: frequency and query_delay. You can read more about them in create-datafeed-API. In a nutshell, frequency is how often the feed queries elasticsearch for data and query_delay allows time for data to be indexed in elasticsearch by never querying for data after now - query_delay.

I suspect that in your case data is not available for search yet by the time the datafeed queries for them. As it seems you have polled data (every 5 minutes), I recommend setting frequency to 5 minutes and see what happens. If you still get no data, I suggest adjusting query_delay in order to find the right delay that it takes for you data to be indexed in elasticsearch.


(Jonathan Spino) #4

Hey Dmitri,

Thanks for the suggestions, it is right along the lines of what I was thinking. Unfortunately, I tried going as high as 600s for frequency and 360s for query_delay and the jobs still do not analyze any new data after their initial run. This is running in a single node POC environment with ES and Kibana running 5.5.1. Please let me know if there is any more information I can provide.

2017-08-10 13:59:44 mbatles-ml1 Datafeed started (from: 2017-08-10T16:55:00.038Z to: real-time)
2017-08-10 15:40:00 mbatles-ml1 Datafeed has been retrieving no data for a while
2017-08-11 02:09:03 mbatles-ml1 Datafeed stopped

datafeed_iddatafeed-product11111
job_idproduct11111
query_delay360s
frequency600s
indices: mb*
types: logs
query{"match_all":{"boost":1}}
scroll_size1000
chunking_config{"mode":"auto"}
statestopped

thank you,
Jonathan


(Dimitris Athanasiou) #5

Hi Jonathan

Another reason why this could be happening is that your time field has no timezone information making those timestamps to be either in the past or future, as elasticsearch treats timestamps without timezone as UTC. Could you post an example document from the mb* indices and indicate the time field you are using in the ML jobs please?


(Jonathan Spino) #6

Dmitri,

I was attempting these ML jobs on a POC node with a small subset of our data. I upgraded our main cluster to 5.5.1 and the same jobs run without issue there.

Just for educational purposes, the data on the POC node was being collected with a SQL Query running in logstash 2.4 and being sent to ES 5.5.1. Below is an example document, datetimestamp is being used as the time field.

{
  "_index": "mb-mldata",
  "_type": "logs",
  "_id": "AV3U_f1IBkyW9DuA3CUL",
  "_version": 1,
  "_score": null,
  "_source": {
    "member_id": "****************",
    "datetimestamp": "2017-08-12T04:49:59.700Z",
    "bureau": "**",
    "product": "*********",
    "resp_fname": "*****************",
    "resp_lname": "***********",
    "resp_ssn": "**************",
    "status": "OK",
    "app_id": "{*****************************}",
    "billing_info": 0,
    "reference": "********",
    "requestedby": "MBRVDWorkflow",
    "transid": null,
    "pricing_code": null,
    "@version": "1",
    "@timestamp": "2017-08-12T04:50:00.251Z",
    "message": "Audit Reply Query",
    "logsource": "Audit Reply"
  },
  "fields": {
    "datetimestamp": [
      1502513399700
    ],
    "@timestamp": [
      1502513400251
    ]
  },
  "sort": [
    1502513399700
  ]
}

(Dimitris Athanasiou) #7

Hi Jonathan,

I am glad you've got your use case working properly.

I am still curious about what was going on. I have to say that my assumption would be that data was not available to be fetched by the datafeed. Did you conclude what the issue was?


(Jonathan Spino) #8

Dimitri,

Sorry for the delay in response, I dug into ML on my production node and got caught up. In case anyone else finds this thread, the issue in my case was that in my production environment the JDBC job is sending the data into Graylog where the date/time is being updated to reflect the time the report was generated rather than when the report was collected. In the POC environment, the data was being sent directly from logstash to ES and the date change was not being made. This caused all of the events to appear in groups which I believe through off the ML watcher. I'm now using my production data that has been modified by Graylog and the jobs run correctly.

Thank you for your help,
Jonathan


(Dimitris Athanasiou) #9

Thank you for taking the time to explain the issue :slight_smile:


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.