Machine Learning with unstructured log data freezes


(Myles) #1

I have been using the new machine learning feature in 5.4 with great success on Apache access logs and system metrics, and wanted to try it out on my Tomcat logs. I have about 50 Tomcat applications, all logging in the same format and indexing to the same indexes. The "message" field of each event is overwritten during grok parsing so that it contains only the actual message text left over after the timestamp, log level, Java class, etc. are extracted.
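For context, the grok stage looks roughly like this. This is a simplified sketch with a hypothetical pattern (the actual pattern in use differs):

```
filter {
  grok {
    # Hypothetical pattern: pull out the timestamp, level, and class,
    # then overwrite "message" with only the remaining text.
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{JAVACLASS:logger} - %{GREEDYDATA:message}" }
    overwrite => [ "message" ]
  }
}
```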

I originally couldn't start the job until I added "_source": true to the datafeed. After that I was able to start the job, but no matter what I do it freezes after only a few buckets. Worse, once frozen it cannot be stopped, closed, or deleted until I restart Elasticsearch on the ML node it was assigned to.

Here is my job config:

{
  "job_id": "log-tomcat",
  "job_type": "anomaly_detector",
  "description": "Abnormal log events per tomcat container",
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "detectors": [
      {
        "detector_description": "ML Categories (log-tomcat)",
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "container",
        "detector_rules": []
      }
    ],
    "influencers": [ "host", "level", "logger" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "model_snapshot_retention_days": 1,
  "datafeed_config": {
    "query_delay": "150s",
    "frequency": "60s",
    "indexes": [ "message" ],
    "types": [ "tomcat" ],
    "query": {
      "bool": {
        "must": [
          { "term": { "environment": { "value": "production", "boost": 1 } } },
          { "term": { "service": { "value": "tomcat", "boost": 1 } } }
        ],
        "disable_coord": false,
        "adjust_pure_negative": true,
        "boost": 1
      }
    },
    "scroll_size": 1000,
    "_source": true,
    "chunking_config": {
      "mode": "auto"
    }
  }
}

Here is the JSON output of the job while it is frozen:

{
  "job_id": "log-tomcat",
  "job_type": "anomaly_detector",
  "description": "Abnormal log events per tomcat container",
  "create_time": 1494974221746,
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "detectors": [
      {
        "detector_description": "ML Categories (log-tomcat)",
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "container",
        "detector_rules": []
      }
    ],
    "influencers": [ "host", "level", "logger" ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "model_snapshot_retention_days": 1,
  "results_index_name": "shared",
  "data_counts": {
    "job_id": "log-tomcat",
    "processed_record_count": 7728,
    "processed_field_count": 38639,
    "input_bytes": 2106764,
    "input_field_count": 38640,
    "invalid_date_count": 0,
    "missing_field_count": 7729,
    "out_of_order_timestamp_count": 0,
    "empty_bucket_count": 0,
    "sparse_bucket_count": 0,
    "bucket_count": 14,
    "earliest_record_timestamp": 1494734400400,
    "latest_record_timestamp": 1494748795323,
    "last_data_time": 1494974263892,
    "input_record_count": 7728
  },
  "model_size_stats": {
    "job_id": "log-tomcat",
    "result_type": "model_size_stats",
    "model_bytes": 1516800,
    "total_by_field_count": 74,
    "total_over_field_count": 0,
    "total_partition_field_count": 12,
    "bucket_allocation_failures_count": 0,
    "memory_status": "ok",
    "log_time": 1494974263000,
    "timestamp": 1494745200000
  },
  "datafeed_config": {
    "datafeed_id": "datafeed-log-tomcat",
    "job_id": "log-tomcat",
    "query_delay": "150s",
    "frequency": "60s",
    "indexes": [ "message" ],
    "types": [ "tomcat" ],
    "query": {
      "bool": {
        "must": [
          { "term": { "environment": { "value": "production", "boost": 1 } } },
          { "term": { "service": { "value": "tomcat", "boost": 1 } } }
        ],
        "disable_coord": false,
        "adjust_pure_negative": true,
        "boost": 1
      }
    },
    "scroll_size": 1000,
    "_source": true,
    "chunking_config": {
      "mode": "auto"
    },
    "state": "started"
  },
  "state": "opened",
  "node": {
    "id": "syyOmfkhQ5qZ0BTuFoPWPA",
    "name": "elasticsearch01",
    "ephemeral_id": "utQPqCCdTayloJTBG5avcA",
    "transport_address": "10.255.72.128:9300",
    "attributes": {
      "ml.enabled": "true"
    }
  },
  "open_time": "378s"
}

Not sure what I'm doing wrong as I haven't heard of anyone else having this issue with unstructured log ML jobs. Thoughts?


(David Roberts) #2

Hi,

It would be helpful if you could check whether the job is genuinely frozen or stuck in a busy spin.

Each ML job runs a process called autodetect that does the analytics.

If you're on Mac or Linux, please use the top command to see whether the autodetect process associated with the job is using 100% CPU while it's "frozen". If you have lots of jobs running, you can use something like ps -ef | grep autodetect | grep log-tomcat to work out which autodetect process belongs to this job.

If top shows that the autodetect process for the job is going into some sort of busy loop then the next stage would be to try to figure out what it's doing whilst in this state. If you're on Linux and are allowed to install the gdb package then you can use the gstack utility to tell us this. Find the PID of the autodetect for the job. Then, while it's in the "frozen" state, run gstack $PID > s1, gstack $PID > s2, gstack $PID > s3. Then post the contents of the 3 output files, as it will hopefully enable us to work out what's gone wrong.

If autodetect is not spinning at 100% CPU or you're not on Linux let us know and we'll suggest something else.


(Myles) #3

I am running this in a non-prod environment on RHEL 7 hosts. I re-ran the job using the same JSON posted earlier, and it promptly "froze" again, after which I did the following:

elastic+  32105 100917  0 09:47 ?        00:00:00 ./autodetect --jobid=log-tomcat --licenseValidation=881105361855967 --bucketspan=1800 --lengthEncodedInput --maxAnomalyRecords=500 --timefield=@timestamp --persistInterval=13340 --maxQuantileInterval=24140 --ignoreDowntime --fieldconfig=/tmp/fieldconfig7049014473090704320.conf --logPipe=/tmp/autodetect_log-tomcat_log_100790 --input=/tmp/autodetect_log-tomcat_input_100790 --inputIsPipe --output=/tmp/autodetect_log-tomcat_output_100790 --outputIsPipe --persist=/tmp/autodetect_log-tomcat_persist_100790 --persistIsPipe
[root@elasticsearch05 ~]# gstack 32105 > s1
[root@elasticsearch05 ~]# gstack 32105 > s2
[root@elasticsearch05 ~]# gstack 32105 > s3
[root@elasticsearch05 ~]# diff -s s1 s2
Files s1 and s2 are identical
[root@elasticsearch05 ~]# diff -s s1 s3
Files s1 and s3 are identical

Here is the top output for the autodetect process:

Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.4 us,  0.2 sy,  0.0 ni, 94.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65691860 total,   928668 free, 32763548 used, 31999644 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 31907568 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 32105 elastic+  20   0   91552  21736  10688 S   0.0  0.0   0:00.49 autodetect

As you can see, it does appear to be in some kind of "frozen" state rather than a busy spin. In previous attempts I let it run overnight, and by the next morning no new events had been processed beyond the first few buckets.


(Myles) #4

I just now noticed that the arguments on the process command line refer to files in /tmp, so I should mention that I mount /tmp noexec per our enterprise security requirements. Is there a way to configure a different directory for these files? I have already had to do this for jna.tmpdir in the jvm.options for Elasticsearch:

-Djna.tmpdir=/var/data/elasticsearch/tmp

Is there something similar I can set for autodetect's tmp directory? Also, I do not have this issue with my access-log jobs, which pass similar arguments to autodetect but do not use categorization_field_name or mlcategory in the detectors.


(David Kyle) #5

Myles,

You can change the tmp directory with
-Djava.io.tmpdir=/var/data/elasticsearch/tmp

The named pipes used to communicate with the autodetect process are created in the tmp directory, so it must be readable and writable by the Elasticsearch user.


(David Roberts) #6

Myles,

If changing java.io.tmpdir doesn't fix your problem, could you please paste the contents of the file s1 into a comment? That will show us where the autodetect process is stalled. However, it's possible that the reason it's stalled is that it isn't being sent any input. This could happen if something is going wrong with the datafeed that feeds the job. If that is what's happening, there could well be error messages in the main Elasticsearch log file (the one named after your cluster in $ES_HOME/logs). Are there errors or warnings in this log, especially any that contain the word "datafeed"?


(Myles) #7

I set java.io.tmpdir to /var/data/elasticsearch/tmp in jvm.options and did a rolling restart. The job behaved the same way this time; however, I now have a stack trace in cluster.log, which I am trying to post but am limited to 7k characters per post.


(Myles) #8

Here is the job output and the top third of the error:

[2017-05-17T11:50:54,258][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [elasticsearch06] [log-tomcat] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2017-05-17T11:50:54,390][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [elasticsearch06] Successfully set job state to [opened] for job [log-tomcat]
[2017-05-17T11:51:00,561][INFO ][o.e.x.m.d.DatafeedManager] Starting datafeed [datafeed-log-tomcat] for job [log-tomcat] in [1970-01-01T00:00:00.000Z, forever)
[2017-05-17T11:51:01,381][INFO ][o.e.x.m.j.p.DataCountsReporter] [elasticsearch06] [log-tomcat] 10000 records written to autodetect; missingFieldCount=10000, invalidDateCount=0, outOfOrderCount=0
[2017-05-17T11:51:01,767][INFO ][o.e.x.m.j.p.DataCountsReporter] [elasticsearch06] [log-tomcat] 20000 records written to autodetect; missingFieldCount=20000, invalidDateCount=0, outOfOrderCount=0
[2017-05-17T11:51:02,156][INFO ][o.e.x.m.j.p.DataCountsReporter] [elasticsearch06] [log-tomcat] 30000 records written to autodetect; missingFieldCount=30000, invalidDateCount=0, outOfOrderCount=0
[2017-05-17T11:51:02,610][ERROR][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [log-tomcat] error parsing autodetect output
java.lang.IllegalArgumentException: Document contains at least one immense term in field="regex" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[46, 42, 63, 73, 110, 112, 117, 116, 46, 43, 63, 83, 116, 114, 101, 97, 109, 46, 43, 63, 102, 114, 111, 109, 46, 43, 63, 80, 85, 84]...', original message: bytes can be at most 32766 in length; got 151120
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:806) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1569) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1314) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:639) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:583) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:483) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:560) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:549) ~[elasticsearch-5.4.0.jar:5.4.0]

(Myles) #9

Here is the gstack output:

[root@elasticsearch06 ~]# cat s1
#0  0x00007f902e41d420 in __write_nocancel () from /lib64/libpthread.so.0
#1  0x00007f902a0ad8eb in boost::iostreams::detail::file_descriptor_impl::write(char const*, long) () from /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/../lib/./libboost_iostreams-gcc62-mt-1_62.so.1.62.0
#2  0x00007f902d6c3428 in boost::iostreams::detail::indirect_streambuf<boost::iostreams::file_descriptor_sink, std::char_traits<char>, std::allocator<char>, boost::iostreams::output_seekable>::overflow(int) () from /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlCore.so
#3  0x00007f902bfe658d in std::ostream::put(char) () from /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/../lib/libstdc++.so.6
#4  0x00007f902c32b67d in ?? () from /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
#5  0x00007f902c32682d in ml::api::CJsonOutputWriter::writeCategoryDefinition(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
#6  0x00007f902c30fdcd in ml::api::CFieldDataTyper::computeType(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, boost::unordered::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&) () from /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
#7  0x00007f902c31019d in ml::api::CFieldDataTyper::handleRecord(bool, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, boost::unordered::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&) () from /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
#8  0x00007f902c32e287 in ml::api::CLengthEncodedInputParser::readStream(bool, boost::function1<bool, boost::unordered::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&> const&, boost::function3<bool, bool, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, boost::unordered::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&> const&) () from /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
#9  0x00007f902c2e12e4 in ml::api::CCmdSkeleton::ioLoop() () from /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/../lib/libMlApi.so
#10 0x000000000040b12f in ?? ()
#11 0x00007f902b611b35 in __libc_start_main () from /lib64/libc.so.6
#12 0x00000000004095c9 in ?? ()
#13 0x00007ffee0778358 in ?? ()
#14 0x000000000000001c in ?? ()
#15 0x0000000000000012 in ?? ()
#16 0x00007ffee0779d50 in ?? ()
#17 0x00007ffee0779d5d in ?? ()
#18 0x00007ffee0779d70 in ?? ()
#19 0x00007ffee0779d94 in ?? ()
#20 0x00007ffee0779da6 in ?? ()
#21 0x00007ffee0779dbb in ?? ()
#22 0x00007ffee0779dd3 in ?? ()
#23 0x00007ffee0779dea in ?? ()
#24 0x00007ffee0779e02 in ?? ()
#25 0x00007ffee0779e1e in ?? ()
#26 0x00007ffee0779e2f in ?? ()
#27 0x00007ffee0779e7d in ?? ()
#28 0x00007ffee0779ec3 in ?? ()
#29 0x00007ffee0779f09 in ?? ()
#30 0x00007ffee0779f17 in ?? ()
#31 0x00007ffee0779f5f in ?? ()
#32 0x00007ffee0779f6e in ?? ()
#33 0x00007ffee0779fb8 in ?? ()
#34 0x0000000000000000 in ?? ()

(Myles) #10

Here is the "Caused by" portion of the stack trace:

Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: max_bytes_length_exceeded_exception: bytes can be at most 32766 in length; got 151120
	at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:263) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:149) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	... 37 more

(David Kyle) #11

Myles,

Thanks for the log file and the stack trace.

What is happening here is that, when categorization is used, the category result document contains a regex field that is mapped as a keyword; the regex enables finding the raw messages that match a category. One of the messages in your Tomcat log produces a regex longer than 32 KB, and Elasticsearch then fails to index the field because a keyword term cannot exceed 32766 bytes.

We will raise a ticket for the issue and fix it for the next release. Apologies for the inconvenience and thank you for the valuable feedback.

As a workaround, if you can find the message in the Tomcat log that is causing the problem, and it has a distinct segment that is what makes it long, you could consider using categorization filters to exclude that segment from the categorizer.
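Categorization filters are regular expressions applied to the categorization field before a message is categorized; matching substrings are removed from the categorizer's input. A minimal sketch of what that could look like in the job's analysis_config (the filter patterns here are hypothetical placeholders, not tested against these logs):

```json
"analysis_config": {
  "categorization_field_name": "message",
  "categorization_filters": [
    "\\tat .*",
    "Caused by: .*"
  ]
}
```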


(Myles) #12

Thanks, David (both Davids), for looking into this and for the explanation. Given the nature of the logs, and the fact that the job stalls almost immediately regardless of the start time I choose (I've tried several), I'd say a 32 KB regex is probably just not enough for Java application logs in general, as they can produce long stack traces.

If I had to guess without digging into it, the primary culprit is probably my SSO CAS application, which logs a very lengthy stack trace every time a user enters the wrong credentials. But I'd be willing to wager there are many such cases throughout my app logs, which span over 50 Tomcat containers in multi-node clusters across multiple environments. There are enough of them that filtering such events out of the analysis would probably defeat my purpose in running an ML job on them, since stack traces are usually what I'm after in most troubleshooting scenarios.
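One way to test that guess before filtering anything is to scan the raw log files for messages long enough to approach the limit. A minimal sketch (the log path in the usage example below is an assumption; adjust it to your layout):

```shell
# Print the line number and length of any log line longer than Lucene's
# 32766-byte keyword term limit (the cap reported in the indexing error).
# Note: awk's length() counts characters, which equals bytes for ASCII logs.
find_long_messages() {
  awk 'length($0) > 32766 { printf "line %d: %d bytes\n", NR, length($0) }' "$1"
}
```

For example, `find_long_messages /var/log/tomcat/catalina.out` (a hypothetical path). Since the generated category regex can be longer than the source message itself, lines somewhat under the limit may still overflow it, so treat this as a rough screen rather than an exact test.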


(system) #13

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.