[2021-01-05T06:57:16,374][INFO ][org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver] [instance-0000000002] Metadata Resolver FilesystemMetadataResolver cloud-saml-kibana-38f9e1f755ca418d949a169ec8b0d3e3: New metadata successfully loaded for '/app/config/cloud-saml-metadata-38f9e1f755ca418d949a169ec8b0d3e3.xml'
[2021-01-05T06:57:16,375][INFO ][org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver] [instance-0000000002] Metadata Resolver FilesystemMetadataResolver cloud-saml-kibana-38f9e1f755ca418d949a169ec8b0d3e3: Next refresh cycle for metadata provider '/app/config/cloud-saml-metadata-38f9e1f755ca418d949a169ec8b0d3e3.xml' will occur on '2021-01-06T06:57:16.373Z' ('2021-01-06T06:57:16.373Z' local time)
[2021-01-05T06:57:22,146][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] Opening job [teste_2]
[2021-01-05T06:57:23,829][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] [teste_2] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-01-05T06:57:26,431][INFO ][org.elasticsearch.xpack.security.authz.store.FileRolesStore] [instance-0000000002] parsed [15] roles from file [/app/config/roles.yml]
[2021-01-05T06:57:41,918][ERROR][org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory] [instance-0000000002] Failed to connect to autodetect for job teste_2
[2021-01-05T06:57:42,632][WARN ][org.elasticsearch.persistent.AllocatedPersistentTask] [instance-0000000002] task job-teste_2 failed with an exception
org.elasticsearch.ElasticsearchException: Failed to connect to autodetect for job teste_2
at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:55) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:105) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.create(AutodetectProcessManager.java:540) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.createProcessAndSetRunning(AutodetectProcessManager.java:491) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.access$200(AutodetectProcessManager.java:97) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager$2.doRun(AutodetectProcessManager.java:412) ~[?:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: java.io.FileNotFoundException: /tmp/elasticsearch-6577522640637707193/autodetect_teste_2_log_192 (No such file or directory)
at java.io.FileInputStream.open0(Native Method) ~[?:?]
at java.io.FileInputStream.open(FileInputStream.java:211) ~[?:?]
at java.io.FileInputStream.<init>(FileInputStream.java:153) ~[?:?]
at java.io.FileInputStream.<init>(FileInputStream.java:108) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
at java.security.AccessController.doPrivileged(AccessController.java:312) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
at org.elasticsearch.xpack.ml.process.ProcessPipes.connectLogStream(ProcessPipes.java:139) ~[?:?]
at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:86) ~[?:?]
at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:137) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:95) ~[?:?]
... 9 more
[2021-01-05T06:57:46,423][INFO ][org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver] [instance-0000000002] Metadata Resolver FilesystemMetadataResolver cloud-saml-kibana-38f9e1f755ca418d949a169ec8b0d3e3: New metadata successfully loaded for '/app/config/cloud-saml-metadata-38f9e1f755ca418d949a169ec8b0d3e3.xml'
[2021-01-05T06:57:46,424][INFO ][org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver] [instance-0000000002] Metadata Resolver FilesystemMetadataResolver cloud-saml-kibana-38f9e1f755ca418d949a169ec8b0d3e3: Next refresh cycle for metadata provider '/app/config/cloud-saml-metadata-38f9e1f755ca418d949a169ec8b0d3e3.xml' will occur on '2021-01-06T06:57:46.423Z' ('2021-01-06T06:57:46.423Z' local time)
Just to be absolutely clear: did you create the cluster the same day you saw this problem? I can think of a reason why you might have seen this problem if "just created" means you created the cluster 10+ days ago, did nothing with it for 10+ days, and then tried using it today (i.e. "just created" meant it was empty rather than created the same day). Obviously that would be a bug in ECE if that's the reason, but it's crucial to rule this idea in or out before moving on to more complex possibilities.
Assuming it's not that, we need to see more logs. Are there any log messages from around the time of the problem containing the word "controller"? For example, maybe something like this:
[2020-12-10T21:01:58,183][ERROR][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000003] [controller/540] [CDetachedProcessSpawner.cc@184] Child process with PID 1777 was terminated by signal 9
If not, then the next step is to try again with a higher logging level enabled:
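Something along these lines should work; DEBUG is usually enough, and TRACE gives even more detail:
// raise the level of the C++ process log handler dynamically (DEBUG or TRACE)
PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler": "DEBUG"
  }
}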
Then try the test again and send us every log message from around the time of the test that is from the org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler logger.
Yes. On another occasion, when trying to create an ML job for log anomalies from Kibana itself, the same error happened. See below:
[2021-01-05T07:19:30,757][ERROR][org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory] [instance-0000000002] Failed to connect to autodetect for job kibana-logs-ui-default-default-log-entry-rate
[2021-01-05T07:19:31,181][WARN ][org.elasticsearch.persistent.AllocatedPersistentTask] [instance-0000000002] task job-kibana-logs-ui-default-default-log-entry-rate failed with an exception
org.elasticsearch.ElasticsearchException: Failed to connect to autodetect for job kibana-logs-ui-default-default-log-entry-rate
at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:55) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:105) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.create(AutodetectProcessManager.java:540) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.createProcessAndSetRunning(AutodetectProcessManager.java:491) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.access$200(AutodetectProcessManager.java:97) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager$2.doRun(AutodetectProcessManager.java:412) ~[?:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: java.io.FileNotFoundException: /tmp/elasticsearch-6577522640637707193/autodetect_kibana-logs-ui-default-default-log-entry-rate_log_192 (No such file or directory)
at java.io.FileInputStream.open0(Native Method) ~[?:?]
at java.io.FileInputStream.open(FileInputStream.java:211) ~[?:?]
at java.io.FileInputStream.<init>(FileInputStream.java:153) ~[?:?]
at java.io.FileInputStream.<init>(FileInputStream.java:108) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
at java.security.AccessController.doPrivileged(AccessController.java:312) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
at org.elasticsearch.xpack.ml.process.ProcessPipes.connectLogStream(ProcessPipes.java:139) ~[?:?]
at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:86) ~[?:?]
at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:137) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:95) ~[?:?]
... 9 more
[2021-01-05T07:19:46,927][WARN ][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [controller/363] [CDetachedProcessSpawner.cc@193] Child process with PID 406 has exited with exit code 1
[2021-01-05T07:23:58,443][INFO ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [instance-0000000002] [gc][1866] overhead, spent [302ms] collecting in the last [1s]
I tried to enable a higher logging level, but it returned the following error:
"type" : "illegal_argument_exception",
"reason" : "transient setting [org.elasticsearch.xpack.ml.process.logging], not recognized"
[2021-01-05T07:19:46,927][WARN ][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [controller/363] [CDetachedProcessSpawner.cc@193] Child process with PID 406 has exited with exit code 1
OK, that narrows it down a bit, but sadly not much. These problems are very hard to solve because the autodetect process logging goes via Elasticsearch, so if for some reason autodetect exits or crashes before it connects to Elasticsearch, there is no logging to say why. I will think of some possible reasons for the failure and get back to you.
[2021-01-05T14:03:41,912][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] Opening job [test]
[2021-01-05T14:03:42,021][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] [test] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-01-05T14:03:54,260][ERROR][org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory] [instance-0000000002] Failed to connect to autodetect for job test
[2021-01-05T14:03:55,007][WARN ][org.elasticsearch.persistent.AllocatedPersistentTask] [instance-0000000002] task job-test failed with an exception
org.elasticsearch.ElasticsearchException: Failed to connect to autodetect for job test
at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:55) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:105) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.create(AutodetectProcessManager.java:540) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.createProcessAndSetRunning(AutodetectProcessManager.java:491) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.access$200(AutodetectProcessManager.java:97) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager$2.doRun(AutodetectProcessManager.java:412) ~[?:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: java.io.FileNotFoundException: /tmp/elasticsearch-6577522640637707193/autodetect_test_log_192 (No such file or directory)
at java.io.FileInputStream.open0(Native Method) ~[?:?]
at java.io.FileInputStream.open(FileInputStream.java:211) ~[?:?]
at java.io.FileInputStream.<init>(FileInputStream.java:153) ~[?:?]
at java.io.FileInputStream.<init>(FileInputStream.java:108) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
at java.security.AccessController.doPrivileged(AccessController.java:312) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
at org.elasticsearch.xpack.ml.process.ProcessPipes.connectLogStream(ProcessPipes.java:139) ~[?:?]
at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:86) ~[?:?]
at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:137) ~[?:?]
at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:95) ~[?:?]
... 9 more
[2021-01-05T14:04:00,321][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [controller/363] [CDetachedProcessSpawner.cc@291] Spawned './autodetect' with PID 411
[2021-01-05T14:04:10,443][WARN ][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [controller/363] [CDetachedProcessSpawner.cc@193] Child process with PID 411 has exited with exit code 1
In addition, I had to do a clean reinstallation of ECE due to a disk problem on my server; it was working perfectly before. I followed the documentation.
The only difference from the prerequisites is the Docker version:
Client: Docker Engine - Community
 Version:           20.10.1
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        831ebea
 Built:             Tue Dec 15 04:34:30 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          19.03.14
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       5eb3275d40
  Built:            Tue Dec 1 19:18:24 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
One possibility here is that the machine is running so slowly that it takes more than 10 seconds for the autodetect process to create the logging named pipe after startup. We have seen that happen when a virus checker is installed that slows down process startup a lot. It could possibly also happen if the allocator is completely overloaded by any process though. There is a setting xpack.ml.process_connect_timeout that could be increased from its default of 10s to something much larger like 100s to see if something like this is happening.
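For example, as a quick test you could set it transiently via the cluster settings API, something like this:
// 100s matches the example value mentioned above; adjust as needed
PUT _cluster/settings
{
  "transient": {
    "xpack.ml.process_connect_timeout": "100s"
  }
}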
Do you have enough access to the ECE allocator to check what's on the Docker container disks when this problem occurs? If you do it would be interesting to know if what is missing is just the file /tmp/elasticsearch-6577522640637707193/autodetect_test_log_192 or the whole directory /tmp/elasticsearch-6577522640637707193.
Should I change the elasticsearch.yml settings on the Elasticsearch node or on the dedicated ML node? Since the installation was done via ECE, I need a dedicated ML node.
There was no such directory. I created it manually, along with the file, with permissions for the elastic user, and the error persists.
I made the configuration persistent in the cluster settings, as below:
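(A sketch of the request meant here; the exact timeout value from the original post isn't shown, so 100s below simply follows the earlier suggestion.)
// the value is assumed from the earlier suggestion, not from the original post
PUT _cluster/settings
{
  "persistent": {
    "xpack.ml.process_connect_timeout": "100s"
  }
}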
Yes, that's the correct way to do it. As you found out, xpack.ml.process_connect_timeout is a dynamic cluster-wide setting. Any changes to the default should ideally be in cluster state, not a config file.
It seems that everything is back to normal now.
That is interesting. It means that ML jobs are very slow to start up on this allocator for some reason. It implies that the time taken to get from running the autodetect program to executing this line of code is greater than 10 seconds. Usually, it would only be a few milliseconds. I suspect the long delay is not actually in the code of the main() function, but in some other OS activity that takes place before main() is called.
Some things we have observed to cause such long startup times are:
When starting the process involves accessing a remote disk that stalls for a long time, for example an AWS EBS volume where a block that gets accessed has to be copied from a snapshot on first access, leading to a very long stall
Virus checkers that scan the autodetect process on startup, and access the internet when they do this
Very heavily overloaded machines, where for example a lot of swapping is taking place causing disk thrashing
There might be a completely different reason in your case. But it's worth being aware that this unexpected slowness might be a sign of some other problem you will suffer from in the future.
Everything is now well understood and clarified.
Something I remembered: I configured Elastic Agent to enable Endpoint collection. Do you believe that could be playing a part here too?
Yes, that could be the reason. We found on one of our internal QA systems that enabling file events in Elastic Agent resulted in the machine getting slowed down enormously when there was heavy activity on the Elasticsearch data directory. This was on a machine hosting a single node, but I imagine it would be even worse for an ECE allocator, as that's then potentially hosting many Elasticsearch data directories. We disabled file events for Agent in Fleet to resolve this.