Error Machine Learning Job in Elastic Cloud Enterprise "autodetect"

I get an error creating the most basic machine learning job.

In kibana it says "datafeed-teste_2 failed to start. Unexpected job state [failed] while waiting for job to be opened."

Im using Elastic Cloud Enterprise (2.7.1) and I just created this cluster.
Red Hat Enterprise Linux 8.0
ElasticStack 7.10.1

I already tried to follow this issues but without success: Erro creating ML job and Unable to create ML job

I would like some help with this problem.

Follow the es.log:

[2021-01-05T06:57:16,374][INFO ][org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver] [instance-0000000002] Metadata Resolver FilesystemMetadataResolver cloud-saml-kibana-38f9e1f755ca418d949a169ec8b0d3e3: New metadata successfully loaded for '/app/config/cloud-saml-metadata-38f9e1f755ca418d949a169ec8b0d3e3.xml'
        [2021-01-05T06:57:16,375][INFO ][org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver] [instance-0000000002] Metadata Resolver FilesystemMetadataResolver cloud-saml-kibana-38f9e1f755ca418d949a169ec8b0d3e3: Next refresh cycle for metadata provider '/app/config/cloud-saml-metadata-38f9e1f755ca418d949a169ec8b0d3e3.xml' will occur on '2021-01-06T06:57:16.373Z' ('2021-01-06T06:57:16.373Z' local time)
        [2021-01-05T06:57:22,146][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] Opening job [teste_2]
        [2021-01-05T06:57:23,829][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] [teste_2] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
        [2021-01-05T06:57:26,431][INFO ][org.elasticsearch.xpack.security.authz.store.FileRolesStore] [instance-0000000002] parsed [15] roles from file [/app/config/roles.yml]
        [2021-01-05T06:57:41,918][ERROR][org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory] [instance-0000000002] Failed to connect to autodetect for job teste_2

        [2021-01-05T06:57:42,632][WARN ][org.elasticsearch.persistent.AllocatedPersistentTask] [instance-0000000002] task job-teste_2 failed with an exception
        org.elasticsearch.ElasticsearchException: Failed to connect to autodetect for job teste_2
                at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:55) ~[?:?]
                at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:105) ~[?:?]
                at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.create(AutodetectProcessManager.java:540) ~[?:?]
                at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.createProcessAndSetRunning(AutodetectProcessManager.java:491) ~[?:?]
                at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.access$200(AutodetectProcessManager.java:97) ~[?:?]
                at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager$2.doRun(AutodetectProcessManager.java:412) ~[?:?]
                at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.1.jar:7.10.1]
                at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
                at java.lang.Thread.run(Thread.java:832) [?:?]
        Caused by: java.io.FileNotFoundException: /tmp/elasticsearch-6577522640637707193/autodetect_teste_2_log_192 (No such file or directory)
                at java.io.FileInputStream.open0(Native Method) ~[?:?]
                at java.io.FileInputStream.open(FileInputStream.java:211) ~[?:?]
                at java.io.FileInputStream.<init>(FileInputStream.java:153) ~[?:?]
                at java.io.FileInputStream.<init>(FileInputStream.java:108) ~[?:?]
                at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
                at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
                at java.security.AccessController.doPrivileged(AccessController.java:312) ~[?:?]
                at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
                at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
                at org.elasticsearch.xpack.ml.process.ProcessPipes.connectLogStream(ProcessPipes.java:139) ~[?:?]
                at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:86) ~[?:?]
                at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:137) ~[?:?]
                at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:95) ~[?:?]
                ... 9 more
        [2021-01-05T06:57:46,423][INFO ][org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver] [instance-0000000002] Metadata Resolver FilesystemMetadataResolver cloud-saml-kibana-38f9e1f755ca418d949a169ec8b0d3e3: New metadata successfully loaded for '/app/config/cloud-saml-metadata-38f9e1f755ca418d949a169ec8b0d3e3.xml'
        [2021-01-05T06:57:46,424][INFO ][org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver] [instance-0000000002] Metadata Resolver FilesystemMetadataResolver cloud-saml-kibana-38f9e1f755ca418d949a169ec8b0d3e3: Next refresh cycle for metadata provider '/app/config/cloud-saml-metadata-38f9e1f755ca418d949a169ec8b0d3e3.xml' will occur on '2021-01-06T06:57:46.423Z' ('2021-01-06T06:57:46.423Z' local time)

Just to be absolutely clear, you created the cluster the same day you saw this problem? I can think of a reason why you might have seen this problem if "just created" means you created the cluster 10+ days ago, did nothing with it for 10+ days, then tried using it today (so "just created" meant it was empty rather than it was created same day). Obviously that would be a bug in ECE if that's the reason, but it's crucial to rule this idea in or out before moving on to more complex possibilities.

Assuming it's not that then we need to see more logs. Are there any log messages from around the time of the problem containing the word "controller"? For example, maybe something like this:

[2020-12-10T21:01:58,183][ERROR][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000003] [controller/540] [CDetachedProcessSpawner.cc@184] Child process with PID 1777 was terminated by signal 9

If not then the next step is to try again with a higher logging level enabled:

PUT _cluster/settings
{
    "transient" : {
        "org.elasticsearch.xpack.ml.process.logging" : "DEBUG"
    }
}

Then try the test again and send us every log message from around the time of the test that is from the org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler logger.

Hi David, thank you for the quick response.

Yes, exactly on the same day, a few hours after.

Yes, at another time, when trying to create an ML job for log anomalies by Kibana itself, this error happens. See below:

[2021-01-05T07:19:30,757][ERROR][org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory] [instance-0000000002] Failed to connect to autodetect for job kibana-logs-ui-default-default-log-entry-rate
[2021-01-05T07:19:31,181][WARN ][org.elasticsearch.persistent.AllocatedPersistentTask] [instance-0000000002] task job-kibana-logs-ui-default-default-log-entry-rate failed with an exception
org.elasticsearch.ElasticsearchException: Failed to connect to autodetect for job kibana-logs-ui-default-default-log-entry-rate
        at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:55) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:105) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.create(AutodetectProcessManager.java:540) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.createProcessAndSetRunning(AutodetectProcessManager.java:491) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.access$200(AutodetectProcessManager.java:97) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager$2.doRun(AutodetectProcessManager.java:412) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: java.io.FileNotFoundException: /tmp/elasticsearch-6577522640637707193/autodetect_kibana-logs-ui-default-default-log-entry-rate_log_192 (No such file or directory)
        at java.io.FileInputStream.open0(Native Method) ~[?:?]
        at java.io.FileInputStream.open(FileInputStream.java:211) ~[?:?]
        at java.io.FileInputStream.<init>(FileInputStream.java:153) ~[?:?]
        at java.io.FileInputStream.<init>(FileInputStream.java:108) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
        at java.security.AccessController.doPrivileged(AccessController.java:312) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
        at org.elasticsearch.xpack.ml.process.ProcessPipes.connectLogStream(ProcessPipes.java:139) ~[?:?]
        at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:86) ~[?:?]
        at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:137) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:95) ~[?:?]
        ... 9 more
[2021-01-05T07:19:46,927][WARN ][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [controller/363] [CDetachedProcessSpawner.cc@193] Child process with PID 406 has exited with exit code 1
[2021-01-05T07:23:58,443][INFO ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [instance-0000000002] [gc][1866] overhead, spent [302ms] collecting in the last [1s]

I tried to enable a high level of log, however it presents the following error

"type" : "illegal_argument_exception",
"reason" : "transient setting [org.elasticsearch.xpack.ml.process.logging], not recognized"
1 Like

Sorry, I missed off the logger bit. I should have been:

PUT _cluster/settings
{
    "transient" : {
        "logger.org.elasticsearch.xpack.ml.process.logging" : "DEBUG"
    }
}

[2021-01-05T07:19:46,927][WARN ][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [controller/363] [CDetachedProcessSpawner.cc@193] Child process with PID 406 has exited with exit code 1

OK, that narrows it down a bit, but sadly not much. These problems are very hard to solve because the autodetect process logging goes via Elasticsearch so if for some reason autodetect exits or crashes before it connects to Elasticsearch then there is no logging to say why. I will think of some possible reasons for the failure and get back to you.

The same process with debug level mode:

[2021-01-05T14:03:41,912][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] Opening job [test]
[2021-01-05T14:03:42,021][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] [test] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-01-05T14:03:54,260][ERROR][org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory] [instance-0000000002] Failed to connect to autodetect for job test
[2021-01-05T14:03:55,007][WARN ][org.elasticsearch.persistent.AllocatedPersistentTask] [instance-0000000002] task job-test failed with an exception
org.elasticsearch.ElasticsearchException: Failed to connect to autodetect for job test
        at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:55) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:105) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.create(AutodetectProcessManager.java:540) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.createProcessAndSetRunning(AutodetectProcessManager.java:491) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.access$200(AutodetectProcessManager.java:97) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager$2.doRun(AutodetectProcessManager.java:412) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: java.io.FileNotFoundException: /tmp/elasticsearch-6577522640637707193/autodetect_test_log_192 (No such file or directory)
        at java.io.FileInputStream.open0(Native Method) ~[?:?]
        at java.io.FileInputStream.open(FileInputStream.java:211) ~[?:?]
        at java.io.FileInputStream.<init>(FileInputStream.java:153) ~[?:?]
        at java.io.FileInputStream.<init>(FileInputStream.java:108) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
        at java.security.AccessController.doPrivileged(AccessController.java:312) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
        at org.elasticsearch.xpack.ml.process.ProcessPipes.connectLogStream(ProcessPipes.java:139) ~[?:?]
        at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:86) ~[?:?]
        at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:137) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:95) ~[?:?]
        ... 9 more
[2021-01-05T14:04:00,321][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [controller/363] [CDetachedProcessSpawner.cc@291] Spawned './autodetect' with PID 411
[2021-01-05T14:04:10,443][WARN ][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [controller/363] [CDetachedProcessSpawner.cc@193] Child process with PID 411 has exited with exit code 1

In addition, I had to do a zero reinstallation of the ECE due to a disk problem on my server. Before it was working perfectly. I followed that documentation.

The only difference from the prerequisites is the Docker Version:

Client: Docker Engine - Community
 Version:           20.10.1
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        831ebea
 Built:             Tue Dec 15 04:34:30 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          19.03.14
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       5eb3275d40
  Built:            Tue Dec  1 19:18:24 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Hello Robert,
I would like to know if you were able to analyze what may be happening here.

One possibility here is that the machine is running so slowly that it takes more than 10 seconds for the autodetect process to create the logging named pipe after startup. We have seen that happen when a virus checker is installed that slows down process startup a lot. It could possibly also happen if the allocator is completely overloaded by any process though. There is a setting xpack.ml.process_connect_timeout that could be increased from its default of 10s to something much larger like 100s to see if something like this is happening.

Do you have enough access to the ECE allocator to check what's on the Docker container disks when this problem occurs? If you do it would be interesting to know if what is missing is just the file /tmp/elasticsearch-6577522640637707193/autodetect_test_log_192 or the whole directory /tmp/elasticsearch-6577522640637707193.

Should I change the elasticsearch.yml settings for the elasticsearch node or the dedicated ML node? As the installation was done via ECE I need a dedicated ML node.

There was no such directory. I created it with permissions for the elastic user, as well as the file manually and the error persists.

Yes I have access.

I made the persistent configurations in the cluster settings, as below

PUT /_cluster/settings
{
    "persistent" : {
        "xpack.ml.process_connect_timeout" : "100s"
    }
}

"xpack" : {
      "monitoring" : {
        "collection" : {
          "enabled" : "true",
          "interval" : "10s"
        },
        "exporters" : {
          "__no-default-local__" : {
            "type" : "local",
            "enabled" : "false"
          },
          "found-user-defined" : {
            "host" : [
              "http://containerhost:9244"
            ],
            "headers" : {
              "x-found-cluster" : "c7b804309b664eb9a56a6433be6fed34"
            },
            "type" : "http",
            "enabled" : "true"
          }
        }
      },
      "ml" : {
        "max_model_memory_limit" : "2772MB",
        "process_connect_timeout" : "100s",
        "max_machine_memory_percent" : "71"
      }
    }

Logs when recreating the ML job:

[2021-01-11T18:21:28,390][INFO ][org.elasticsearch.common.settings.ClusterSettings] [instance-0000000002] updating [xpack.ml.process_connect_timeout] from [10s] to [100s]

[2021-01-11T18:23:05,547][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] Opening job [test]
[2021-01-11T18:23:05,583][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] [test] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-01-11T18:23:23,952][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [controller/363] [CDetachedProcessSpawner.cc@291] Spawned './autodetect' with PID 438
[2021-01-11T18:23:24,025][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CLogger.cc@329] Logger is logging to named pipe /tmp/elasticsearch-6577522640637707193/autodetect_test_log_192
[2021-01-11T18:23:24,026][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [Main.cc@161] autodetect (64 bit): Version 7.10.1 (Build 11e1ac84105757) Copyright (c) 2020 Elasticsearch BV
[2021-01-11T18:23:24,026][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CProcessPriority_Linux.cc@33] Successfully increased OOM killer adjustment via /proc/self/oom_score_adj
[2021-01-11T18:23:24,099][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CSystemCallFilter_Linux.cc@141] Seccomp BPF filters available
[2021-01-11T18:23:24,100][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CSystemCallFilter_Linux.cc@167] Seccomp BPF installed
[2021-01-11T18:23:24,132][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CLimits.h@100] Using default value (1000000) for unspecified setting anomaly.maxtimebuckets
[2021-01-11T18:23:24,133][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CLimits.h@100] Using default value (3.5) for unspecified setting results.unusualprobabilitythreshold
[2021-01-11T18:23:24,176][INFO ][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CResourceMonitor.cc@74] Setting model memory limit to 11 MB
[2021-01-11T18:23:24,176][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CFieldConfig.cc@192] Reading config file /tmp/elasticsearch-6577522640637707193/fieldconfig9361715353264513222.conf
[2021-01-11T18:23:24,177][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CAnomalyDetectorModelConfig.cc@376] Reading model plot config file /tmp/elasticsearch-6577522640637707193/modelplotconfig14479936429081837338.conf
[2021-01-11T18:23:24,177][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CLengthEncodedInputParser.cc@48] Length encoded input parser input is not connected to stdin
[2021-01-11T18:23:24,177][DEBUG][org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler] [instance-0000000002] [test] [autodetect/438] [CCmdSkeleton.cc@30] No restoration source specified - will not attempt to restore state
[2021-01-11T18:23:24,691][INFO ][org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager] [instance-0000000002] Successfully set job state to [opened] for job [test]
[2021-01-11T18:23:29,274][INFO ][org.elasticsearch.xpack.ml.datafeed.DatafeedJob] [instance-0000000002] [test] Datafeed started (from: 1970-01-01T00:00:00.000Z to: 2021-01-11T18:23:03.252Z) with frequency [900000ms]

It seems that everything is normalized now. it was possible to create other ML jobs and get results.

Should I change the elasticsearch.yml settings for the elasticsearch node or the dedicated ML node? As the installation was done via ECE I need a dedicated ML node.

I made the persistent configurations in the cluster settings, as below

Yes, that's the correct way to do it. As you found out, xpack.ml.process_connect_timeout is a dynamic cluster-wide setting. Any changes to the default should ideally be in cluster state, not a config file.

It seems that everything is normalized now.

That is interesting. It means that ML jobs are very slow to start up on this allocator for some reason. It implies that the time taken to get from running the autodetect program to executing this line of code is greater than 10 seconds. Usually, it would only be a few milliseconds. I suspect the long delay is not actually in the code of the main() function, but in some other OS activity that takes place before main() is called.

Some things we have observed to cause such long startup times are:

  1. When starting the process involves accessing a remote disk that stalls for a long time, for example an AWS EBS volume where a block that gets accessed has to be copied from a snapshot on first access leading to a very long stall
  2. Virus checkers that scan the autodetect process on startup, and access the internet when they do this
  3. Very heavily overloaded machines, where for example a lot of swapping is taking place causing disk thrashing

There might be a completely different reason in your case. But it's worth being aware that this unexpected slowness might be a sign of some other problem you will suffer from in the future.

Thank you very much for your help David!

Everything was very well understood and clarified.

Something I remembered, I configured the Elastic Agent to enable the collection of the Endpoint. Do you believe that something may be happening here too?

Something I remembered, I configured the Elastic Agent to enable the collection of the Endpoint. Do you believe that something may be happening here too?

Yes, that could be the reason. We found on one of our internal QA systems that enabling file events in Elastic Agent resulted in the machine getting slowed down enormously when there was heavy activity on the Elasticsearch data directory. This was on a machine hosting a single node, but I imagine it would be even worse for an ECE allocator, as that's then potentially hosting many Elasticsearch data directories. We disabled file events for Agent in Fleet to resolve this.