Disabling machine learning does not allow Elasticsearch to stop

I did a fresh install of ES 5.4.1 on Ubuntu 16.04 and now I cannot stop the service gracefully.
I found that this occurs when I install X-Pack and explicitly disable the machine learning feature during the trial period.

Has anyone had this problem too?

Steps:

  1. OS: Ubuntu 16.04 (I have not tested on other OSes)
  2. Install ES 5.4.1 from the deb package (single-node cluster)
  3. Install X-Pack
  4. Do not install any license (uses trial)
  5. Disable machine learning in /etc/elasticsearch/elasticsearch.yml -> xpack.ml.enabled: false
  6. Start the ES service (service elasticsearch start) and wait for yellow or green status (see the health check below)
  7. Stop the ES service (service elasticsearch stop) -> the service hangs
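A quick way to check the cluster status in step 6 (assuming the default HTTP port 9200 on localhost):

curl -s 'http://localhost:9200/_cluster/health?pretty'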

Thank you for any help

Hi,

We did see a problem similar to this during internal testing of 5.4.1, but fixed it before the final release.

Is the Elasticsearch log for this scenario small enough to add to a comment? If so it may shed some light on what's going wrong. I'm particularly interested in where messages relating to the controller process appear in relation to other messages, but if you could just paste the entire log that would probably be easiest.

DEBUG log if you need details: https://pastebin.com/Gt2qjPQ1

INFO log:

[2017-06-09T15:27:33,796][INFO ][o.e.n.Node ] [mynode1] initializing ...
[2017-06-09T15:27:33,937][INFO ][o.e.e.NodeEnvironment ] [mynode1] using [1] data paths, mounts [[/ (/dev/mapper/ubuntu--vg-root)]], net usable_space [353.8gb], net total_space [449.5gb], spins? [possibly], types [ext4]
[2017-06-09T15:27:33,937][INFO ][o.e.e.NodeEnvironment ] [mynode1] heap size [123.7mb], compressed ordinary object pointers [true]
[2017-06-09T15:27:33,949][INFO ][o.e.n.Node ] [mynode1] node name [mynode1], node ID [USuqs9gETX-4iASrWEKZWw]
[2017-06-09T15:27:33,950][INFO ][o.e.n.Node ] [mynode1] version[5.4.1], pid[21422], build[2cfe0df/2017-05-29T16:05:51.443Z], OS[Linux/4.4.0-79-generic/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_131/25.131-b11]
[2017-06-09T15:27:33,950][INFO ][o.e.n.Node ] [mynode1] JVM arguments [-Xms128m, -Xmx128m, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/usr/share/elasticsearch]
[2017-06-09T15:27:35,393][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [aggs-matrix-stats]
[2017-06-09T15:27:35,393][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [ingest-common]
[2017-06-09T15:27:35,393][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [lang-expression]
[2017-06-09T15:27:35,394][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [lang-groovy]
[2017-06-09T15:27:35,394][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [lang-mustache]
[2017-06-09T15:27:35,394][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [lang-painless]
[2017-06-09T15:27:35,394][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [percolator]
[2017-06-09T15:27:35,394][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [reindex]
[2017-06-09T15:27:35,394][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [transport-netty3]
[2017-06-09T15:27:35,394][INFO ][o.e.p.PluginsService ] [mynode1] loaded module [transport-netty4]
[2017-06-09T15:27:35,395][INFO ][o.e.p.PluginsService ] [mynode1] loaded plugin [x-pack]
[2017-06-09T15:27:37,609][INFO ][o.e.d.DiscoveryModule ] [mynode1] using discovery type [zen]
[2017-06-09T15:27:38,413][INFO ][o.e.n.Node ] [mynode1] initialized
[2017-06-09T15:27:38,413][INFO ][o.e.n.Node ] [mynode1] starting ...
[2017-06-09T15:27:38,648][INFO ][o.e.t.TransportService ] [mynode1] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}
[2017-06-09T15:27:41,726][INFO ][o.e.c.s.ClusterService ] [mynode1] new_master {mynode1}{USuqs9gETX-4iASrWEKZWw}{UhLkCZZsRJCHuI57-BLSQA}{localhost}{127.0.0.1:9300}{rack=plusoft}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-06-09T15:27:41,755][INFO ][o.e.h.n.Netty4HttpServerTransport] [mynode1] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2017-06-09T15:27:41,758][INFO ][o.e.n.Node ] [mynode1] started
[2017-06-09T15:27:42,763][INFO ][o.e.l.LicenseService ] [mynode1] license [f6095cac-721e-4993-8d87-0e409cb2a917] mode [trial] - valid
[2017-06-09T15:27:42,785][INFO ][o.e.g.GatewayService ] [mynode1] recovered [8] indices into cluster_state
[2017-06-09T15:27:46,055][INFO ][o.e.c.r.a.AllocationService] [mynode1] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[.monitoring-es-2-2017.06.09][0]] ...]).
[2017-06-09T15:27:55,745][INFO ][o.e.n.Node ] [mynode1] stopping ...

Thanks for posting the log.

It's interesting there's no mention at all of the ML controller process starting. There are two likely reasons for this:

  1. Missing system libraries
  2. File permissions

Please can you try to run this command and see what happens:

/usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller --version

First try as user root, then as user elasticsearch (or whatever OS user Elasticsearch runs as).
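For example, something along these lines (assuming the service runs as the default elasticsearch user created by the deb/rpm package):

sudo /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller --version
sudo -u elasticsearch /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller --version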

If it works as root but not as elasticsearch then it's likely to be a permissions problem. There was a problem in 5.4.0 where the permissions were not set correctly on the ML native program directories if the install was done with a restrictive umask, for example 0077. I thought this was fixed in 5.4.1, but maybe there is still some situation in which it can occur. If the problem is permissions it can be fixed using this command:

sudo find /usr/share/elasticsearch/plugins -type d -exec chmod a+rx '{}' ';'
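To see which directory along that path is blocking access, something like this can be run first (namei is part of util-linux; shown only as an illustration):

namei -l /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller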

I have the same problem with X-Pack 5.4.1

Installed on 9th of June
OS: CentOS 7
ES 5.4.1 (two nodes - one master, one data - it is our testing env)

The file permissions are fine in our case (both elasticsearch and root have access); we get this output on both nodes:

controller (64 bit): Version 5.4.1 (Build e77baac2f04897) Copyright (c) 2017 Elasticsearch BV

We also set machine learning to true on both machines:

xpack.graph.enabled: false
xpack.ml.enabled: true
xpack.monitoring.enabled: true
xpack.security.enabled: false
xpack.watcher.enabled: false

It is stuck on stopping Elasticsearch (when I do a restart, I need to kill Elasticsearch from Linux, and then it starts correctly again).

When I restart one of the nodes (it hangs), the other node's log starts to show errors:

[2017-06-12T11:28:17,162][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-data-2][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-data-2][0]] containing [2] requests]]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:131) ~[?:?]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_131]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_131]
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_131]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_131]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_131]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_131]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:132) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) [elasticsearch-5.4.1.jar:5.4.1]

and at the end of the log

[2017-06-12T11:28:17,165][WARN ][o.e.x.m.MonitoringService] [elastic-a] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:106) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:104) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:217) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:211) ~[?:?]
	at org.elasticsearch.xpack.common.IteratingActionListener.onResponse(IteratingActionListener.java:108) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:175) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:67) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:138) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:797) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:862) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:699) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:653) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:816) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:311) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:238) [elasticsearch-5.4.1.jar:5.4.1]

How can we stop this from happening? We cannot update our production cluster until this is corrected.

Hi @droberts195, both root and elasticsearch users show the same result:
controller (64 bit): Version 5.4.1 (Build e77baac2f04897) Copyright (c) 2017 Elasticsearch BV

I ran the chmod command and it did not help either.

My problem is a bit different from what was reported by @nan008. The service does not stop when I explicitly add this line to the settings:
xpack.ml.enabled: false

@yoshioiwamoto we tried both xpack.ml.enabled: false and xpack.ml.enabled: true, and in both cases we get the hang on stopping Elasticsearch. Is yours only failing to stop when it is false? Should I open another ticket then?

@nan008 I'm pretty sure your problem is different to the one @yoshioiwamoto has. However, for both of you the next step is the same, which is to get a stack trace from the JVM that has hung.

You can do this using the jstack utility, which comes with the JDK. Reproduce the hang, find the PID of the hung JVM (by using ps or looking for the pid[21422] bit in the log file), then run:

$JAVA_HOME/bin/jstack 1234 > stacktrace.log

replacing 1234 with the correct PID of the JVM. jstack needs to be run either as the same OS user that Elasticsearch runs as, or as root.
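As a rough sequence (a sketch; the PID and JAVA_HOME will differ on your system):

ps -ef | grep elasticsearch   # note the PID of the java process
sudo -u elasticsearch $JAVA_HOME/bin/jstack 1234 > stacktrace.log   # 1234 = that PID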

@nan008 - in your case the node to run this on is the one that is hung, which sounds like it is not the node that prints the monitoring exceptions but the other node.

This is the stacktrace.log, but I don't know if I ran the command at the right moment. I ran it after trying to stop the service.

https://pastebin.com/FQCfnnGs

Thanks @yoshioiwamoto - the stack trace narrows down the problem. There are two more things you could do that would be useful for working out what's going wrong.

Firstly, while Elasticsearch is running, but before you shut it down, can you also see a process called controller running? In other words, what does ps -ef | grep controller return while Elasticsearch is running?

Secondly, open 3 terminal windows. In the first terminal run this:

/usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller

When you run it with no arguments it should just wait. Leave it running, move to the next terminal and run:

ls -ltr /tmp

Do you see a recent entry like this one?

prw-------  1 dave  dev       0 Jun 12 15:24 controller_log_908

The number at the end will be different, and obviously the owner too, but it should have those permissions and the name should begin with controller_log_. If you do see such a file, tail it like this:

tail -f /tmp/controller_log_908

Replace the 908 with the correct number for the file you see.

Next, move to the third terminal and run this again:

ls -ltr /tmp

Now you should see a recent file that looks similar to this:

prw-------  1 dave  dev       0 Jun 12 15:24 controller_command_908

If you do, then run this:

yes > /tmp/controller_command_908

Once again replace the 908 with the number you see.

Now look back to the first terminal. Is the controller process still running?

Now go back to the terminal running yes. Use CTRL+C to stop it. Now look back to the first terminal. The controller process should have stopped.

Finally, look in the terminal running tail. It should have printed a number of error messages. These are due to controller not understanding the output of yes and are not relevant to the problem - just CTRL+C the tail command.

If something different happens for you then please let me know what you get.

Here is mine...

Both nodes are hanging when restarted while x-pack is on.

This is node b (data node)

Stacktrace

I also have the controller logs in /tmp

Should I tail today's one?

Thanks @nan008. The stack trace shows that your problem is basically the same as the one @yoshioiwamoto has got.

There's no benefit in tailing any of those files that already exist in your /tmp directory, but if you could go through the process described earlier in this thread and let me know the outcome, that would be great.

One further question: for the run where you obtained the stack trace, was machine learning enabled or disabled?

service elasticsearch start

ps -ef | grep controller

elastic+ 11971 11922 0 13:18 ? 00:00:00 /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller

I stopped the ES service and ran only the controller program, but it did not create any new file in /tmp. So I ran ES and the controller in the same sequence as you described. I checked that the file was created only by the ES service; running only the controller program does not create it.
To be clear, I ran the ES service AND the controller program, so there are two "controller" processes running.

ls -ltr /tmp

prw------- 1 elasticsearch elasticsearch 0 Jun 12 13:18 controller_log_11922

Using the tail command makes the "controller_log" file disappear and a "controller_command" file appear.

ls -ltr /tmp

prw------- 1 elasticsearch elasticsearch 0 Jun 12 13:23 controller_command_11922

yes > /tmp/controller_command_11922

Result: the controller processes are still running (one from the ES service and the other from my root shell).

ps -ef | grep controller

elastic+ 11971 11922 13 13:18 ?        00:01:06 /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller
root     12059 11105  0 13:20 pts/18   00:00:00 /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller

When I stopped the "yes", only the ES controller was terminated.

tail -f /tmp/controller_log_11922

{"logger":"controller","timestamp":1497284856695,"level":"ERROR","pid":11971,"thread":"0x7f018a848780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497284856695,"level":"ERROR","pid":11971,"thread":"0x7f018a848780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497284856695,"level":"ERROR","pid":11971,"thread":"0x7f018a848780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497284856695,"level":"ERROR","pid":11971,"thread":"0x7f018a848780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497284856695,"level":"ERROR","pid":11971,"thread":"0x7f018a848780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497284856695,"level":"ERROR","pid":11971,"thread":"0x7f018a848780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497284856695,"level":"ERROR","pid":11971,"thread":"0x7f018a848780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497284856695,"level":"ERROR","pid":11971,"thread":"0x7f018a848780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497284856695,"level":"ERROR","pid":11971,"thread":"0x7f018a848780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497284856695,"level":"INFO","pid":11971,"thread":"0x7f018a848780","message":"Ml controller exiting","method":"main","file":"Main.cc","line":168}
tail: cannot watch 'controller_log_11922': No such file or directory
tail: inotify cannot be used, reverting to polling

For my stack trace, ML was enabled.

I have the same problem as @yoshioiwamoto.

When I run both ES and the controller, there is only the ES controller_log file.

The corresponding controller log

When I restart ES while the other controller is running, the controller_log is not created at all.

When I start the controller command while ES is running, the controller_log for ES also disappears. Sometimes it is not created in /tmp at all.

I managed to complete the steps you gave only once, @droberts195, the same way that @yoshioiwamoto did, and the controller was still running for root but not for ES.

{"logger":"controller","timestamp":1497348109877,"level":"ERROR","pid":19712,"thread":"0x7f4e74d90780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497348109877,"level":"ERROR","pid":19712,"thread":"0x7f4e74d90780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497348109877,"level":"ERROR","pid":19712,"thread":"0x7f4e74d90780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497348109877,"level":"ERROR","pid":19712,"thread":"0x7f4e74d90780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497348109877,"level":"ERROR","pid":19712,"thread":"0x7f4e74d90780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497348109877,"level":"ERROR","pid":19712,"thread":"0x7f4e74d90780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497348109877,"level":"ERROR","pid":19712,"thread":"0x7f4e74d90780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497348109877,"level":"ERROR","pid":19712,"thread":"0x7f4e74d90780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497348109877,"level":"ERROR","pid":19712,"thread":"0x7f4e74d90780","message":"Did not understand verb 'y'","class":"ml","method":"controller::CCommandProcessor::handleCommand","file":"CCommandProcessor.cc","line":88}
{"logger":"controller","timestamp":1497348109877,"level":"INFO","pid":19712,"thread":"0x7f4e74d90780","message":"Ml controller exiting","method":"main","file":"Main.cc","line":168}
tail: cannot watch ‘/tmp/controller_log_19669’: No such file or directory

EDIT: I completed the steps with ML disabled, and it looks like I have the same problem as the OP. When I enable ML, restarting, stopping and starting all work. When I disable ML, it gets stuck on stopping (or on the stop phase of a restart). When I kill it, ES ends up in an active state again, so starting is not a problem with ML enabled or disabled; ONLY stopping is the issue.

@yoshioiwamoto and @nan008 thank you for spending so much time helping to debug this.

I think I have worked out what's going wrong. There is a danger of this hang when the JVM process group is killed rather than just the process, and this is what systemctl stop does by default. I cannot reproduce it if ML is enabled, but with ML disabled it happens over half the time for me when on a systemd system. Also, it doesn't happen on older systems that use init.d scripts nor if you just run elasticsearch manually.

Anyway, I think the fix is this. As root, edit /usr/lib/systemd/system/elasticsearch.service and immediately after this existing line:

KillSignal=SIGTERM

add this new line:

KillMode=process

Save the changed file, and run this:

sudo systemctl daemon-reload

Then try to start and stop Elasticsearch again.

I can never reproduce the problem after making this change. Please can you try it and let me know if you still have problems. I'll raise a PR to get this change into a release and also make some other robustness improvements in the X-Pack code.
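If you would rather not edit the packaged unit file (it can be overwritten on upgrade), the same setting can also be applied as a systemd drop-in. This is just a sketch, and the override file name is only an example:

sudo mkdir -p /etc/systemd/system/elasticsearch.service.d
printf '[Service]\nKillMode=process\n' | sudo tee /etc/systemd/system/elasticsearch.service.d/override.conf
sudo systemctl daemon-reload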

Thanks ever so much. The service now stops. These are the last lines in the log file.

[2017-06-13T11:01:31,673][INFO ][o.e.n.Node               ] [mynode1] stopping ...
[2017-06-13T11:01:31,729][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/8176] [Main.cc@128] controller (64 bit): Version 5.4.1 (Build e77baac2f04897) Copyright (c) 2017 Elasticsearch BV
[2017-06-13T11:01:31,729][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/8176] [Main.cc@168] Ml controller exiting
[2017-06-13T11:01:31,731][INFO ][o.e.x.m.j.p.NativeController] Native controller process has stopped - no new native processes can be started
[2017-06-13T11:01:32,703][INFO ][o.e.n.Node               ] [mynode1] stopped
[2017-06-13T11:01:32,703][INFO ][o.e.n.Node               ] [mynode1] closing ...
[2017-06-13T11:01:32,710][INFO ][o.e.n.Node               ] [mynode1] closed

Mine is stopping too, after I added the line to both nodes. Thank you for that

When will you release the updated version of both ES and X-Pack? We will need to apply this manually on all 15 nodes in prod, and it would be great to know whether we will still need to worry about this config in the near future.

Did the X-Pack package change last week for ES 5.4.1?

@nan008 the fix is https://github.com/elastic/elasticsearch/pull/25195 which is going to be released in version 5.4.2. I cannot say exactly when this will be released, but hopefully it won't be that far off.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.