Recently I have been having trouble keeping an Elasticsearch cluster working; it seems to fail at some point during the night. All queries submitted to the cluster result in timeouts, so it appears the nodes 'hang' in some state and no longer propagate requests.
After a manual restart of a (single) node the cluster starts working again. The logs are very scattered and do not point me to a cause directly. I have consulted my hosting provider, who indicates there have been no recent hardware changes; our cluster has been running on the same hardware for about half a year now.
This behaviour started to show after we did a rolling upgrade from 6.2.4 to 6.3.0, but not immediately afterwards (it began a couple of days later).
Our nodes (we have 3 of them, so it is a fairly small cluster) occupy about 50% of the allocated heap, with no sign of running out of memory. The same goes for storage: about 50% is used. The nodes idle at around 20% CPU usage, so again there is no indication that more hardware is needed.
I am basically stuck here and would greatly appreciate some pointers on where to look to resolve this issue. I can supply logs if needed.
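If more verbose logs would help, I could also raise logging for the cluster and discovery modules along these lines (just a sketch of what I would add to each node's elasticsearch.yml; I have not applied it yet):

```yaml
# Sketch: extra logging I could enable in elasticsearch.yml to capture more
# detail the next time the cluster hangs overnight (not applied yet).
logger.org.elasticsearch.cluster.service: DEBUG
logger.org.elasticsearch.discovery: DEBUG
```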
Do you have ML enabled? If so, it's possible that this is #31683, which manifests as occasional "stuck" clusters and is triggered by the daily ML maintenance task that runs early in the morning. The detailed diagnosis is in the GitHub issue if you fancy checking that it is this and not something else. If it is, the fix will be in 6.3.1, but disabling ML is a possible workaround in the meantime.
Thanks for your reply! I tried to disable ML by setting node.ml: false in the config file, but that alone did not resolve the issue. Today I also added the xpack.ml.enabled: false setting to make sure ML is turned off, but of course I do not yet know whether that helps.
Do I need to specify both settings, or can we conclude from the first one alone that ML was already disabled?
I think it's the latter setting, xpack.ml.enabled: false, that is crucial. Disabling ML on individual nodes isn't enough: the problematic task is not an ML job and it runs on the master.
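For reference, a minimal sketch of the relevant line in each node's elasticsearch.yml (your surrounding settings will of course differ, and each node needs a restart for the change to take effect):

```yaml
# elasticsearch.yml on every node: disable X-Pack machine learning entirely,
# so the daily ML maintenance task should no longer be scheduled.
# Restart each node after adding this.
xpack.ml.enabled: false
```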
So far so good! The cluster is still responding, so I think that did the trick. I see that the fix has already been made (#31691), so I will upgrade to 6.3.1 as soon as it is available. Thanks for your help @DavidTurner!