Recently I have been having trouble keeping an Elasticsearch cluster working; it seems to fail at some point during the night. All queries submitted to the cluster result in timeouts, so it appears the nodes 'hang' in some state and no longer propagate requests.
After a manual restart of a (single) node the cluster starts working again. The logs are very scattered and do not point me to a cause directly. I have consulted my hosting provider, who indicates there have been no recent hardware changes; our cluster has been running on the same hardware for about half a year now.
This behaviour started to show after we did a rolling upgrade from 6.2.4 to 6.3.0, but not immediately afterwards (it began a couple of days later).
Our nodes (we have 3 of them, so it is a fairly small cluster) occupy about 50% of the allocated heap, with no sign of running out of memory. The same goes for storage: about 50% is used. The nodes idle at around 20% CPU usage, so again there is no indication that more hardware is needed.
I am basically stuck here and would greatly appreciate some pointers on where to look to resolve this issue. I can supply logs if needed.
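If more verbose logs would help, I could also raise logging for the cluster and discovery modules along these lines (just a sketch of what I would add to each node's elasticsearch.yml; I have not applied it yet):

```yaml
# Sketch: extra logging I could enable in elasticsearch.yml to capture more
# detail the next time the cluster hangs overnight (not applied yet).
logger.org.elasticsearch.cluster.service: DEBUG
logger.org.elasticsearch.discovery: DEBUG
```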
Do you have ML enabled? If so, it's possible that this is #31683, which manifests as occasional "stuck" clusters and is triggered by the daily ML maintenance task that runs early in the morning. The detailed diagnosis is in the GitHub issue if you fancy checking that it is this and not something else. If it is, the fix will be in 6.3.1, but disabling ML is a possible workaround in the meantime.
Thanks for your reply! I tried to disable ML by setting node.ml: false in the config file, but that alone did not resolve the issue. Today I also added the xpack.ml.enabled: false setting to make sure ML is turned off, but of course I do not yet know whether that helps.
Do I need to specify both settings, or can we conclude from the first one alone that ML was already disabled?
I think it's the latter setting, xpack.ml.enabled: false, that is crucial. Disabling ML on individual nodes isn't enough: the problematic task is not an ML job and it runs on the master.
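For reference, a minimal sketch of the relevant line in each node's elasticsearch.yml (your surrounding settings will of course differ, and each node needs a restart for the change to take effect):

```yaml
# elasticsearch.yml on every node: disable X-Pack machine learning entirely,
# so the daily ML maintenance task should no longer be scheduled.
# Restart each node after adding this.
xpack.ml.enabled: false
```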
So far so good! The cluster is still responding, so I think that did the trick. I see that the fix has already been made (#31691), so I will upgrade to 6.3.1 as soon as it is available. Thanks for your help @DavidTurner!