Cluster failing at random, works again after restarting node

Recently I have been having trouble keeping an Elasticsearch cluster working: it seems to fail at some point during the night. All queries submitted to the cluster result in timeouts, so it appears the nodes 'hang' in some state and no longer propagate requests.

After a manual restart of a (single) node the cluster starts working again. The logs are very 'scattered' and do not point me to a cause directly. I have consulted my hosting provider, and they indicate there have not been any recent hardware changes; our cluster has been running on the same hardware for about half a year now.

This behaviour started to show after we did a rolling upgrade from 6.2.4 to 6.3.0, though not immediately afterwards (it began a couple of days later).

Our nodes (we have 3 of them, so it is a fairly small cluster) occupy about 50% of the allocated heap, and there is no sign of running out of memory. The same goes for storage: about 50% is used. The nodes idle at around 20% CPU usage, so again no indication that we need more hardware.
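For completeness, those figures can be read off the cat APIs; something along these lines (host and port are just the defaults for our setup) shows per-node heap, CPU and disk usage:

    # per-node heap, RAM and CPU usage
    curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu'
    # per-node disk usage
    curl -s 'localhost:9200/_cat/allocation?v'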

I am basically stuck here and would greatly appreciate some pointers on where to look to resolve this issue. I can supply logs if needed.

Thanks,
Stephan

Do you have ML enabled? If so, it's possible that this is #31683, which manifests as occasional "stuck" clusters triggered by the daily ML maintenance task that runs early in the morning. The detailed diagnosis is in the GitHub issue if you fancy checking that it is this and not something else. If it is, the fix will be in 6.3.1, but disabling ML is a possible workaround in the meantime.
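If you want a quick check the next time it hangs, something roughly like this (a sketch only; the full diagnosis steps are in the issue) will show whether cluster state updates are piling up and what the elected master is busy doing:

    # a long and growing list of pending tasks suggests the master is stuck
    curl -s 'localhost:9200/_cluster/pending_tasks?pretty'
    # hot threads on the elected master
    curl -s 'localhost:9200/_nodes/_master/hot_threads'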

Thanks for your reply! I tried to disable ML by setting node.ml: false in the config file, but that alone did not resolve the issue. Today I also added the xpack.ml.enabled: false setting to make sure ML is turned off, but of course I do not yet know whether that helps.

Do I need to specify both flags, or can we conclude that ML was already disabled?

I think it's the latter setting, xpack.ml.enabled: false, which is crucial. Disabling ML on individual nodes isn't enough, because the problematic task is not an ML job and it runs on the master.
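Concretely, something like this in elasticsearch.yml on each node (keeping it consistent across the cluster) should do it; it's a static setting, so each node needs a restart for it to take effect:

    # disables X-Pack ML entirely, including the daily maintenance task that triggers the bug
    xpack.ml.enabled: false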

OK, then I will keep monitoring whether this resolves the issue and will report back with my findings.

So far so good! The cluster is still responding, so I think it did the trick. I see that the fix has already been made (#31691), so I will upgrade to 6.3.1 as soon as it is available. Thanks for your help @DavidTurner!


You're welcome @sromer. 6.3.1 is out.
