Cluster constantly crashing after upgrade to 7.4

frederikwerner · October 7, 2019, 6:59pm

Hey guys,

I recently upgraded my 3-node ES cluster to version 7.4. This cluster, which has been running forever (literally years) without any real problems, has been falling apart ever since. Nothing besides the ES version has changed: each node runs Ubuntu 18.04 with Java 11 and has 6 GB of RAM (half of that allocated to ES), and there is only 10 GB of data in ~70 indices (~140 shards). Data intake and usage haven't changed. But since the update, the cluster only stays up for a few hours, after which the nodes die one after the other, all with a stack trace like https://pastebin.com/6Wqqxg6r.

Since the trace points to an out-of-memory problem, I tried deleting indices, playing around with GC settings, allocating both more and less memory to ES, and locking memory according to https://www.elastic.co/guide/en/elasticsearch/reference/7.4/setting-system-settings.html#systemd. Nothing changed! I can't get my beloved cluster into a stable state. Higher intake volume seems to crash the nodes faster, but even just turning X-Pack monitoring on is seemingly too much and kills one node after the other.

Another strange observation: while I can still delete indices, force merge just does nothing. I tried to merge segments in an attempt to give ES some air, but the command returns instantly and leaves the deleted docs untouched. There is nothing in the logs about it.
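For anyone trying to verify the same symptom, here is a minimal Python sketch of how one might check whether a force merge actually removed deleted docs. It only builds the request URL for the force merge API (with `only_expunge_deletes`) and computes the share of deleted docs from the JSON output of `GET _cat/indices?format=json&h=index,docs.count,docs.deleted`. The function names, the `localhost:9200` address, and the `logs-1` index are hypothetical and not taken from this thread.

```python
import json
from urllib.parse import urlencode


def forcemerge_url(base_url: str, index: str, only_expunge_deletes: bool = True) -> str:
    """Build the URL for a POST to the force merge API of one index."""
    query = urlencode({"only_expunge_deletes": str(only_expunge_deletes).lower()})
    return f"{base_url.rstrip('/')}/{index}/_forcemerge?{query}"


def deleted_ratio(cat_indices_json: str) -> dict:
    """Map each index to its fraction of deleted docs.

    Expects the body returned by
    GET _cat/indices?format=json&h=index,docs.count,docs.deleted
    where the counts arrive as strings (or null for closed indices).
    """
    ratios = {}
    for row in json.loads(cat_indices_json):
        count = int(row.get("docs.count") or 0)
        deleted = int(row.get("docs.deleted") or 0)
        total = count + deleted
        ratios[row["index"]] = deleted / total if total else 0.0
    return ratios


# Hypothetical sample data, only to show the shape of the input.
sample = '[{"index": "logs-1", "docs.count": "900", "docs.deleted": "100"}]'
print(forcemerge_url("http://localhost:9200", "logs-1"))
print(deleted_ratio(sample))
```

Comparing the ratio before and after sending the POST request would show whether the merge had any effect.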

After a weekend of desperation, I am at my wits' end and have no idea what to try next. Any help would be highly appreciated!

Best,
Frederik

Armin_Braun · October 9, 2019, 7:23am

Hi @frederikwerner

Let's start with the networking/memory issue:

Could you share the contents of your jvm.options file so I can take a look?
It looks like you might just be missing the following line in there. It is included by default in the 7.4 jvm.options, but it will be missing if you reused the file from a previous version:

-Dio.netty.allocator.numDirectArenas=0

Thanks!
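For readers who want to check their own nodes, the following is a minimal Python sketch that tests whether a jvm.options file contains the flag above on an active (non-comment) line. The function name `has_netty_flag` is hypothetical, and the handling of Java version prefixes such as `8-:` is an assumption about the jvm.options syntax, so treat this as a sketch rather than an official check.

```python
NETTY_FLAG = "-Dio.netty.allocator.numDirectArenas=0"


def has_netty_flag(jvm_options_text: str) -> bool:
    """Return True if the Netty flag appears on an active line."""
    for raw in jvm_options_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Assumption: a line may start with a Java version prefix
        # such as "8:", "8-:" or "8-9:" before the actual option.
        if line[0].isdigit() and ":" in line:
            line = line.split(":", 1)[1].strip()
        if line == NETTY_FLAG:
            return True
    return False


# Hypothetical file content reused from an older version (flag missing).
old_style = "-Xms3g\n-Xmx3g\n"
print(has_netty_flag(old_style))
print(has_netty_flag(old_style + NETTY_FLAG + "\n"))
```

To run it against a real node, read the file first, for example with `pathlib.Path(...).read_text()`; the file location depends on how Elasticsearch was installed.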

frederikwerner · October 11, 2019, 2:55pm

Hi @Armin_Braun,

Thank you for your answer! I just checked and indeed, that line was missing. I have added it, and things look promising. I will check back in a couple of days and post an update.

Best,
Frederik

frederikwerner · October 22, 2019, 6:10pm

Hi @Armin_Braun,

After a few days, the stack remains stable and performs well. Thank you very much for the solution! I feel like this hint should be included in the breaking changes section of the ES release notes, since others might fall into this trap like I did.

Best,
Frederik

Armin_Braun · October 22, 2019, 6:23pm

@frederikwerner thanks for the feedback and for bringing this to our attention in the first place, much appreciated! And you're right, we should and will improve our upgrade experience here. We are working on it in https://github.com/elastic/elasticsearch/pull/47782 and related issues right now.

system · November 19, 2019, 6:23pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
