Cluster constantly crashing after upgrade to 7.4

Hey guys,

I recently upgraded my 3-node ES cluster to version 7.4. This cluster, which had been running for years without any real problems, has been falling apart ever since. Nothing besides the ES version has changed: each node runs Ubuntu 18.04 with Java 11 and has 6 GB of RAM (half of it allocated to ES), and there is only 10 GB of data in ~70 indices (~140 shards). Data intake and usage haven't changed.

Since the update, the cluster only stays up for a few hours, after which the nodes die one after the other, all with a stack trace like https://pastebin.com/6Wqqxg6r. Since the trace points to an out-of-memory problem, I tried deleting indices, playing around with GC settings, allocating both more and less memory to ES, and locking memory according to https://www.elastic.co/guide/en/elasticsearch/reference/7.4/setting-system-settings.html#systemd. Nothing changed! I can't get my beloved cluster into a stable state. Higher intake volume seems to crash the nodes faster, but even just turning on X-Pack monitoring is seemingly too much and kills one node after the other.

Another strange observation: while I can still delete indices, force merge does nothing. I tried to merge segments in an attempt to give ES some air, but the command returns instantly and leaves the deleted docs untouched. There is nothing in the logs about that.

After a weekend of desperation, I am at my wits' end and have no idea what to try next. Any help would be highly appreciated!

Best,
Frederik

Hi @frederikwerner

Let's start with the networking/memory issue:

Could you share the contents of your jvm.options file so I can take a look?
It looks like you might just be missing the following line in there (it is added by default in the 7.4 jvm.options, but if you reused the file from a previous version, it will be missing):

-Dio.netty.allocator.numDirectArenas=0
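
If it helps, here is a minimal Python sketch for checking a jvm.options file for that line. The helper names and the path mentioned in the comment are assumptions for illustration (the path is the usual location for package installs, yours may differ); the only thing taken from this thread is the flag itself.

```python
from pathlib import Path

# The flag from this thread; everything else below is illustrative.
NETTY_FLAG = "-Dio.netty.allocator.numDirectArenas=0"


def has_netty_flag(jvm_options_text: str) -> bool:
    """Return True if the flag appears on an active (non-comment) line."""
    for raw_line in jvm_options_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # jvm.options lines may carry a Java version prefix such as "8-:" or "9-10:".
        if line == NETTY_FLAG or line.endswith(":" + NETTY_FLAG):
            return True
    return False


def check_file(path: str) -> bool:
    """Read a jvm.options file (path is an assumption, e.g.
    /etc/elasticsearch/jvm.options on package installs) and check it."""
    return has_netty_flag(Path(path).read_text(encoding="utf-8"))
```

Calling `check_file(...)` with the path to your own jvm.options returns `False` if the line is missing or only present as a comment.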

Thanks!

Hi @Armin_Braun,

thank you for your answer! I just checked and indeed, that line was missing. I just added it and things look promising. I will check back in a couple of days and update my answer accordingly.

Best,
Frederik

Hi @Armin_Braun,

After a few days, the stack remains stable and performs well. Thank you very much for the solution! I feel like this hint should be included in the breaking changes section of the ES release notes, since others might fall into this trap like I did :)

Best,
Frederik

@frederikwerner thanks for the feedback and bringing this to our attention initially, much appreciated! And you're right, we should and will improve our upgrade experience here. Working on it in https://github.com/elastic/elasticsearch/pull/47782 and related issues right now.
