Cluster constantly crashing after upgrade to 7.4

Hey guys,

I recently upgraded my 3-node ES cluster to version 7.4. This cluster, which has been running forever (literally years) without any real problems, has been falling apart ever since. Nothing besides the ES version has changed: each node runs Ubuntu 18.04 with Java 11 and has 6 GB of RAM (half of that allocated to ES), and there is only 10 GB of data in ~70 indices (~140 shards). Data intake and usage haven't changed. But since the update, the cluster only stays up for a few hours, after which the nodes die one after the other, all with a stack trace like https://pastebin.com/6Wqqxg6r.

Since the trace points to an out-of-memory problem, I tried deleting indices, playing around with GC settings, allocating both more and less memory to ES, and locking memory according to https://www.elastic.co/guide/en/elasticsearch/reference/7.4/setting-system-settings.html#systemd. Nothing changed! I can't get my beloved cluster into a stable state. Higher intake volume seems to crash the nodes faster, but even just turning X-Pack monitoring on is seemingly too much and kills one node after the other.

Another strange observation: while I can still delete indices, forcemerge does nothing. I tried to merge segments in an attempt to give ES some air, but the command returns instantly and leaves the deleted docs untouched. There is nothing in the logs about that.

After a weekend of desperation, I am at my wits' end and have no idea what to try next. Any help would be highly appreciated!

Best,
Frederik

Hi @frederikwerner

Let's start with the networking/memory issue:

Could you share the contents of your jvm.options file please so I can take a look?
It looks like you might just be missing the following line in there (it is added by default in the 7.4 jvm.options, but if you reused the file from a previous version it will be missing):

-Dio.netty.allocator.numDirectArenas=0
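
If you want to check all three nodes quickly, something like the following rough Python sketch could do it. It only looks for the flag above as an active (uncommented) line. The path /etc/elasticsearch/jvm.options is an assumption (the usual location for a DEB package install), and has_flag is just a helper name made up for this example, so adjust both to your setup.

```python
from pathlib import Path

# The flag from the post above.
NETTY_FLAG = "-Dio.netty.allocator.numDirectArenas=0"

# Assumption: default location for a DEB/RPM package install.
JVM_OPTIONS_PATH = Path("/etc/elasticsearch/jvm.options")


def has_flag(options_text: str, flag: str = NETTY_FLAG) -> bool:
    """Return True if `flag` appears as an active line in jvm.options text."""
    for raw in options_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Lines may carry a Java version prefix such as "8:", "8-:" or "8-9:".
        if line[0].isdigit() and ":" in line:
            line = line.split(":", 1)[1].strip()
        if line == flag:
            return True
    return False


if __name__ == "__main__":
    if JVM_OPTIONS_PATH.exists():
        present = has_flag(JVM_OPTIONS_PATH.read_text())
        print(f"{JVM_OPTIONS_PATH}: flag {'present' if present else 'MISSING'}")
    else:
        print(f"{JVM_OPTIONS_PATH} not found, adjust the path for your install")
```

Run it on each node; if the flag is reported missing, add the line to jvm.options and restart that node.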

Thanks!

Hi @Armin_Braun,

thank you for your answer! I just checked and indeed, that line was missing. I just added it and things look promising. I will check back in a couple of days and update my answer accordingly.

Best,
Frederik