[elasticsearch 8.3] Master node 'dies' once per week

We have 2 very similar self managed cluster, staging and production, on AWS and on arm architecture (graviton2) and amazon linux 2 base image: 3 master nodes and 3 datanodes. In all instances we have installed monit that check the elasticsearch pid.
The instances are all based on the same AMI, at boot it self determine if it is a master or data node, and the environment. The AMI is more or less the same since years ago, when it was based on x86 architecture (of course the Elasticsearch were changed since we moved to arm, but all the remaining custom scripts are the same.)
Now, since March, when we moved to arm (ver 7.17), we have the 'problem' that every week (more or less, there's not a precise time interval) monit says the process Elasticsearch with pid xxxxx is not running, so it restarts it. It ALWAYS happens to the elected master, and when it happens the re-election is super fast and no alarms are triggered. It never happens for a datanode (luckily), and it happens both for staging and production. We have a third cluster (logging) with the same AMI, where all 6 nodes are both master and data nodes, and it never happens here.

We recently upgraded all the cluster to Elasticsearch 8.3 but the problem persists...any idea?

The monit configuration is:

check process elasticsearch with pidfile /var/run/elasticsearch/elasticsearch.pid
    start program = "/bin/systemctl start elasticsearch"
    stop program = "/bin/systemctl stop elasticsearch"
    if cpu > 90% for 5 cycles then alert
    if totalmem > 90% for 5 cycles then alert
    group elasticsearch

and the elasticsearch.yml


    "bootstrap": {
        "memory_lock": true
    },
    "cloud": {
        "node": {
            "auto_attributes": true
        }
    },
    "cluster": {
        "name": "staging.elasticsearch",
        "routing": {
            "allocation": {
                "allow_rebalance": "always",
                "awareness": {
                    "attributes": "aws_availability_zone"
                },
                "enable": "all",
                "node_concurrent_recoveries": 11
            },
            "rebalance": {
                "enable": "all"
            }
        }
    },
    "discovery": {
        "ec2": {
            "endpoint": "ec2.eu-west-1.amazonaws.com",
            "groups": [
                "staging-elasticsearch-30mhz-com",
                "staging-master-elasticsearch-30mhz-com"
            ],
            "host_type": "public_dns"
        },
        "seed_providers": "ec2"
    },
    "gateway": {
        "expected_data_nodes": 3,
        "recover_after_data_nodes": 2
    },
    "indices": {
        "recovery": {
            "max_bytes_per_sec": "50mb"
        }
    },
    "network.host": [
        "_ec2:publicDns_",
        "localhost"
    ],
 "node": {
        "roles": [
            "master",
            "remote_cluster_client"
        ]
    },
    "path": {
        "data": "/dev/shm/elasticsearch/data",
        "logs": "/dev/shm/elasticsearch/log"
    },
    "xpack": {
        "security": {
            "enabled": false
        }
    }
}         

What do the Elasticsearch logs say on that node?

Thanks for looking onto!

The logs doesn't say anything: you just see that suddenly elasticsearch restarts.
This is from this morning:

[2022-08-05T08:35:00,017][INFO ][o.e.x.s.SnapshotLifecycleTask] [ip-172-xx-xx-xx.eu-west-1.compute.internal] snapshot lifecycle policy job [production-hourly-snapshots-1] issued new snapshot creation for [half-hour-production-snapshot-20220805.0835-cmqpki2gtksr5ejwhvqpsw] successfully
[2022-08-05T08:35:00,043][INFO ][o.e.s.SnapshotsService   ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] snapshot [s3.production:half-hour-production-snapshot-20220805.0835-cmqpki2gtksr5ejwhvqpsw/iqqvkxtkRm2MCGJNoS6Psg] started
[2022-08-05T08:35:13,016][INFO ][o.e.s.SnapshotsService   ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] snapshot [s3.production:half-hour-production-snapshot-20220805.0835-cmqpki2gtksr5ejwhvqpsw/iqqvkxtkRm2MCGJNoS6Psg] completed with state [SUCCESS]
[2022-08-05T09:05:00,001][INFO ][o.e.x.s.SnapshotLifecycleTask] [ip-172-xx-xx-xx.eu-west-1.compute.internal] snapshot lifecycle policy [production-hourly-snapshots] issuing create snapshot [half-hour-production-snapshot-20220805.0905-znoudy-qt1yo6k-9zyza1a]
[2022-08-05T09:05:00,002][INFO ][o.e.x.s.SnapshotLifecycleTask] [ip-172-xx-xx-xx.eu-west-1.compute.internal] snapshot lifecycle policy job [production-hourly-snapshots-1] issued new snapshot creation for [half-hour-production-snapshot-20220805.0905-znoudy-qt1yo6k-9zyza1a] successfully
[2022-08-05T09:05:00,055][INFO ][o.e.s.SnapshotsService   ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] snapshot [s3.production:half-hour-production-snapshot-20220805.0905-znoudy-qt1yo6k-9zyza1a/UbmuieXrSxmIjdO__8F_MA] started
[2022-08-05T09:05:12,231][INFO ][o.e.s.SnapshotsService   ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] snapshot [s3.production:half-hour-production-snapshot-20220805.0905-znoudy-qt1yo6k-9zyza1a/UbmuieXrSxmIjdO__8F_MA] completed with state [SUCCESS]
[2022-08-05T09:25:19,695][INFO ][o.e.n.Node               ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] version[8.3.2], pid[5610], build[rpm/8b0b1f23fbebecc3c88e4464319dea8989f374fd/2022-07-06T15:15:15.901688194Z], OS[Linux/5.10.126-117.518.amzn2.aarch64/aarch64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/18.0.1.1/18.0.1.1+2-6]
[2022-08-05T09:25:19,700][INFO ][o.e.n.Node               ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] JVM home [/usr/share/elasticsearch/jdk], using bundled JDK [true]
[2022-08-05T09:25:19,700][INFO ][o.e.n.Node               ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -Djava.security.manager=allow, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -Djava.locale.providers=SPI,COMPAT, --add-opens=java.base/java.io=ALL-UNNAMED, -XX:+UseG1GC, -Djava.io.tmpdir=/tmp/elasticsearch-15980518567505792022, -XX:+HeapDumpOnOutOfMemoryError, -XX:+ExitOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Des.transport.cname_in_publish_address=true, -Des.search.ignore_awareness_attributes=true, -Xms2304m, -Xmx2304m, -XX:MaxDirectMemorySize=1207959552, -XX:G1HeapRegionSize=4m, -XX:InitiatingHeapOccupancyPercent=30, -XX:G1ReservePercent=15, -Des.distribution.type=rpm, --module-path=/usr/share/elasticsearch/lib, -Djdk.module.main=org.elasticsearch.server]
[2022-08-05T09:25:22,319][INFO ][c.a.c.i.j.JacksonVersion ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] Package versions: jackson-annotations=2.13.2, jackson-core=2.13.2, jackson-databind=2.13.2.2, jackson-dataformat-xml=2.13.2, jackson-datatype-jsr310=2.13.2, azure-core=1.27.0, Troubleshooting version conflicts: https://aka.ms/azsdk/java/dependency/troubleshoot
[2022-08-05T09:25:24,360][INFO ][o.e.p.PluginsService     ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] loaded module [aggs-matrix-stats]
[2022-08-05T09:25:24,360][INFO ][o.e.p.PluginsService     ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] loaded module [analysis-common]
[2022-08-05T09:25:24,360][INFO ][o.e.p.PluginsService     ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] loaded module [constant-keyword]
[2022-08-05T09:25:24,361][INFO ][o.e.p.PluginsService     ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] loaded module [data-streams]
[2022-08-05T09:25:24,361][INFO ][o.e.p.PluginsService     ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] loaded module [frozen-indices]
[2022-08-05T09:25:24,361][INFO ][o.e.p.PluginsService     ] [ip-172-xx-xx-xx.eu-west-1.compute.internal] loaded module [ingest-common]

It was happily doing snapshots every half an hour and then suddenly it was restarted.

It means that monit didn't found its pid in the active processes, right?

That in itself is informative, and generally means that Elasticsearch was killed by the OOM killer because the OS needed its memory for something else. Your heap size is 2304MiB, and Elasticsearch also needs at least 1152MiB of direct memory, plus a small amount of other overhead to run the JVM itself. Make sure your instance has enough memory for all that and doesn't run any other memory-heavy processes alongside Elasticsearch. Use dmesg to check your kernel logs for more details.

1 Like

Thank you! I'll do it (later).

BTW, the master instance has always been a 4gb RAM instance, and we let Elasticsearch decide how much heap to reserve. The only other process running is metricbeat (and monit). Maybe they are using more memory...

EDIT: actually metricbeat is a recent addition, maybe it's the final straw :sweat_smile:

Depends how you set it up, but yes by default Metricbeat gets a lot of its cluster-wide metrics from the elected master and that can be pretty heavyweight. See these docs, particularly the bit about using scope: cluster in step 2.

Thanks again, I've verified that it was indeed the OOM killer via dmesg. Now I have 2 possibilities:

  1. Bigger instances (more RAM)
  2. Manually set the heap (as I was doing before, reserving 2000 instead of 2300)

Before 'giving up' by implementing 1, I'll give a shot to 2.

I'm curious what was confounding the auto-heap-sizing process. 2304MiB of heap + 1152MiB of direct memory should fit just fine on a 4GiB instance. Did you find anything else running alongside Elasticsearch that had heavy memory needs? If so, a third option would be to address that and leave Elasticsearch alone.

The other running processes are: monit, metricbeat and filebeat. Then there is amazon-ssm-agen and all the default aws stuff, all with very small memory usage.

I re-introduced the script that was setting the heap: it checks at boot if the instance is a masternode, if it is the case it reads the heap value from the userdata an it sets the value in /etc/elasticsearch/jvm.options.d/myfile.options

I've 'rotated' the masternodes and now I'll wait 1 week to see if it solves the problem (I've set 2GB of heap).