We have two very similar self-managed clusters, staging and production, on AWS, running on the ARM architecture (Graviton2) with an Amazon Linux 2 base image. Each cluster has 3 master nodes and 3 data nodes. On every instance we have installed monit, which checks the Elasticsearch pid.
The instances are all based on the same AMI; at boot each instance determines by itself whether it is a master or a data node, and which environment it belongs to. The AMI has been more or less the same for years, going back to when it was based on the x86 architecture (of course the Elasticsearch package changed when we moved to ARM, but all the remaining custom scripts are the same).
Since March, when we moved to ARM (version 7.17), we have had the 'problem' that roughly once a week (there is no precise interval) monit reports that the Elasticsearch process with pid xxxxx is not running, and restarts it. It ALWAYS happens to the elected master, and when it happens the re-election is very fast and no alarms are triggered. Luckily it never happens to a data node. It happens in both staging and production. We have a third cluster (logging) built from the same AMI, where all 6 nodes are both master and data nodes, and it never happens there.
We recently upgraded all the clusters to Elasticsearch 8.3, but the problem persists. Any ideas?
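To double-check that it really is always the elected master, something like the following could be run before and after a monit restart to record who the master is. This is only a sketch: it assumes the `_cat/master?format=json` endpoint is reachable on `localhost:9200` without authentication (security is disabled in our config), and the function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def parse_elected_master(body: str) -> str:
    """Return the node name from a `_cat/master?format=json` response body."""
    rows = json.loads(body)
    if not rows:
        raise ValueError("no elected master reported")
    return rows[0]["node"]


def fetch_elected_master(base_url: str = "http://localhost:9200") -> str:
    """Ask a node who the elected master is (base_url is an assumption)."""
    with urlopen(f"{base_url}/_cat/master?format=json", timeout=5) as resp:
        return parse_elected_master(resp.read().decode("utf-8"))
```

Calling `fetch_elected_master()` from a cron job or from a monit `exec` action and logging the result with a timestamp would make it possible to line up master changes with the restarts.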
The monit configuration is:
check process elasticsearch with pidfile /var/run/elasticsearch/elasticsearch.pid
  start program = "/bin/systemctl start elasticsearch"
  stop program = "/bin/systemctl stop elasticsearch"
  if cpu > 90% for 5 cycles then alert
  if totalmem > 90% for 5 cycles then alert
  group elasticsearch
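For context, my understanding is that a pidfile-based check boils down to roughly the following. This is a minimal sketch of the idea, not monit's actual implementation; the function name `pidfile_process_alive` is hypothetical. It shows how a missing, stale, or unreadable pidfile would be reported as "not running" even if Elasticsearch itself were still alive.

```python
import os


def pidfile_process_alive(pidfile: str) -> bool:
    """Return True if the pid stored in `pidfile` refers to a live process."""
    try:
        with open(pidfile) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed pidfile counts as "not running".
        return False
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

Running this against `/var/run/elasticsearch/elasticsearch.pid` at the moment monit complains would help tell apart "the process really died" from "the pidfile disappeared or went stale".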
And the elasticsearch.yml (for a staging master node) is:
{
  "bootstrap": {
    "memory_lock": true
  },
  "cloud": {
    "node": {
      "auto_attributes": true
    }
  },
  "cluster": {
    "name": "staging.elasticsearch",
    "routing": {
      "allocation": {
        "allow_rebalance": "always",
        "awareness": {
          "attributes": "aws_availability_zone"
        },
        "enable": "all",
        "node_concurrent_recoveries": 11
      },
      "rebalance": {
        "enable": "all"
      }
    }
  },
  "discovery": {
    "ec2": {
      "endpoint": "ec2.eu-west-1.amazonaws.com",
      "groups": [
        "staging-elasticsearch-30mhz-com",
        "staging-master-elasticsearch-30mhz-com"
      ],
      "host_type": "public_dns"
    },
    "seed_providers": "ec2"
  },
  "gateway": {
    "expected_data_nodes": 3,
    "recover_after_data_nodes": 2
  },
  "indices": {
    "recovery": {
      "max_bytes_per_sec": "50mb"
    }
  },
  "network.host": [
    "_ec2:publicDns_",
    "localhost"
  ],
  "node": {
    "roles": [
      "master",
      "remote_cluster_client"
    ]
  },
  "path": {
    "data": "/dev/shm/elasticsearch/data",
    "logs": "/dev/shm/elasticsearch/log"
  },
  "xpack": {
    "security": {
      "enabled": false
    }
  }
}
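Since the file above mixes nested keys with one dotted key (`network.host`), it may be easier to compare the master and data node configs in flat dotted form. A small sketch, assuming the config has been loaded into a Python dict; `flatten_settings` is a hypothetical helper, and the example input is a shortened excerpt of the config above.

```python
import json


def flatten_settings(node, prefix=""):
    """Flatten nested settings into dotted keys, e.g. node.roles."""
    flat = {}
    for key, value in node.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_settings(value, dotted))
        else:
            # Lists and scalars are kept as leaf values.
            flat[dotted] = value
    return flat


# Shortened excerpt of the config above, for illustration only.
example = json.loads(
    '{"node": {"roles": ["master", "remote_cluster_client"]},'
    ' "path": {"data": "/dev/shm/elasticsearch/data"}}'
)
for name, value in sorted(flatten_settings(example).items()):
    print(name, "=", value)
```

Diffing the flattened output of a master node against that of a data node (or a node of the logging cluster) would show exactly which settings differ between the nodes that get restarted and those that do not.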