Elasticsearch: 2 out of 4 master replicas go down


We bootstrapped an ELK stack (on Kubernetes, using Helm) to collect all the audit and application logs of the system. This has been running for almost a year now, and we were able to achieve high availability with this setup (4 master replica nodes and 6 data replica nodes).

However, I recently received an alert that Elasticsearch health is yellow because of missing replica shards. When I checked the service, only 2 out of 4 master nodes were running. I tried restarting the two failed pods/replicas, but I'm getting a CrashLoopBackOff.
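For context, these are the kinds of commands I ran to diagnose it — a sketch only; the namespace, pod label, and the assumption that port 9200 is port-forwarded locally are all illustrative and depend on your Helm release:

```shell
# See which pods are failing and why (namespace "elastic" is an assumption):
kubectl -n elastic get pods
kubectl -n elastic describe pod elasticsearch-es-master-2   # events: probe failures, OOMKilled, etc.
kubectl -n elastic logs elasticsearch-es-master-2 --previous  # stderr of the last crashed container

# The cluster's own view of its health and unassigned shards
# (assumes "kubectl port-forward svc/... 9200" or similar is already running):
curl -s "http://localhost:9200/_cluster/health?pretty"
curl -s "http://localhost:9200/_cat/shards?v" | grep -i unassigned
```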

I'm new to Elasticsearch — is there a recommendation on how I can get the Elasticsearch service back to green status?

Which version of Elasticsearch are you running?

Hi Christian,

I'm using elasticsearch:7.14.0

Elasticsearch requires a strict majority of master-eligible nodes to be available in order to elect a master and function properly, so if 2 out of 4 master-eligible nodes are unavailable I would expect the cluster to be red (since the majority of 4 is 3).
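The quorum arithmetic behind that statement can be sketched as:

```shell
# A strict majority of master-eligible nodes must be reachable
# to elect a master: quorum = floor(n / 2) + 1.
masters=4
quorum=$(( masters / 2 + 1 ))
echo "With $masters master-eligible nodes, quorum is $quorum"   # quorum is 3

# With 2 of 4 masters down, only 2 remain, which is below the quorum of 3,
# so no master can be elected. This is also why an odd count is preferred:
# 3 masters have a quorum of 2 and tolerate one failure.
```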

To safely get the cluster back online you will need to restore at least one of the downed master-eligible nodes. Otherwise you may need to restore from a snapshot.
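If it does come to restoring from a snapshot, the shape of the calls looks roughly like this — the repository name `my_backup` and snapshot name `my_snapshot` are placeholders for whatever your snapshot repository is actually called:

```shell
# List the snapshots available in the repository (name is an assumption):
curl -s "http://localhost:9200/_snapshot/my_backup/_all?pretty"

# Restore all indices from a chosen snapshot, without the cluster-wide state:
curl -s -X POST "http://localhost:9200/_snapshot/my_backup/my_snapshot/_restore" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "*", "include_global_state": false}'
```

This only works if a snapshot repository was registered and snapshots were being taken before the failure.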

If this is not possible there may be unsafe ways to address this, which could result in data loss. I am however not familiar enough with this to provide any guidance.

Hi Christian,

I read some documentation saying that restarts need to be controlled via the operator. I'm just not sure how to do it, though.

What errors are the 2 failing nodes logging?

Hi Christian, I'll post below the stderr output I get from the pods. I just want to mention that the cluster also has enough CPU and memory.

{"type": "server", "timestamp": "2022-10-06T11:13:20,901Z", "level": "INFO", "component": "o.e.i.g.DatabaseRegistry", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "initialized database registry, using geoip-databases directory [/tmp/elasticsearch-11429371485660161874/geoip-databases/SXTUaJIfQBadNKAIEJhuwQ]" }
{"type": "server", "timestamp": "2022-10-06T11:13:21,476Z", "level": "INFO", "component": "o.e.t.NettyAllocator", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "creating NettyAllocator with the following configs: [name=elasticsearch_configured, chunk_size=1mb, suggested_max_allocation_size=1mb, factors={es.unsafe.use_netty_default_chunk_and_page_size=false, g1gc_enabled=true, g1gc_region_size=4mb}]" }
{"type": "server", "timestamp": "2022-10-06T11:13:21,549Z", "level": "INFO", "component": "o.e.d.DiscoveryModule", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "using discovery type [zen] and seed hosts providers [settings]" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,027Z", "level": "INFO", "component": "o.e.g.DanglingIndicesState", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,495Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "initialized" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,496Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "starting ..." }
{"type": "server", "timestamp": "2022-10-06T11:13:22,593Z", "level": "INFO", "component": "o.e.x.s.c.f.PersistentCache", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "persistent cache index loaded" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,730Z", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "publish_address {}, bound_addresses {[::]:9301}" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,959Z", "level": "INFO", "component": "o.e.b.BootstrapChecks", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "bound or publishing to a non-loopback address, enforcing bootstrap checks" }
ERROR: [2] bootstrap checks failed. You must address the points described in the following [2] lines before starting Elasticsearch.
bootstrap check failure [1] of [2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
bootstrap check failure [2] of [2]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
ERROR: Elasticsearch did not exit normally - check the logs at /usr/share/elasticsearch/logs/docker-cluster.log
{"type": "server", "timestamp": "2022-10-06T11:13:22,990Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "stopping ..." }
{"type": "server", "timestamp": "2022-10-06T11:13:23,009Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "stopped" }
{"type": "server", "timestamp": "2022-10-06T11:13:23,010Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "closing ..." }
{"type": "server", "timestamp": "2022-10-06T11:13:23,022Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "closed" }
{"type": "server", "timestamp": "2022-10-06T11:13:23,024Z", "level": "INFO", "component": "o.e.x.m.p.NativeController", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "Native controller process has stopped - no new native processes can be started" }

If you look at the error messages it is clear something is wrong with the configuration: the two bootstrap checks point at the host's `vm.max_map_count` setting and at missing discovery settings. Have a look at these, correct them and see if the nodes are able to come back up.
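For reference, the first bootstrap failure has a well-known fix on Kubernetes — a sketch, assuming you have node access or can run a privileged initContainer in the pod spec:

```shell
# Bootstrap check [1]: raise vm.max_map_count on the host running the pod.
# Run once per worker node, or from a privileged initContainer that
# executes before the Elasticsearch container starts.
sysctl -w vm.max_map_count=262144

# Persist the setting across node reboots:
echo 'vm.max_map_count=262144' >> /etc/sysctl.d/99-elasticsearch.conf
```

The second failure suggests the node is starting without any of `discovery.seed_hosts`, `discovery.seed_providers`, or `cluster.initial_master_nodes` set — settings the operator/Helm chart normally injects into `elasticsearch.yml` — so the pod may be coming up with a missing or broken configuration rather than the one the rest of the cluster uses.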

