Elasticsearch: 2 out of 4 master replicas go down


We bootstrapped an ELK stack (on Kubernetes, using Helm) to collect all the audit and application logs of the system. This has been running for almost a year now, and we were able to achieve high availability with this setup (4 master replica nodes and 6 data replica nodes).

However, I recently received an alert that Elasticsearch health is yellow because of missing replica shards. When I checked the service, only 2 out of 4 master nodes were running. I tried restarting the two failed pods/replicas, but I'm getting a CrashLoopBackOff.
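For context, these are the kinds of commands I ran to diagnose it — a sketch only; the namespace, pod label, and the assumption that port 9200 is port-forwarded locally are all illustrative and depend on your Helm release:

```shell
# See which pods are failing and why (namespace "elastic" is an assumption):
kubectl -n elastic get pods
kubectl -n elastic describe pod elasticsearch-es-master-2   # events: probe failures, OOMKilled, etc.
kubectl -n elastic logs elasticsearch-es-master-2 --previous  # stderr of the last crashed container

# The cluster's own view of its health and unassigned shards
# (assumes "kubectl port-forward svc/... 9200" or similar is already running):
curl -s "http://localhost:9200/_cluster/health?pretty"
curl -s "http://localhost:9200/_cat/shards?v" | grep -i unassigned
```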

I'm new to Elasticsearch — is there a recommendation on how I can get the Elasticsearch service back to green status?

Which version of Elasticsearch are you running?

Hi Christian,

I'm using elasticsearch:7.14.0

Elasticsearch requires a strict majority of master-eligible nodes to be available in order to elect a master and function properly, so if 2 out of 4 master-eligible nodes are unavailable I would expect the cluster to be red (since the majority of 4 is 3).
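The quorum arithmetic behind that statement can be sketched as:

```shell
# A strict majority of master-eligible nodes must be reachable
# to elect a master: quorum = floor(n / 2) + 1.
masters=4
quorum=$(( masters / 2 + 1 ))
echo "With $masters master-eligible nodes, quorum is $quorum"   # quorum is 3

# With 2 of 4 masters down, only 2 remain, which is below the quorum of 3,
# so no master can be elected. This is also why an odd count is preferred:
# 3 masters have a quorum of 2 and tolerate one failure.
```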

To safely get the cluster back online you will need to restore at least one of the downed master-eligible nodes. Otherwise you may need to restore from a snapshot.
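If it does come to restoring from a snapshot, the shape of the calls looks roughly like this — the repository name `my_backup` and snapshot name `my_snapshot` are placeholders for whatever your snapshot repository is actually called:

```shell
# List the snapshots available in the repository (name is an assumption):
curl -s "http://localhost:9200/_snapshot/my_backup/_all?pretty"

# Restore all indices from a chosen snapshot, without the cluster-wide state:
curl -s -X POST "http://localhost:9200/_snapshot/my_backup/my_snapshot/_restore" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "*", "include_global_state": false}'
```

This only works if a snapshot repository was registered and snapshots were being taken before the failure.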

If this is not possible there may be unsafe ways to address this, which could result in data loss. I am however not familiar enough with this to provide any guidance.

Hi Christian,

I read some documentation saying that restarts need to be controlled via the operator. I'm just not sure how to do it, though.

What errors are the 2 failing nodes logging?

Hi Christian, I'll post below the stderr output I get from the pods. I just want to mention that the cluster also has enough CPU and memory.

{"type": "server", "timestamp": "2022-10-06T11:13:20,901Z", "level": "INFO", "component": "o.e.i.g.DatabaseRegistry", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "initialized database registry, using geoip-databases directory [/tmp/elasticsearch-11429371485660161874/geoip-databases/SXTUaJIfQBadNKAIEJhuwQ]" }
{"type": "server", "timestamp": "2022-10-06T11:13:21,476Z", "level": "INFO", "component": "o.e.t.NettyAllocator", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "creating NettyAllocator with the following configs: [name=elasticsearch_configured, chunk_size=1mb, suggested_max_allocation_size=1mb, factors={es.unsafe.use_netty_default_chunk_and_page_size=false, g1gc_enabled=true, g1gc_region_size=4mb}]" }
{"type": "server", "timestamp": "2022-10-06T11:13:21,549Z", "level": "INFO", "component": "o.e.d.DiscoveryModule", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "using discovery type [zen] and seed hosts providers [settings]" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,027Z", "level": "INFO", "component": "o.e.g.DanglingIndicesState", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,495Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "initialized" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,496Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "starting ..." }
{"type": "server", "timestamp": "2022-10-06T11:13:22,593Z", "level": "INFO", "component": "o.e.x.s.c.f.PersistentCache", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "persistent cache index loaded" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,730Z", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "publish_address {}, bound_addresses {[::]:9301}" }
{"type": "server", "timestamp": "2022-10-06T11:13:22,959Z", "level": "INFO", "component": "o.e.b.BootstrapChecks", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "bound or publishing to a non-loopback address, enforcing bootstrap checks" }
ERROR: [2] bootstrap checks failed. You must address the points described in the following [2] lines before starting Elasticsearch.
bootstrap check failure [1] of [2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
bootstrap check failure [2] of [2]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
ERROR: Elasticsearch did not exit normally - check the logs at /usr/share/elasticsearch/logs/docker-cluster.log
{"type": "server", "timestamp": "2022-10-06T11:13:22,990Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "stopping ..." }
{"type": "server", "timestamp": "2022-10-06T11:13:23,009Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "stopped" }
{"type": "server", "timestamp": "2022-10-06T11:13:23,010Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "closing ..." }
{"type": "server", "timestamp": "2022-10-06T11:13:23,022Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "closed" }
{"type": "server", "timestamp": "2022-10-06T11:13:23,024Z", "level": "INFO", "component": "o.e.x.m.p.NativeController", "cluster.name": "docker-cluster", "node.name": "elasticsearch-es-master-2", "message": "Native controller process has stopped - no new native processes can be started" }

If you look at the error messages it is clear something is wrong with the configuration: the two bootstrap checks point at the host's `vm.max_map_count` setting and at missing discovery settings. Have a look at these, correct them and see if the nodes are able to come back up.
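For reference, the first bootstrap failure has a well-known fix on Kubernetes — a sketch, assuming you have node access or can run a privileged initContainer in the pod spec:

```shell
# Bootstrap check [1]: raise vm.max_map_count on the host running the pod.
# Run once per worker node, or from a privileged initContainer that
# executes before the Elasticsearch container starts.
sysctl -w vm.max_map_count=262144

# Persist the setting across node reboots:
echo 'vm.max_map_count=262144' >> /etc/sysctl.d/99-elasticsearch.conf
```

The second failure suggests the node is starting without any of `discovery.seed_hosts`, `discovery.seed_providers`, or `cluster.initial_master_nodes` set — settings the operator/Helm chart normally injects into `elasticsearch.yml` — so the pod may be coming up with a missing or broken configuration rather than the one the rest of the cluster uses.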

