## Bug Report
**What did you do?**
Killed more than half of the master pods (two out of three).
**What did you expect to see?**
Once the master pods restarted, the cluster would recover and return to normal.
**What did you see instead? Under which circumstances?**
The master pods are running, but the cluster is unhealthy: no master can be discovered or elected.
**Environment**
* ECK version:
1.6
* Kubernetes information:
  * On premise? Yes, VMs running on KVM
  * Cloud: GKE / EKS / AKS? No
  * Kubernetes distribution: kubespray
```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-13T02:40:46Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
```
* Resource definition:
```
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es-v7132
spec:
  version: 7.13.2
  http:
    service:
      spec:
        type: LoadBalancer
    tls:
      selfSignedCertificate:
        disabled: true
  nodeSets:
  - name: masters
    count: 3
    config:
      node.roles: ["master", "data", "ingest"]
      # before 7.9.0, use below
      # node.master: true
      # node.data: true
      # node.ingest: true
      # node.store.allow_mmap: false
    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: -Xms4g -Xmx4g
          resources:
            requests:
              memory: 6Gi
              cpu: 0.5
            limits:
              memory: 6Gi
              cpu: 2
    volumeClaimTemplates:
    - metadata:
        name: es-v7132-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
```
* Steps to reproduce the problem:
After all masters started, the cluster worked fine.
Kill two of the three master pods:
`kubectl delete pod es-v7132-es-masters-1 es-v7132-es-masters-2`
The two pods restart, but master discovery fails and the cluster never returns to normal (full sequence sketched below):
`curl -u elastic:$PASSWD es-http-svc:9200/_cat/nodes?v`
`{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}`
* Logs:
Pod es-v7132-es-masters-0 (the surviving master):
{"type": "server", "timestamp": "2021-07-12T08:28:09,235Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-v7132", "node.name": "es-v7132-es-masters-0", "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [qxvFr7unQji6E3JVBDrOKQ, OzuTW949R1C9N1LHsHktUg, V8FWuqLFSNyy7GqvIuVo5A], have discovered [{es-v7132-es-masters-0}{OzuTW949R1C9N1LHsHktUg}{euLJjwHfTxGw3DfQmWBPQg}{10.233.96.222}{10.233.96.222:9300}{dim}, {es-v7132-es-masters-2}{MQJtnhEIRk63YY9uMgBSJA}{zbBssJK9QDmCusEucfNOfg}{10.233.92.215}{10.233.92.215:9300}{dim}, {es-v7132-es-masters-1}{WLiiCtvVT_qeMFXMjPZS1w}{HfK-5srjSimg9rTakR_yHw}{10.233.90.164}{10.233.90.164:9300}{dim}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.233.90.163:9300, 10.233.92.214:9300] from hosts providers and [{es-v7132-es-masters-1}{qxvFr7unQji6E3JVBDrOKQ}{m0O6ZGLhRtS9dyAYk89HPg}{10.233.90.163}{10.233.90.163:9300}{dim}, {es-v7132-es-masters-2}{V8FWuqLFSNyy7GqvIuVo5A}{rfyW9KfGTP-pum6H7oX9Qg}{10.233.92.214}{10.233.92.214:9300}{dim}, {es-v7132-es-masters-0}{OzuTW949R1C9N1LHsHktUg}{euLJjwHfTxGw3DfQmWBPQg}{10.233.96.222}{10.233.96.222:9300}{dim}] from last-known cluster state; node term 9, last-accepted version 241 in term 9", "cluster.uuid": "JawL-M1PSNW5UjKCxyLa9A", "node.id": "OzuTW949R1C9N1LHsHktUg" }
The restarted pods (es-v7132-es-masters-1 and es-v7132-es-masters-2) log messages like:
{"type": "server", "timestamp": "2021-07-12T08:42:03,856Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-v7132", "node.name": "es-v7132-es-masters-1", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{es-v7132-es-masters-1}{WLiiCtvVT_qeMFXMjPZS1w}{HfK-5srjSimg9rTakR_yHw}{10.233.90.164}{10.233.90.164:9300}{dim}, {es-v7132-es-masters-0}{OzuTW949R1C9N1LHsHktUg}{euLJjwHfTxGw3DfQmWBPQg}{10.233.96.222}{10.233.96.222:9300}{dim}, {es-v7132-es-masters-2}{MQJtnhEIRk63YY9uMgBSJA}{zbBssJK9QDmCusEucfNOfg}{10.233.92.215}{10.233.92.215:9300}{dim}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.233.92.215:9300, 10.233.96.222:9300] from hosts providers and [{es-v7132-es-masters-1}{WLiiCtvVT_qeMFXMjPZS1w}{HfK-5srjSimg9rTakR_yHw}{10.233.90.164}{10.233.90.164:9300}{dim}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }