Elasticsearch version: 7.13.4
Background Introduction
There are a total of 7 machines in the Elasticsearch cluster where the accident occurred:
- 127.0.204.193
- 127.0.204.194
- 127.0.204.195
- 127.0.220.73
- 127.0.220.74
- 127.0.220.220
- 127.0.220.221
The machine configurations of 193, 194, and 195 are the same, as follows:
- CPU: 32 cores
- Memory: 128G
- Disk: 4T * 3
- System disk (mounted separately): 40G
The machine configurations for 73, 74, 220, and 221 are the same, as follows:
- CPU: 32 cores
- Memory: 128G
- Disk: 10T
- System disk (mounted separately): 50G
All 7 machines use Alibaba Cloud efficient cloud disks: https://help.aliyun.com/zh/ecs/user-guide/disks-2
In other words, the maximum disk throughput (read + write combined) is capped at 140 MB/s.
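To check whether the disks were actually hitting that 140 MB/s ceiling during the incident, something like the following on each machine would help. This is only a sketch: it assumes the sysstat package is installed, and vdb/vdc/vdd are placeholder device names for the data disks.

```bash
# Extended per-device stats every 5 seconds, reported in MB (sysstat package).
# vdb/vdc/vdd are placeholder device names for the actual data disks.
iostat -xm 5 vdb vdc vdd
# If rMB/s + wMB/s sits near 140 and %util stays around 100%,
# the cloud-disk throughput cap is the bottleneck.
```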
Question
Due to budget constraints, the deployment is quite complex: a Kafka cluster, a ZooKeeper cluster, and two Elasticsearch clusters all run on these 7 machines.
The Kafka and ZooKeeper clusters are deployed on 193, 194, and 195.
Each of the two Elasticsearch clusters has 7 nodes (that is, each of the 7 machines runs one instance from each Elasticsearch cluster).
On 193, 194, and 195, the Kafka, ZooKeeper, and Elasticsearch clusters were all using the three data disks.
Before this incident, the Kafka cluster had been experiencing write latency. The analysis was that the two Elasticsearch clusters shared disks with the Kafka cluster, saturating disk IO and delaying Kafka writes.
So the usage of the three disks on 193, 194, and 195 was adjusted to:
- One 4T disk dedicated to Kafka and ZooKeeper
- The other two 4T disks shared by the two Elasticsearch clusters (one of the Elasticsearch clusters holds less than 10 GB of data, so its load can be ignored); a quick way to confirm the new layout from Elasticsearch's side is sketched after this list
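For reference, the filesystem stats API shows which mount points each Elasticsearch node is actually writing to after such a change. The host and HTTP port below are placeholders (only the transport port 9301 appears in the logs), so adjust them to the real endpoint.

```bash
# Which data paths / mounts each node reports; host and HTTP port are placeholders.
curl -s 'http://127.0.204.193:9200/_nodes/stats/fs?filter_path=nodes.*.name,nodes.*.fs.data&pretty'
```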
That covers the background. Now for the issues that occurred this time:
- At 5 PM on the 23rd, the Elasticsearch cluster went down
- The cluster had no master (see the quick checks sketched after this list)
- At 9 AM on the 25th, the Elasticsearch cluster recovered and resumed rebalancing shards
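These are the kinds of read-only checks that confirm a no-master state; the host and HTTP port are placeholders for any node's HTTP endpoint.

```bash
# Ask who the elected master is and how the cluster is doing
# (host and HTTP port are placeholders for any node's HTTP endpoint).
curl -s 'http://127.0.204.193:9200/_cat/master?v'
curl -s 'http://127.0.204.193:9200/_cluster/health?pretty'
# With no elected master, both typically fail with errors such as
# MasterNotDiscoveredException or ClusterBlockException (SERVICE_UNAVAILABLE/2/no master).
```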
Analysis
Because the issue on the 23rd was urgent, the immediate remedy was to restart the entire cluster and restore service first.
The "no master" exception logs on 193 at that time were:
{"type": "server", "timestamp": "2023-10-23T09:22:08955Z", "level": "WARN", "component": "o.e.c.c. ClusterFormationFailureHelper", "cluster. name": "business log", "node. name": "es-b-193", "message": "master not discovered or selected yet, an election requires at least 4 nodes with ids from [hYrrmhLHTx QHoDmZ2wATg, jaNLhd1eT6SUYcLJkHpE1Q, 1VQFmt9jQ-6d7fCMjP-vnQ, uzGdH3VeRbOlcDIWeKdgIw, PhjlCea6TNKh4rdZPyPDkA, ZFQdNP4HSgCtPCnMfDdopw, 1WRaRyU-SuCPn8qbz3e4hg], Have discovered [{es-b-193} {ZFQdNP4HSgCtPCnMfDdopw} {k0x5Rog-S0CFhAmorJ3V0Q} {127.0.204.193} {127.0.204.193:9301} {cdfhilmrstw}, {es-b-221} {jaNLhd1eT6SUYcLJkHpE1Q} {XkD-b14XQzq1-o3EiDSuKA} {127.0.220.221} {127.0.220.21:9301} {cdfhilmrstw} PhjlCea6TNKh4rdZPyPDkA {6pQJDRevScad3hr_hJcnTg} {127.0.204.194} {127.0.204.194:9301} cdfhilmrstw}, {es-b-220} hYrrmhLHTx QHoDmZ2wATg} {Qyz9MF2bRmKQq9uPWgTBuw} {127.0.220:220} {127.0.220:220:9301} {cdfhilmrstw}, {es-b-195} {1VQFmt9jQ-6d7fCMjP-vnQ} {eP2NNXnTIWGzg7vE8EWkA} {127.0.204.195} {127.0.204.195:9301} {cdfhilmrstw}, {es-b-74} {uzGdH3VeRbOlcDIWeKdgIW} fhNTddq7Syi7zPDVCejDWQ} {127.0.220.74 {127.0.220.74:9301} {cdfhilmrstw}, {es-b-73} {1WRaRyU-SuCPn8qbz3e4hg} {1iRrVMWcSNGGxsuQYxePSg} {127.0.220.73} {127.0.220.73:9301} {cdfhilmrstw}] which is a quorum; Discovery will continue using [127.0.204.194:9301, 127.0.204.195:9301] from hosts providers and [{es-b-220} hYrrmhLHTx QHoDmZ2wATg} {Qyz9MF2bRmKQq9uPWgTBuw} {127.0.220:20:9301} cdfhilmrstw}, {es-b-73} {1WRaRyU-SuCPn8qbz3e4hg} {1iRrVMWcSNG GxsuQYxePSg} {127.0.220.73} {127.0.220.73:9301} {cdfhilmrstw}, {es-b-193} {ZFQdNP4HSgCtPCnMfDdopw} {k0x5Rog-S0CFhAmorJ3V0Q} {127.0.204.193} {127.0.204.193:9301} cdfhilmrstw}, {es-b-74} uzGdH3VeRbOlcDIWeKdgIW} fhNTddq7Syi7zPDVCejDWQ} {127.0.220.74} {127.0.220.74:9301} cdfhilmrstw}, {es-b-194} PhjlCea6TNKh4rdZPyPDkA} 6pQJDRevScad3hr_hJcnTg} {127.0.204.194} {127.0.204.194:9301} {cdfhilmrstw}, {es-b-221} {jaNLhd1eT6SUYcLJkHpE1Q} {XkD-b14XQzq1-o3EiDSuKA} {127.0.220.221} {127.0.220.221: 9301} cdfhilmrstw}, {es-b-195} {1VQFmt9jQ-6d7fCMjP-vnQ} {e_P2NNXnTIWGzg7vE8EWkA} {127.0.204.195} {127.0.204.195:9301} {cdfhilmrstw}] from last known cluster state; Node term 43, last accepted version 397615 in term 43 "," cluster. uuid ":" ArYY qmCTbCQTDUI8ogsBg "," node. id ":" ZFQdNP4HSgCtPCnMfDdopw "}
{"type": "server", "timestamp": "2023-10-23T09:22:111119Z", "level": "WARN", "component": "r. suppressed", "cluster. name": "business log", "node. name": "es-b-193", "message": "path:/_cat/nodes, params: {h=ip, name, heap. percent, heap. current, heap. max, ram. percent, ram. current, ram. max, ram. node. role, master, CPU, load_1m, load_5m, load_15m, disk. used_percent, disk. used, disk. total}", Cluster. uuid ":" ArYy qmCTbCQTDUI8ogsBg "," node. id ":" ZFQdNP4HSgCtPCnMfDdopw ",
Org. lasticsearch. cluster. block. ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master]\ , n "," stream ":" stdout "," time ":" 2023-10-23T09:21:34:490903538Z "}
{"log": "" at org. lasticsearch. cluster. block. ClusterBlocks. globalBlockedException (ClusterBlocks. java: 179
After recovery, the main phenomenon observed while troubleshooting the logs was that node 193 was frequently added and removed:
added {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397620, reason: ApplyCommitRequest{term=43, version=397620, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{127.0.220.73}{127.0.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}
removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397621, reason: ApplyCommitRequest{term=43, version=397621, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{127.0.220.73}{127.0.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}
However, the CPU, memory, and disk IO load on these nodes was not particularly high; it all looked relatively normal.
I don't understand why node 193 keeps being added and removed.
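To back up the "load looks normal" observation from inside the cluster while 193 is flapping, the same _cat/nodes columns as in the logged request, plus a hot-threads snapshot on 193, are useful. Host and HTTP port are placeholders.

```bash
# Resource view of every node, using the same columns as the _cat/nodes call in the logs above.
curl -s 'http://127.0.204.193:9200/_cat/nodes?v&h=ip,name,heap.percent,ram.percent,cpu,load_1m,load_5m,disk.used_percent'
# Snapshot of the busiest threads on es-b-193 around the time it drops out.
curl -s 'http://127.0.204.193:9200/_nodes/es-b-193/hot_threads?threads=5'
```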
I found the following logs on the master node, 194:
root@jiankunking-es-02:~# docker logs es-b-194 | grep node-left
{"type": "server", "timestamp": "2023-10-23T09:33:18,042Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{127.0.220.221}{127.0.220.221:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 399161, delta: removed {{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{127.0.220.221}{127.0.220.221:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:52:35,140Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: followers check retry count exceeded], term: 52, version: 404805, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:53:24,564Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404807, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:53:30,879Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404810, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T02:11:44,481Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: followers check retry count exceeded], term: 52, version: 405235, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T03:22:03,390Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: followers check retry count exceeded], term: 52, version: 407007, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }