Elasticsearch version: 7.13.4
Background Introduction
There are a total of 7 machines in the Elasticsearch cluster where the accident occurred:
- 127.0.204.193
- 127.0.204.194
- 127.0.204.195
- 127.0.220.73
- 127.0.220.74
- 127.0.220.220
- 127.0.220.221
The machine configurations of 193, 194, and 195 are the same, as follows:
- CPU: 32 cores
- Memory: 128G
- Disk: 4T * 3
- System disk (mounted separately): 40G
The machine configurations for 73, 74, 220, and 221 are the same, as follows:
- CPU: 32 cores
- Memory: 128G
- Disk: 10T
- System disk (mounted separately): 50G
All 7 machines use Alibaba Cloud efficient cloud disks: https://help.aliyun.com/zh/ecs/user-guide/disks-2
In other words, the maximum disk throughput (read + write combined) is capped at 140 MB/s.
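To check whether the disks were actually hitting that 140 MB/s ceiling during the incident, something like the following on each machine would help. This is only a sketch: it assumes the sysstat package is installed, and vdb/vdc/vdd are placeholder device names for the data disks.

```bash
# Extended per-device stats every 5 seconds, reported in MB (sysstat package).
# vdb/vdc/vdd are placeholder device names for the actual data disks.
iostat -xm 5 vdb vdc vdd
# If rMB/s + wMB/s sits near 140 and %util stays around 100%,
# the cloud-disk throughput cap is the bottleneck.
```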
Question
Due to budget constraints, the deployment is quite complex: a Kafka cluster, a ZooKeeper cluster, and two Elasticsearch clusters all run on these 7 machines.
The Kafka and ZooKeeper clusters are deployed on 193, 194, and 195.
Each of the two Elasticsearch clusters has 7 nodes (that is, each of the 7 machines runs one instance from each Elasticsearch cluster).
On 193, 194, and 195, the Kafka, ZooKeeper, and Elasticsearch clusters were all using the three data disks.
Before this incident, the Kafka cluster had been experiencing write latency. The analysis was that the two Elasticsearch clusters shared disks with the Kafka cluster, saturating disk IO and delaying Kafka writes.
So the usage of the three disks on 193, 194, and 195 was adjusted to:
- One 4T disk dedicated to Kafka and ZooKeeper
- The other two 4T disks shared by the two Elasticsearch clusters (one of the Elasticsearch clusters holds less than 10 GB of data, so its load can be ignored); a quick way to confirm the new layout from Elasticsearch's side is sketched after this list
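For reference, the filesystem stats API shows which mount points each Elasticsearch node is actually writing to after such a change. The host and HTTP port below are placeholders (only the transport port 9301 appears in the logs), so adjust them to the real endpoint.

```bash
# Which data paths / mounts each node reports; host and HTTP port are placeholders.
curl -s 'http://127.0.204.193:9200/_nodes/stats/fs?filter_path=nodes.*.name,nodes.*.fs.data&pretty'
```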
That covers the background. Now for the issues that occurred this time:
- At 5 PM on the 23rd, the Elasticsearch cluster went down
- The cluster had no master (see the quick checks sketched after this list)
- At 9 AM on the 25th, the Elasticsearch cluster recovered and resumed rebalancing shards
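These are the kinds of read-only checks that confirm a no-master state; the host and HTTP port are placeholders for any node's HTTP endpoint.

```bash
# Ask who the elected master is and how the cluster is doing
# (host and HTTP port are placeholders for any node's HTTP endpoint).
curl -s 'http://127.0.204.193:9200/_cat/master?v'
curl -s 'http://127.0.204.193:9200/_cluster/health?pretty'
# With no elected master, both typically fail with errors such as
# MasterNotDiscoveredException or ClusterBlockException (SERVICE_UNAVAILABLE/2/no master).
```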
Analysis
Because the issue on the 23rd was urgent, the immediate remedy was to restart the entire cluster and restore service first.
The "no master" exception logs on 193 at that time were:
{"type": "server", "timestamp": "2023-10-23T09:22:08955Z", "level": "WARN", "component": "o.e.c.c. ClusterFormationFailureHelper", "cluster. name": "business log", "node. name": "es-b-193", "message": "master not discovered or selected yet, an election requires at least 4 nodes with ids from [hYrrmhLHTx QHoDmZ2wATg, jaNLhd1eT6SUYcLJkHpE1Q, 1VQFmt9jQ-6d7fCMjP-vnQ, uzGdH3VeRbOlcDIWeKdgIw, PhjlCea6TNKh4rdZPyPDkA, ZFQdNP4HSgCtPCnMfDdopw, 1WRaRyU-SuCPn8qbz3e4hg], Have discovered [{es-b-193} {ZFQdNP4HSgCtPCnMfDdopw} {k0x5Rog-S0CFhAmorJ3V0Q} {127.0.204.193} {127.0.204.193:9301} {cdfhilmrstw}, {es-b-221} {jaNLhd1eT6SUYcLJkHpE1Q} {XkD-b14XQzq1-o3EiDSuKA} {127.0.220.221} {127.0.220.21:9301} {cdfhilmrstw} PhjlCea6TNKh4rdZPyPDkA {6pQJDRevScad3hr_hJcnTg} {127.0.204.194} {127.0.204.194:9301} cdfhilmrstw}, {es-b-220} hYrrmhLHTx QHoDmZ2wATg} {Qyz9MF2bRmKQq9uPWgTBuw} {127.0.220:220} {127.0.220:220:9301} {cdfhilmrstw}, {es-b-195} {1VQFmt9jQ-6d7fCMjP-vnQ} {eP2NNXnTIWGzg7vE8EWkA} {127.0.204.195} {127.0.204.195:9301} {cdfhilmrstw}, {es-b-74} {uzGdH3VeRbOlcDIWeKdgIW} fhNTddq7Syi7zPDVCejDWQ} {127.0.220.74 {127.0.220.74:9301} {cdfhilmrstw}, {es-b-73} {1WRaRyU-SuCPn8qbz3e4hg} {1iRrVMWcSNGGxsuQYxePSg} {127.0.220.73} {127.0.220.73:9301} {cdfhilmrstw}] which is a quorum; Discovery will continue using [127.0.204.194:9301, 127.0.204.195:9301] from hosts providers and [{es-b-220} hYrrmhLHTx QHoDmZ2wATg} {Qyz9MF2bRmKQq9uPWgTBuw} {127.0.220:20:9301} cdfhilmrstw}, {es-b-73} {1WRaRyU-SuCPn8qbz3e4hg} {1iRrVMWcSNG GxsuQYxePSg} {127.0.220.73} {127.0.220.73:9301} {cdfhilmrstw}, {es-b-193} {ZFQdNP4HSgCtPCnMfDdopw} {k0x5Rog-S0CFhAmorJ3V0Q} {127.0.204.193} {127.0.204.193:9301} cdfhilmrstw}, {es-b-74} uzGdH3VeRbOlcDIWeKdgIW} fhNTddq7Syi7zPDVCejDWQ} {127.0.220.74} {127.0.220.74:9301} cdfhilmrstw}, {es-b-194} PhjlCea6TNKh4rdZPyPDkA} 6pQJDRevScad3hr_hJcnTg} {127.0.204.194} {127.0.204.194:9301} {cdfhilmrstw}, {es-b-221} {jaNLhd1eT6SUYcLJkHpE1Q} {XkD-b14XQzq1-o3EiDSuKA} {127.0.220.221} {127.0.220.221: 9301} cdfhilmrstw}, {es-b-195} {1VQFmt9jQ-6d7fCMjP-vnQ} {e_P2NNXnTIWGzg7vE8EWkA} {127.0.204.195} {127.0.204.195:9301} {cdfhilmrstw}] from last known cluster state; Node term 43, last accepted version 397615 in term 43 "," cluster. uuid ":" ArYY qmCTbCQTDUI8ogsBg "," node. id ":" ZFQdNP4HSgCtPCnMfDdopw "}
{"type": "server", "timestamp": "2023-10-23T09:22:111119Z", "level": "WARN", "component": "r. suppressed", "cluster. name": "business log", "node. name": "es-b-193", "message": "path:/_cat/nodes, params: {h=ip, name, heap. percent, heap. current, heap. max, ram. percent, ram. current, ram. max, ram. node. role, master, CPU, load_1m, load_5m, load_15m, disk. used_percent, disk. used, disk. total}", Cluster. uuid ":" ArYy qmCTbCQTDUI8ogsBg "," node. id ":" ZFQdNP4HSgCtPCnMfDdopw ",
Org. lasticsearch. cluster. block. ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master]\ , n "," stream ":" stdout "," time ":" 2023-10-23T09:21:34:490903538Z "}
{"log": "" at org. lasticsearch. cluster. block. ClusterBlocks. globalBlockedException (ClusterBlocks. java: 179
After recovery, the main phenomenon observed while troubleshooting the logs was that node 193 was frequently added and removed:
added {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397620, reason: ApplyCommitRequest{term=43, version=397620, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{127.0.220.73}{127.0.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}
removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397621, reason: ApplyCommitRequest{term=43, version=397621, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{127.0.220.73}{127.0.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}
However, the CPU, memory, and disk IO load on these nodes was not particularly high; it all looked relatively normal.
I don't understand why node 193 keeps being added and removed.
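To back up the "load looks normal" observation from inside the cluster while 193 is flapping, the same _cat/nodes columns as in the logged request, plus a hot-threads snapshot on 193, are useful. Host and HTTP port are placeholders.

```bash
# Resource view of every node, using the same columns as the _cat/nodes call in the logs above.
curl -s 'http://127.0.204.193:9200/_cat/nodes?v&h=ip,name,heap.percent,ram.percent,cpu,load_1m,load_5m,disk.used_percent'
# Snapshot of the busiest threads on es-b-193 around the time it drops out.
curl -s 'http://127.0.204.193:9200/_nodes/es-b-193/hot_threads?threads=5'
```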
I found the following logs on the master node, 194:
root@jiankunking-es-02:~# docker logs es-b-194 | grep node-left
{"type": "server", "timestamp": "2023-10-23T09:33:18,042Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{127.0.220.221}{127.0.220.221:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 399161, delta: removed {{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{127.0.220.221}{127.0.220.221:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:52:35,140Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: followers check retry count exceeded], term: 52, version: 404805, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:53:24,564Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404807, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:53:30,879Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404810, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T02:11:44,481Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: followers check retry count exceeded], term: 52, version: 405235, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T03:22:03,390Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw} reason: followers check retry count exceeded], term: 52, version: 407007, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{127.0.204.193}{127.0.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }