Random node frequently removed from the cluster

Over the last month I've seen different nodes removed from the cluster, sometimes 10 times a day. The ES version is 7.6.2, and the cluster has 5000+ shards, 6 data nodes with 30 GB heaps, and 3 dedicated master nodes.
When I grep for removed in the Elasticsearch logs, I can find three different reasons:

  • lagging
  • disconnected
  • followers check retry count exceeded

I think the cause may be that there are too many shards in the cluster, but I cannot delete any indices. I tried increasing the default discovery.zen.ping_timeout and discovery.zen.ping_retries settings, but that doesn't seem to help.
I have some questions:

  1. Is the disconnected removal reason caused by the ping mechanism between nodes? If so, can I increase discovery.zen.ping_timeout and discovery.zen.ping_retries as far as possible so that there will be no more disconnected removals?
  2. For the followers check retry count exceeded reason, can I just increase cluster.fault_detection.follower_check.timeout and cluster.fault_detection.follower_check.retry_count to solve it?
  3. For the lagging reason, can I increase cluster.follower_lag.timeout and cluster.publish.timeout to solve it?
  4. A solution on this forum said:

some better handling of nodes with slower disks, which is the usual root cause of lagging nodes.

How can I check whether a slow disk is causing the problem?

  5. If there are other ways to check and verify, please let me know; any opinion is very important to me.

There are too many logs and I don't know which ones are useful. If you need some logs to analyze, please tell me how to pick the relevant ones to paste here.

That version is very old and long past EOL, so you should upgrade as a matter of urgency. You're missing out on over a year and a half of development work, some of which has an impact on cluster stability and resilience. 5000 shards across 6 data nodes is quite a lot and you should probably try to reduce that number, but it shouldn't destabilise the cluster by itself.
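To put rough numbers on that (a back-of-the-envelope sketch; the ~20-shards-per-GB-of-heap figure is the commonly cited guideline for this era of Elasticsearch, not a hard limit):

```python
# Rough shard-budget check for the cluster described above.
# Assumption (hedged): aim for at most ~20 shards per GB of heap per data node.

total_shards = 5000
data_nodes = 6
heap_gb_per_node = 30

shards_per_node = total_shards / data_nodes  # ~833 shards per data node
guideline_max = 20 * heap_gb_per_node        # ~600 shards per data node

print(f"shards per node: {shards_per_node:.0f}, guideline max: {guideline_max}")
```

A node carrying more shards than the guideline suggests is worth shrinking, e.g. by consolidating small indices or tuning rollover, even though it won't by itself destabilise the cluster.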

No, disconnected means a TCP connection closed which is immediately fatal: there are no retries in this case.

Kinda, but these timeouts are already pretty generous. If you permit nodes to be unresponsive for longer, you'll just see other things fail instead, such as client requests timing out. As per the docs:

WARNING: If you adjust these settings then your cluster may not form correctly or may become unstable or intolerant of certain failures.

Again, kinda, but this timeout is already pretty generous, and you'll see other weird behaviour if you permit nodes to take many minutes to apply a cluster state update. The same warning applies to these settings:

WARNING: If you adjust these settings then your cluster may not form correctly or may become unstable or intolerant of certain failures.
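On your fourth question (checking whether slow disks are behind the lagging removals): one low-tech signal on Linux is the cumulative I/O timing in /proc/diskstats on each data node. A minimal sketch (the sample line below is fabricated in the real /proc/diskstats layout; in practice take two readings a few seconds apart and diff them):

```python
# Hedged sketch: estimate average I/O completion latency per block device
# from /proc/diskstats (Linux). A single cumulative reading is shown for
# brevity; deltas between two readings give the latency over that window.

def avg_io_latency_ms(diskstats_text):
    """Return {device: (avg_read_ms, avg_write_ms)} from /proc/diskstats content."""
    result = {}
    for line in diskstats_text.strip().splitlines():
        f = line.split()
        dev = f[2]
        reads, read_ms = int(f[3]), int(f[6])     # reads completed, ms spent reading
        writes, write_ms = int(f[7]), int(f[10])  # writes completed, ms spent writing
        result[dev] = (
            read_ms / reads if reads else 0.0,
            write_ms / writes if writes else 0.0,
        )
    return result

# Fabricated sample line (illustrative values only):
sample = "8 0 sda 1000 0 80000 4000 2000 0 160000 90000 0 5000 94000"
print(avg_io_latency_ms(sample))
```

Sustained multi-millisecond-to-second write latency on the device holding the data path, coinciding with the lagging removals, is the kind of evidence that points at slow disks; tools like iostat -x report the same counters more conveniently.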

Thanks for your answers and suggestions, David. I'll have a try with these settings and upgrade the cluster if none of them work.

To be clear, I think the first thing you should try is upgrading the cluster. If that doesn't work then it would be best to ask again here: adjusting the settings you mentioned is not at all recommended, since it won't fix the underlying issues and you'll just experience different problems.

Yeah, you're right David, I should upgrade to 7.15 as soon as I can. I'll try a rolling upgrade of the cluster, thank you. I'd also like to ask what kinds of causes can lead to "disconnected" (the immediate TCP connection close you mentioned), because I didn't see these problems a few months ago when there was less data.

Generally it means something between the nodes caused a connection to close. However there is an issue (fixed in 7.16.0 by Improve control of outgoing connection lifecycles by DaveCTurner · Pull Request #77295 · elastic/elasticsearch · GitHub) where an otherwise-unstable cluster can sometimes see a node leave with reason disconnected shortly after the same node failed for another reason and then rejoined. If that applies to you then you can fix it by addressing the other causes of the instability.

I'm sorry, I overlooked one thing: around the time of the disconnected log, the departing node's own log shows [gc][118562] overhead, spent [38.8s] collecting in the last [39.6s], which is far too long. Is that the reason why the data node was disconnected?

A long GC doesn't explain disconnected, no, but wow, 38s is a very long GC pause, definitely long enough to cause nodes to drop out of the cluster. How large is the heap on this node? Did you disable swap?
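For what it's worth, you can read the severity straight off that log line: a small sketch parsing the [gc] overhead message quoted above and computing the fraction of wall time spent collecting (the regex assumes the message shape shown in your post):

```python
import re

# Parse the GC overhead line quoted above and compute the fraction of the
# observation window spent in garbage collection.
line = "[gc][118562] overhead, spent [38.8s] collecting in the last [39.6s]"

m = re.search(r"spent \[([\d.]+)s\] collecting in the last \[([\d.]+)s\]", line)
gc_s, window_s = float(m.group(1)), float(m.group(2))
overhead = gc_s / window_s
print(f"GC overhead: {overhead:.0%}")
```

At roughly 98% overhead the node is effectively frozen, so the master's follower checks time out and the node is removed from the cluster.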

Yes, swap is disabled on all nodes, and the heap size on all 6 data nodes is 30 GB. I'm going to replace CMS with G1 to see if it gets better.
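The change I'm planning in jvm.options, roughly (a sketch based on the G1 defaults shipped with later 7.x releases; worth double-checking against the jvm.options file bundled with your own distribution before applying):

```
## Remove (or comment out) the CMS flags:
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly

## Enable G1 instead:
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
```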

What kind of hardware are the nodes deployed on? If they are VMs, can you check whether memory ballooning is in use?

They are physical machines, and no memory ballooning is used.