Hello, everyone!
Our ES cluster has 15 nodes: 3 master nodes, 9 data nodes, and 3 coordinating (client) nodes.
The cluster runs on 3 physical hosts, with 5 ES nodes deployed on each host (1 master, 1 coordinating/client, 3 data). Each host has 46 CPUs and 512 GB of RAM.
Every day, the 3 client nodes leave the cluster at random times and automatically rejoin after about 10 minutes. When the problem occurs, queries and writes are running, but the request volume is low and the hosts have more than enough resources, so this does not look like a resource shortage.
We have been running ping between the hosts and the network looks fine, with no packet loss.
Has anyone encountered a similar problem?
This is the coordinating node log:
[2022-12-29T01:28:50,400][INFO ][o.e.d.z.ZenDiscovery ] [xxxx-001-kzx_client] master_left [{xxxx-003-kzx_master}{Qxixi5PtQbOVI9lUOz94nA}{RozHaYueRsCPj1X_4DxEuA}{xxxx.40}{xxxx.40:9300}{xpack.installed=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2022-12-29T01:28:50,401][WARN ][o.e.d.z.ZenDiscovery ] [xxxx-001-kzx_client] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout)
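To line these events up with the master node log, a small script can pull the timestamp and the fault-detection values out of each `master_left` line. This is only a sketch: it assumes the exact ZenDiscovery line format shown above, and the function name `parse_master_left` and the `earliest_failure` field are hypothetical. The estimate is simply retries × timeout subtracted from the log time, which is a rough bound and not something Elasticsearch itself reports.

```python
import re
from datetime import datetime, timedelta

# Assumes the ZenDiscovery "master_left" line format shown in the log above.
MASTER_LEFT = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[\w+\s*\]\[[^\]]+\]\s*\[(?P<node>[^\]]+)\]\s+"
    r"master_left\s+\[\{(?P<master>[^}]+)\}.*\],\s+reason \[failed to ping, "
    r"tried \[(?P<retries>\d+)\] times, "
    r"each with maximum \[(?P<timeout>\d+)s\] timeout\]"
)


def parse_master_left(line):
    """Return a dict describing a master_left event, or None if no match."""
    match = MASTER_LEFT.search(line.strip())
    if match is None:
        return None

    left_at = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S,%f")
    retries = int(match["retries"])
    timeout_s = int(match["timeout"])

    # Rough bound (assumption): if every ping ran to its full timeout,
    # the first failed ping started about retries * timeout earlier.
    earliest_failure = left_at - timedelta(seconds=retries * timeout_s)

    return {
        "node": match["node"],          # node that logged the event
        "master": match["master"],      # master it lost contact with
        "left_at": left_at,
        "retries": retries,
        "timeout_s": timeout_s,
        "earliest_failure": earliest_failure,
    }
```

Running this over each line of a client node log and printing `earliest_failure` through `left_at` gives the time window to search for in the master node log and in GC or system logs on the same host.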
This is the master node log: