I have a cluster with 3 master nodes and 40 data nodes (d1, d2, ..., d40).
The first 5 data nodes also have the voting-only master role.
Only the following data nodes have periodic abnormal behavior:
These nodes disconnect from the cluster every ~65 mins and rejoin after ~10 mins.
This chart shows the node count over time (Jul 1-11):
High values are the expected number of nodes (40); low values are the number remaining after the nodes disconnect (31).
This chart shows the minutes elapsed between consecutive changes in the node count (Jul 1-11):
High values (~65 mins) are how long the nodes stay connected; low values (~11 mins) are how long they stay disconnected.
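
For anyone who wants to reproduce the second chart, the intervals can be derived from per-minute node counts roughly as follows. This is only a minimal sketch: the function name `change_intervals` and the in-memory sample data are hypothetical, and it assumes the samples are already parsed into (timestamp, count) pairs, which is not necessarily the format of my log files.

```python
from datetime import datetime, timedelta


def change_intervals(samples):
    """Return (timestamp, new_count, minutes_since_previous_change) tuples.

    samples: iterable of (datetime, int) pairs sorted by time, e.g. one per minute.
    The first observed change has no earlier change to measure from, so it is skipped.
    """
    intervals = []
    prev_count = None
    last_change = None
    for ts, count in samples:
        if prev_count is not None and count != prev_count:
            if last_change is not None:
                minutes = (ts - last_change).total_seconds() / 60
                intervals.append((ts, count, minutes))
            last_change = ts
        prev_count = count
    return intervals


# Hypothetical sample data: nodes drop at minute 5, rejoin at minute 15,
# and drop again at minute 80.
base = datetime(2023, 7, 1, 0, 0)
samples = [
    (base + timedelta(minutes=m), 31 if (5 <= m < 15 or m >= 80) else 40)
    for m in range(90)
]
result = change_intervals(samples)
for ts, count, minutes in result:
    print(ts.isoformat(), count, minutes)
```

In this made-up series, the change back to 40 comes 10 minutes after the drop and the next drop comes 65 minutes after that, which is the same shape as the pattern in the charts.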
The elasticsearch.yml and jvm.options files are the same on all data nodes.
Before the nodes disconnect, the following error is logged in the master node's log file:
[2023-07-12T00:10:12,946][ERROR][o.e.x.m.c.i.IndexStatsCollector] [m01] collector [index-stats] timed out when collecting data: nodes [Hs20tBbARLmfVIwfl_uq6g, aZFlTvfKR3KgoAwa9gHdLA, 4_Lx62u9Qsqwbzwz0A496Q, OJQBrR0URo2R95j7epmyag, CT0jbNdlQsypPonefjrrVw, wDIuVUurTZyfTbd-KZUkaw, TYepUP6qQpWbpXZQax8K5Q, wfUp9qdXQsqysmrQ1Bsl6A, UdKxMlRcTASjVujQr9EM4w] did not respond within [10s]
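
To compare the node IDs in this error against the list of disconnected nodes, the IDs can be pulled out of the log line with something like the sketch below. The helper name `timed_out_node_ids` and the regular expression are my own assumptions based on this one log line, not an official log format.

```python
import re

# The error line from the master node log, copied verbatim.
LOG_LINE = (
    "[2023-07-12T00:10:12,946][ERROR][o.e.x.m.c.i.IndexStatsCollector] [m01] "
    "collector [index-stats] timed out when collecting data: nodes "
    "[Hs20tBbARLmfVIwfl_uq6g, aZFlTvfKR3KgoAwa9gHdLA, 4_Lx62u9Qsqwbzwz0A496Q, "
    "OJQBrR0URo2R95j7epmyag, CT0jbNdlQsypPonefjrrVw, wDIuVUurTZyfTbd-KZUkaw, "
    "TYepUP6qQpWbpXZQax8K5Q, wfUp9qdXQsqysmrQ1Bsl6A, UdKxMlRcTASjVujQr9EM4w] "
    "did not respond within [10s]"
)


def timed_out_node_ids(line):
    """Extract the node IDs listed in an index-stats collector timeout line.

    Returns an empty list if the line does not match the assumed pattern.
    """
    match = re.search(r"nodes \[([^\]]*)\] did not respond", line)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split(",") if part.strip()]


node_ids = timed_out_node_ids(LOG_LINE)
print(len(node_ids), node_ids)
```

The line lists 9 node IDs, which matches the size of the drop in node count (40 to 31). The IDs can be mapped to node names with `GET _cat/nodes?v&full_id=true&h=id,name`.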
Config files and stats are here.
Note: In elasticsearch.yml, discovery.seed_hosts has 48 entries because I'm planning to add 5 new data nodes, but their installation is not complete yet.
I collect the node count every minute. You can see this data in the number_of_data_nodes_2023.07.log file.
I also have the list of disconnected nodes by minute in the disconnected_nodes_2023.07.log file.
Where should I look to fix this problem?