I've run into a new problem.
It started once the Elasticsearch cluster grew to 40 or more data nodes.
When the cluster performs a large-scale operation such as shard recovery (say I run a reroute, or change the number of replicas on some indices from 0 to 1), Elasticsearch data nodes start to disappear from the cluster (they go offline for a while).
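For context, this is the kind of change that triggers it - just a sketch, the host and index name below are placeholders, not my real cluster values:

```python
# Sketch of the operation that triggers the problem: enabling one replica on
# an index, which starts peer recovery for every shard at once.
import requests

ES = "http://localhost:9200"       # any node's HTTP endpoint (placeholder)
INDEX = "my-index-2020.05.17"      # example index name (placeholder)

# Go from 0 to 1 replica: every primary now has to copy itself to another node.
resp = requests.put(
    f"{ES}/{INDEX}/_settings",
    json={"index": {"number_of_replicas": 1}},
)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```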
At the operating-system level the Elasticsearch node keeps running the whole time - there is no reboot or anything like that.
I can see, however, that something goes wrong during these intensive operations.
These problems did not occur with up to 25 data nodes.
The logs point to a problem with the master node.
The load on the master node is not very high.
It did increase after I added more data nodes, but it is still moderate - I would call it acceptable.
Yet the logs keep showing master timeouts.
Here is an excerpt from the log:
[2020-05-17T07:28:55,818][INFO ][o.e.c.c.JoinHelper ] [CUSTER-DATA-025] failed to join {CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true} with JoinRequest{sourceNode={CUSTER-DATA-025}{83F8BYx-QzCXdSaA27xxjA}{uFZHaSohSa-4CIaYIWVH7g}{CUSTER-DATA-025}{192.168.100.130:9300}{di}{xpack.installed=true, data=warm, machine_id=M001}, optionalJoin=Optional[Join{term=30, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={CUSTER-DATA-025}{83F8BYx-QzCXdSaA27xxjA}{uFZHaSohSa-4CIaYIWVH7g}{CUSTER-DATA-025}{192.168.100.130:9300}{di}{xpack.installed=true, data=warm, machine_id=M001}, targetNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [CUSTER-MAST-001][192.168.100.44:9300][internal:cluster/coordination/join] request_id [234019] timed out after [59850ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.5.1.jar:7.5.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.1.jar:7.5.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-05-17T07:28:57,954][INFO ][o.e.c.s.ClusterApplierService] [CUSTER-DATA-025] master node changed {previous [], current [{CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}]}, removed {{CUSTER-DATA-026}{XiPeGA0qTbqawHa8bLlKQQ}{0bocFQaEQZOzYx1gEL8p-w}{CUSTER-DATA-026}{192.168.100.131:9300}{di}{data=warm, machine_id=M001, xpack.installed=true}}, term: 30, version: 220943, reason: ApplyCommitRequest{term=30, version=220943, sourceNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}
[2020-05-17T07:29:08,152][WARN ][o.e.t.TransportService ] [CUSTER-DATA-025] Received response for a request that has timed out, sent [72350ms] ago, timed out [12500ms] ago, action [internal:cluster/coordination/join], node [{CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}], id [234019]
[2020-05-17T07:29:12,633][INFO ][o.e.c.s.ClusterApplierService] [CUSTER-DATA-025] added {{CUSTER-DATA-026}{XiPeGA0qTbqawHa8bLlKQQ}{0bocFQaEQZOzYx1gEL8p-w}{CUSTER-DATA-026}{192.168.100.131:9300}{di}{data=warm, machine_id=M001, xpack.installed=true}}, term: 30, version: 220945, reason: ApplyCommitRequest{term=30, version=220945, sourceNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}
[2020-05-17T07:31:16,086][WARN ][o.e.g.IncrementalClusterStateWriter] [CUSTER-DATA-025] writing cluster state took [10208ms] which is above the warn threshold of [10s]; wrote metadata for [0] indices and skipped [2] unchanged indices
[2020-05-17T07:32:06,137][INFO ][o.e.c.s.ClusterApplierService] [CUSTER-DATA-025] removed {{CUSTER-DATA-001}{odNqSKDvSZOxnusPXXf1Bg}{4F2dGWZ9Q6mQlNfhlg3ODQ}{CUSTER-DATA-001}{192.168.100.50:9300}{d}{machine_id=M001, xpack.installed=true},{CUSTER-DATA-031}{0SeGef9KSjiXb0WimSzHPQ}{4awLr9yVRk6-VP2vGlq0Xw}{CUSTER-DATA-031}{192.168.100.136:9300}{di}{data=warm, machine_id=M001, xpack.installed=true}}, term: 30, version: 220951, reason: ApplyCommitRequest{term=30, version=220951, sourceNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}
[2020-05-17T07:32:19,357][WARN ][o.e.g.IncrementalClusterStateWriter] [CUSTER-DATA-025] writing cluster state took [11841ms] which is above the warn threshold of [10s]; wrote metadata for [0] indices and skipped [2] unchanged indices
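To be clear about what I mean by the master load being acceptable, this is roughly the kind of check I have in mind - a sketch only, the host name is a placeholder and the endpoints are the standard 7.x cluster APIs:

```python
# Sketch: look at the elected master while the problem is happening -
# pending cluster-state tasks plus basic CPU/heap stats.
import requests

ES = "http://localhost:9200"  # placeholder for any node's HTTP endpoint

# Cluster-state update tasks queued on the elected master.
pending = requests.get(f"{ES}/_cluster/pending_tasks").json()
print("pending tasks:", len(pending["tasks"]))
for task in pending["tasks"][:10]:
    print(task["priority"], task["time_in_queue_millis"], task["source"])

# CPU and heap usage on the elected master node.
stats = requests.get(f"{ES}/_nodes/_master/stats/os,jvm").json()
for node in stats["nodes"].values():
    print(node["name"],
          "cpu%:", node["os"]["cpu"]["percent"],
          "load:", node["os"]["cpu"]["load_average"],
          "heap%:", node["jvm"]["mem"]["heap_used_percent"])
```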
I guess this is a scaling problem.
Where is the limit? Could the performance of the network infrastructure be the limiting factor here?
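If it helps with answering, this is how I plan to check which recovery and allocation limits are currently in effect, since they cap how much recovery traffic hits the network at once - again just a sketch with a placeholder host:

```python
# Sketch: read the cluster settings (including defaults) and pick out the
# limits that throttle shard recovery and rebalancing.
import requests

ES = "http://localhost:9200"  # placeholder

settings = requests.get(
    f"{ES}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
).json()

interesting = (
    "indices.recovery.max_bytes_per_sec",                      # per-node recovery bandwidth cap
    "cluster.routing.allocation.node_concurrent_recoveries",   # recoveries per node
    "cluster.routing.allocation.cluster_concurrent_rebalance", # concurrent rebalances
)
for scope in ("persistent", "transient", "defaults"):
    for key, value in settings.get(scope, {}).items():
        if key in interesting:
            print(f"{scope}: {key} = {value}")
```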