Elasticsearch - a scaling problem? Master timeouts?

I have noticed a new problem.
It started once the Elasticsearch cluster reached 40 or more data nodes.

When I run a larger-scale operation on the cluster, e.g. shard recovery (say I do some reroute, or change the number of replicas on some indices from 0 to 1), Elasticsearch data nodes start to disappear from the cluster (they go offline for a while).
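
For reference, the replica change I mean is something like the following (the index name here is just an example):

    # bump replicas from 0 to 1 on one index - this kicks off peer recoveries
    curl -X PUT "localhost:9200/my-index-000001/_settings" -H 'Content-Type: application/json' -d'
    {
      "index": { "number_of_replicas": 1 }
    }'
    # watch the resulting recoveries
    curl -s "localhost:9200/_cat/recovery?v&active_only=true"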

At the operating system level the Elasticsearch process on each node keeps running the whole time - there is no reboot or anything like that.

I can see, however, that problems appear during these intensive operations.

These problems did not occur with up to 25 data nodes.

The logs show that this is a problem with the master node.

The load on the master node is not very high.
After adding more data nodes the load increased,
but it is still not very high - I think it is acceptable.

But the logs contain information about master timeouts.

Here is an excerpt from the log:

[2020-05-17T07:28:55,818][INFO ][o.e.c.c.JoinHelper       ] [CUSTER-DATA-025] failed to join {CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true} with JoinRequest{sourceNode={CUSTER-DATA-025}{83F8BYx-QzCXdSaA27xxjA}{uFZHaSohSa-4CIaYIWVH7g}{CUSTER-DATA-025}{192.168.100.130:9300}{di}{xpack.installed=true, data=warm, machine_id=M001}, optionalJoin=Optional[Join{term=30, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={CUSTER-DATA-025}{83F8BYx-QzCXdSaA27xxjA}{uFZHaSohSa-4CIaYIWVH7g}{CUSTER-DATA-025}{192.168.100.130:9300}{di}{xpack.installed=true, data=warm, machine_id=M001}, targetNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [CUSTER-MAST-001][192.168.100.44:9300][internal:cluster/coordination/join] request_id [234019] timed out after [59850ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.1.jar:7.5.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-05-17T07:28:57,954][INFO ][o.e.c.s.ClusterApplierService] [CUSTER-DATA-025] master node changed {previous [], current [{CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}]}, removed {{CUSTER-DATA-026}{XiPeGA0qTbqawHa8bLlKQQ}{0bocFQaEQZOzYx1gEL8p-w}{CUSTER-DATA-026}{192.168.100.131:9300}{di}{data=warm, machine_id=M001, xpack.installed=true}}, term: 30, version: 220943, reason: ApplyCommitRequest{term=30, version=220943, sourceNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}
[2020-05-17T07:29:08,152][WARN ][o.e.t.TransportService   ] [CUSTER-DATA-025] Received response for a request that has timed out, sent [72350ms] ago, timed out [12500ms] ago, action [internal:cluster/coordination/join], node [{CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}], id [234019]
[2020-05-17T07:29:12,633][INFO ][o.e.c.s.ClusterApplierService] [CUSTER-DATA-025] added {{CUSTER-DATA-026}{XiPeGA0qTbqawHa8bLlKQQ}{0bocFQaEQZOzYx1gEL8p-w}{CUSTER-DATA-026}{192.168.100.131:9300}{di}{data=warm, machine_id=M001, xpack.installed=true}}, term: 30, version: 220945, reason: ApplyCommitRequest{term=30, version=220945, sourceNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}
[2020-05-17T07:31:16,086][WARN ][o.e.g.IncrementalClusterStateWriter] [CUSTER-DATA-025] writing cluster state took [10208ms] which is above the warn threshold of [10s]; wrote metadata for [0] indices and skipped [2] unchanged indices
[2020-05-17T07:32:06,137][INFO ][o.e.c.s.ClusterApplierService] [CUSTER-DATA-025] removed {{CUSTER-DATA-001}{odNqSKDvSZOxnusPXXf1Bg}{4F2dGWZ9Q6mQlNfhlg3ODQ}{CUSTER-DATA-001}{192.168.100.50:9300}{d}{machine_id=M001, xpack.installed=true},{CUSTER-DATA-031}{0SeGef9KSjiXb0WimSzHPQ}{4awLr9yVRk6-VP2vGlq0Xw}{CUSTER-DATA-031}{192.168.100.136:9300}{di}{data=warm, machine_id=M001, xpack.installed=true}}, term: 30, version: 220951, reason: ApplyCommitRequest{term=30, version=220951, sourceNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}
[2020-05-17T07:32:19,357][WARN ][o.e.g.IncrementalClusterStateWriter] [CUSTER-DATA-025] writing cluster state took [11841ms] which is above the warn threshold of [10s]; wrote metadata for [0] indices and skipped [2] unchanged indices

I guess this is a scaling problem.
Where is the limit? Could the performance of the network infrastructure be the limiting factor here?

If you look at the logs across the cluster, do you see any evidence of long and/or frequent GC? How many indices and shards do you have in the cluster?
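
In case it's useful, something like the following gives a quick view of heap pressure and the shard totals (assuming the REST API is reachable on localhost:9200, and that the log path matches your install); long or frequent GC also shows up in each node's log as warnings from JvmGcMonitorService:

    # heap, RAM, CPU and load per node
    curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m"
    # overall shard counts
    curl -s "localhost:9200/_cluster/health?pretty&filter_path=active_primary_shards,active_shards"
    # look for GC warnings on a node (log path is an assumption - adjust for your setup)
    grep "JvmGcMonitorService" /var/log/elasticsearch/*.log | tail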

It also looks like it is taking a long time to write the cluster state to disk, which could indicate that your storage is too slow and has become a bottleneck. What type of hardware and storage does the cluster use?
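
As a first check of disk pressure you could look at the filesystem stats Elasticsearch itself reports for the affected node, and at OS-level IO utilisation there (the iostat call assumes the sysstat package is installed):

    # filesystem stats as seen by Elasticsearch for the affected node
    curl -s "localhost:9200/_nodes/CUSTER-DATA-025/stats/fs?pretty"
    # OS-level view: watch %util and await on the data disk
    iostat -x 1 5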

But in this case, would the storage bottleneck be on DATA-025?
Or could it be a problem with other nodes?

The DATA-025 node has its own dedicated 7.2k RPM SAS hard drive and 64 GB of RAM (total). The network is 1 Gbit Ethernet.

The slowlog on DATA-025 is empty.

Total on cluster: ~150 indices and ~2000 shards.

Writing the cluster state to disk should take a handful of milliseconds, but is actually taking over 10 seconds. This suggests your disks are overloaded or otherwise too slow, and if they're slow enough then nodes will be considered unhealthy and may be removed from the cluster.
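
If you want to put a number on it, one rough way to measure sync-write latency on the data path is an fio run like the one below (the directory is just an example - point it at the disk that holds the Elasticsearch data, and note that it writes real data). Look at the fdatasync latency percentiles reported at the end of the run:

    fio --name=es-sync-test \
        --directory=/var/lib/elasticsearch \
        --rw=write --bs=4k --size=256m \
        --fdatasync=1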

Similarly, this suggests that the master took over a minute to process the join of this node, which should normally take milliseconds, and a common cause of this is slow disks.
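
To see what the master is actually busy with while a node is trying to join, something along these lines can help (the node name is taken from your logs):

    # queued cluster state update tasks on the master
    curl -s "localhost:9200/_cluster/pending_tasks?pretty"
    # what the master's threads are doing right now
    curl -s "localhost:9200/_nodes/CUSTER-MAST-001/hot_threads"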

You may see some benefit from upgrading to 7.6+, as it includes changes that consolidate the IO done during cluster state updates, which may help if your disks are only a little too slow.

