Elasticsearch - a scaling problem? Master timeouts?

I have noticed a new problem.
It started once the Elasticsearch cluster reached 40 or more data nodes.

When I run a larger-scale operation on the cluster, e.g. shard recovery (say I do some reroute, or change the number of replicas on some indices from 0 to 1), Elasticsearch data nodes start to disappear from the cluster (they go offline for a while).
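
For reference, the replica change I mean is something like the following (the index name here is just an example):

    # bump replicas from 0 to 1 on one index - this kicks off peer recoveries
    curl -X PUT "localhost:9200/my-index-000001/_settings" -H 'Content-Type: application/json' -d'
    {
      "index": { "number_of_replicas": 1 }
    }'
    # watch the resulting recoveries
    curl -s "localhost:9200/_cat/recovery?v&active_only=true"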

At the operating system level the Elasticsearch process on each node keeps running the whole time - there is no reboot or anything like that.

I can see, however, that problems appear during these intensive operations.

These problems did not occur with up to 25 data nodes.

The logs show that this is a problem with the master node.

The load on the master node is not very high.
After adding more data nodes the load increased,
but it is still not very high - I think it is acceptable.

But the logs contain information about master timeouts.

Here is an excerpt from the log:

[2020-05-17T07:28:55,818][INFO ][o.e.c.c.JoinHelper       ] [CUSTER-DATA-025] failed to join {CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true} with JoinRequest{sourceNode={CUSTER-DATA-025}{83F8BYx-QzCXdSaA27xxjA}{uFZHaSohSa-4CIaYIWVH7g}{CUSTER-DATA-025}{192.168.100.130:9300}{di}{xpack.installed=true, data=warm, machine_id=M001}, optionalJoin=Optional[Join{term=30, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={CUSTER-DATA-025}{83F8BYx-QzCXdSaA27xxjA}{uFZHaSohSa-4CIaYIWVH7g}{CUSTER-DATA-025}{192.168.100.130:9300}{di}{xpack.installed=true, data=warm, machine_id=M001}, targetNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [CUSTER-MAST-001][192.168.100.44:9300][internal:cluster/coordination/join] request_id [234019] timed out after [59850ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.1.jar:7.5.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-05-17T07:28:57,954][INFO ][o.e.c.s.ClusterApplierService] [CUSTER-DATA-025] master node changed {previous [], current [{CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}]}, removed {{CUSTER-DATA-026}{XiPeGA0qTbqawHa8bLlKQQ}{0bocFQaEQZOzYx1gEL8p-w}{CUSTER-DATA-026}{192.168.100.131:9300}{di}{data=warm, machine_id=M001, xpack.installed=true}}, term: 30, version: 220943, reason: ApplyCommitRequest{term=30, version=220943, sourceNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}
[2020-05-17T07:29:08,152][WARN ][o.e.t.TransportService   ] [CUSTER-DATA-025] Received response for a request that has timed out, sent [72350ms] ago, timed out [12500ms] ago, action [internal:cluster/coordination/join], node [{CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}], id [234019]
[2020-05-17T07:29:12,633][INFO ][o.e.c.s.ClusterApplierService] [CUSTER-DATA-025] added {{CUSTER-DATA-026}{XiPeGA0qTbqawHa8bLlKQQ}{0bocFQaEQZOzYx1gEL8p-w}{CUSTER-DATA-026}{192.168.100.131:9300}{di}{data=warm, machine_id=M001, xpack.installed=true}}, term: 30, version: 220945, reason: ApplyCommitRequest{term=30, version=220945, sourceNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}
[2020-05-17T07:31:16,086][WARN ][o.e.g.IncrementalClusterStateWriter] [CUSTER-DATA-025] writing cluster state took [10208ms] which is above the warn threshold of [10s]; wrote metadata for [0] indices and skipped [2] unchanged indices
[2020-05-17T07:32:06,137][INFO ][o.e.c.s.ClusterApplierService] [CUSTER-DATA-025] removed {{CUSTER-DATA-001}{odNqSKDvSZOxnusPXXf1Bg}{4F2dGWZ9Q6mQlNfhlg3ODQ}{CUSTER-DATA-001}{192.168.100.50:9300}{d}{machine_id=M001, xpack.installed=true},{CUSTER-DATA-031}{0SeGef9KSjiXb0WimSzHPQ}{4awLr9yVRk6-VP2vGlq0Xw}{CUSTER-DATA-031}{192.168.100.136:9300}{di}{data=warm, machine_id=M001, xpack.installed=true}}, term: 30, version: 220951, reason: ApplyCommitRequest{term=30, version=220951, sourceNode={CUSTER-MAST-001}{WPLMpWl2SpGN95Er2zsniA}{wqECBy0GQKCxvXNnNWOSNg}{CUSTER-MAST-001}{192.168.100.44:9300}{m}{machine_id=M004, xpack.installed=true}}
[2020-05-17T07:32:19,357][WARN ][o.e.g.IncrementalClusterStateWriter] [CUSTER-DATA-025] writing cluster state took [11841ms] which is above the warn threshold of [10s]; wrote metadata for [0] indices and skipped [2] unchanged indices

I guess this is a scaling problem.
Where is the limit? Could the performance of the network infrastructure be the limiting factor here?

If you look at the logs across the cluster, do you see any evidence of long and/or frequent GC? How many indices and shards do you have in the cluster?
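
In case it's useful, something like the following gives a quick view of heap pressure and the shard totals (assuming the REST API is reachable on localhost:9200, and that the log path matches your install); long or frequent GC also shows up in each node's log as warnings from JvmGcMonitorService:

    # heap, RAM, CPU and load per node
    curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m"
    # overall shard counts
    curl -s "localhost:9200/_cluster/health?pretty&filter_path=active_primary_shards,active_shards"
    # look for GC warnings on a node (log path is an assumption - adjust for your setup)
    grep "JvmGcMonitorService" /var/log/elasticsearch/*.log | tail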

It also looks like it is taking a long time to write the cluster state to disk, which could indicate that your storage is too slow and has become a bottleneck. What type of hardware and storage does the cluster use?
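
As a first check of disk pressure you could look at the filesystem stats Elasticsearch itself reports for the affected node, and at OS-level IO utilisation there (the iostat call assumes the sysstat package is installed):

    # filesystem stats as seen by Elasticsearch for the affected node
    curl -s "localhost:9200/_nodes/CUSTER-DATA-025/stats/fs?pretty"
    # OS-level view: watch %util and await on the data disk
    iostat -x 1 5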

But in this case, would the storage bottleneck be on DATA-025?
Or could it be a problem with other nodes?

The DATA-025 node has its own dedicated 7.2k RPM SAS hard drive and 64 GB of RAM (total). The network is 1 Gbit Ethernet.

The slowlog on DATA-025 is empty.

Total on cluster: ~150 indices and ~2000 shards.

Writing the cluster state to disk should take a handful of milliseconds, but is actually taking over 10 seconds. This suggests your disks are overloaded or otherwise too slow, and if they're slow enough then nodes will be considered unhealthy and may be removed from the cluster.
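
If you want to put a number on it, one rough way to measure sync-write latency on the data path is an fio run like the one below (the directory is just an example - point it at the disk that holds the Elasticsearch data, and note that it writes real data). Look at the fdatasync latency percentiles reported at the end of the run:

    fio --name=es-sync-test \
        --directory=/var/lib/elasticsearch \
        --rw=write --bs=4k --size=256m \
        --fdatasync=1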

Similarly, this suggests that the master took over a minute to process the join of this node, which should normally take milliseconds, and a common cause of this is slow disks.
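
To see what the master is actually busy with while a node is trying to join, something along these lines can help (the node name is taken from your logs):

    # queued cluster state update tasks on the master
    curl -s "localhost:9200/_cluster/pending_tasks?pretty"
    # what the master's threads are doing right now
    curl -s "localhost:9200/_nodes/CUSTER-MAST-001/hot_threads"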

You may see some benefit from upgrading to 7.6+, as it includes changes that consolidate the IO done during cluster state updates, which may help if your disks are only a little too slow.

