[node2] collector [cluster_stats] timed out when collecting data

yeziblo · October 23, 2019, 8:54am

Hi everyone,

I am trying to reindex my data, and the ES cluster turned RED.

The error in the master node's log:

[2019-10-23T16:24:00,179][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node2] collector [cluster_stats] timed out when collecting data
[2019-10-23T16:24:12,903][WARN ][o.e.c.InternalClusterInfoService] [node2] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2019-10-23T16:24:40,153][INFO ][o.e.c.s.MasterService ] [node2] zen-disco-node-failed({node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
[2019-10-23T16:24:46,960][INFO ][o.e.c.s.ClusterApplierService] [node2] removed {{node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node2}{Gy_SbbWKTSS213NYmAshsQ}{wJwenmHLTme7fH9fGrSJdA}{172.16.30.92}{172.16.30.92:9300}{ml.machine_memory=33437806592, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [3320] source [zen-disco-node-failed({node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout]]])
[2019-10-23T16:24:57,903][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node2] failed to execute on node [SPKWfjBDS3OZx89CCGIMWA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node5][172.16.3.84:9300][cluster:monitor/nodes/stats[n]] request_id [1114453] timed out after [15006ms]

My Elasticsearch version is 6.8.0.

I have 5 nodes:
node2 has 31G of memory in total, with 15G for Elasticsearch.
node3 and node4 each have 30G of memory in total, with 15G for Elasticsearch; both have SSDs.
node5 and node6 each have 64G of memory in total, with 30G for Elasticsearch.

node2 is the master node.

The cluster has 1431 indices, 3340 primary shards and 1763 replica shards.

Bernt_Rostad · October 23, 2019, 9:20am

The RED cluster state happened because the master node struggled to collect shard information from all nodes; after three failed ping requests to node5 it gave up collecting the shard info and removed the node from the cluster.

The reason it struggles to collect the cluster information is probably a combination of your reindexing operation and the fact that you have far too many shards in your cluster.

According to this official blog post:

A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured.

From the numbers you've provided, your 5-node cluster has a total Java heap size of 105 GB (2 x 30 + 3 x 15), which means you should aim for fewer than 105 x 20 = 2100 shards in your cluster. You currently have 5103 shards (3340 primary + 1763 replica).
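The arithmetic above can be checked with a few lines of Python. This is only a sketch of the rule of thumb using the heap sizes and shard counts quoted in this thread; the variable names are made up for illustration and nothing here queries a real cluster.

```python
# Heap configured per node in GB, as reported in the original post.
heap_gb = {"node2": 15, "node3": 15, "node4": 15, "node5": 30, "node6": 30}

# Rule-of-thumb ceiling from the blog post: 20 shards per GB of heap.
SHARDS_PER_GB_HEAP = 20

primary_shards = 3340
replica_shards = 1763

total_heap_gb = sum(heap_gb.values())
shard_budget = total_heap_gb * SHARDS_PER_GB_HEAP
current_shards = primary_shards + replica_shards

# The same rule applied per node, since the guideline is stated per node.
per_node_budget = {node: gb * SHARDS_PER_GB_HEAP for node, gb in heap_gb.items()}

print(f"total heap: {total_heap_gb} GB")
print(f"shard budget: {shard_budget}, current shards: {current_shards}")
print(f"over budget by a factor of {current_shards / shard_budget:.2f}")
print(f"per-node budget: {per_node_budget}")
```

Note that the per-node view matters as well: the 15 GB nodes should hold at most about 300 shards each and the 30 GB nodes about 600, regardless of the cluster-wide total.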

So before you continue to reindex, you really should look at reducing the number of shards. Ideally the shard sizes should be in the 20-40 GB range, so you may have to combine many small indices into fewer large ones with fewer shards.
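As a rough illustration of combining small indices, the sketch below builds the two HTTP requests involved: one that creates a destination index with an explicit shard count, and one that calls the `_reindex` API to copy several source indices into it. The cluster URL and the index names (`logs-2019.10.*`, `logs-2019.10`) are assumptions made up for the example, not taken from this thread, and the requests are only constructed here, not sent.

```python
import json
from urllib.request import Request

# Assumption: adjust to your own cluster address.
ES_URL = "http://localhost:9200"
JSON_HEADERS = {"Content-Type": "application/json"}


def create_index_request(dest, shards, replicas):
    """Build a PUT request that creates `dest` with a fixed shard count."""
    body = {
        "settings": {
            "index": {"number_of_shards": shards, "number_of_replicas": replicas}
        }
    }
    return Request(
        f"{ES_URL}/{dest}",
        data=json.dumps(body).encode("utf-8"),
        headers=JSON_HEADERS,
        method="PUT",
    )


def reindex_request(source_pattern, dest):
    """Build a POST to _reindex that copies all matching indices into `dest`."""
    body = {"source": {"index": source_pattern}, "dest": {"index": dest}}
    return Request(
        f"{ES_URL}/_reindex?wait_for_completion=false",
        data=json.dumps(body).encode("utf-8"),
        headers=JSON_HEADERS,
        method="POST",
    )


# Hypothetical example: merge daily indices for one month into a single index.
create_req = create_index_request("logs-2019.10", shards=1, replicas=1)
reindex_req = reindex_request("logs-2019.10.*", "logs-2019.10")

print(create_req.get_method(), create_req.full_url)
print(reindex_req.get_method(), reindex_req.full_url)
```

To run this against a real cluster, pass each request to `urllib.request.urlopen`, creating the destination index first. With `wait_for_completion=false` the reindex runs as a background task, which may help avoid client timeouts on a cluster that is already under load.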

Good luck!

yeziblo · October 23, 2019, 9:41am

@Bernt_Rostad
The purpose of the reindex is to reduce the number of shards in the cluster.
By the way, the total size of my cluster is 18T, and 105 x 20 x 30 = 63T.
So I think the performance of ES should not be too bad. I will continue to reduce the number of shards in the cluster.
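For reference, the estimate above can be spelled out as follows. This is a sketch under two assumptions that the thread does not confirm: that 1T is counted as 1000 GB (which is what makes 105 x 20 x 30 come out to 63T), and that the 18T figure includes replica data.

```python
import math

total_heap_gb = 105        # 2 x 30 + 3 x 15, from the earlier reply
shards_per_gb_heap = 20    # rule-of-thumb ceiling
target_shard_gb = 30       # middle of the suggested 20-40 GB range

data_gb = 18 * 1000        # assumption: 18T counted as 18000 GB, replicas included
current_shards = 3340 + 1763

# Upper bound on data the cluster could hold at the shard limit and target size.
capacity_gb = total_heap_gb * shards_per_gb_heap * target_shard_gb

# What the current layout looks like, and what the target layout would need.
avg_shard_gb = data_gb / current_shards
shards_needed = math.ceil(data_gb / target_shard_gb)

print(f"capacity at the limit: {capacity_gb} GB")
print(f"average shard size today: {avg_shard_gb:.1f} GB")
print(f"shards needed at {target_shard_gb} GB each: {shards_needed}")
```

Under these assumptions the data volume is well within capacity, but the average shard is far below the suggested size, which is consistent with the advice that the shard count, not the data size, is the problem.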
Thanks!

system · November 20, 2019, 9:41am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
