[node2] collector [cluster_stats] timed out when collecting data

Hi everyone,

I am trying to reindex my data, and the ES cluster turned RED.

The errors in the master node's log:

[2019-10-23T16:24:00,179][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node2] collector [cluster_stats] timed out when collecting data
[2019-10-23T16:24:12,903][WARN ][o.e.c.InternalClusterInfoService] [node2] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2019-10-23T16:24:40,153][INFO ][o.e.c.s.MasterService ] [node2] zen-disco-node-failed({node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
[2019-10-23T16:24:46,960][INFO ][o.e.c.s.ClusterApplierService] [node2] removed {{node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node2}{Gy_SbbWKTSS213NYmAshsQ}{wJwenmHLTme7fH9fGrSJdA}{172.16.30.92}{172.16.30.92:9300}{ml.machine_memory=33437806592, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [3320] source [zen-disco-node-failed({node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{node5}{SPKWfjBDS3OZx89CCGIMWA}{JMzdy2BNSAGugxDuGIMX8A}{172.16.3.84}{172.16.3.84:9300}{ml.machine_memory=67436204032, disk=normal, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout]]])
[2019-10-23T16:24:57,903][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node2] failed to execute on node [SPKWfjBDS3OZx89CCGIMWA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node5][172.16.3.84:9300][cluster:monitor/nodes/stats[n]] request_id [1114453] timed out after [15006ms]

My Elasticsearch version is 6.8.0.

I have 5 nodes:
node2 has 31G of memory in total, with 15G for Elasticsearch.
node3 and node4 each have 30G of memory in total, with 15G for Elasticsearch; both have SSDs.
node5 and node6 each have 64G of memory in total, with 30G for Elasticsearch.

node2 is the master node.

The cluster has 1431 indices, 3340 primary shards and 1763 replica shards.

The RED cluster state happened because the master node struggled to collect shard information from all nodes; after three failed ping requests to node5 it gave up collecting the shard info and removed the node from the cluster.

The reason it struggles to collect the cluster information is probably a combination of your reindexing operation and the fact that you have far too many shards in your cluster.

According to this official blog post:

A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured.

From the numbers you've provided, your 5-node cluster has a total Java heap size of 105 GB (2 x 30 + 3 x 15), which means you should aim for fewer than 105 x 20 = 2100 shards in your cluster. You currently have 5103 shards (3340 primary + 1763 replica).
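The arithmetic above can be reproduced with a short Python sketch. The heap sizes and shard counts are taken from the question; the variable names are only illustrative, and the 20-shards-per-GB figure is the rule of thumb quoted from the blog post, not a hard limit enforced by Elasticsearch.

```python
# Heap configured per node in GB, as listed in the question.
heap_gb = {"node2": 15, "node3": 15, "node4": 15, "node5": 30, "node6": 30}

# Rule of thumb from the blog post: stay below 20 shards per GB of heap.
SHARDS_PER_GB_HEAP = 20

# Shard counts reported in the question.
primary_shards = 3340
replica_shards = 1763

total_heap_gb = sum(heap_gb.values())
shard_budget = total_heap_gb * SHARDS_PER_GB_HEAP
current_shards = primary_shards + replica_shards
excess = current_shards - shard_budget

print(f"total heap:     {total_heap_gb} GB")
print(f"shard budget:   {shard_budget}")
print(f"current shards: {current_shards}")
print(f"over budget by: {excess}")
```

With these inputs the cluster holds roughly 2.4 times as many shards as the guideline suggests.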

So before you continue to reindex, you really should look at reducing the number of shards. Ideally the shard sizes should be in the 20-40 GB range, so you may have to combine many small indices into fewer large ones with fewer shards.

Good luck!

@Bernt_Rostad
The purpose of the reindex is to reduce the number of shards in the cluster :pensive:
By the way, the total size of my cluster is 18T, and 105 x 20 x 30 GB = 63T.
So I think the performance of ES should not be too bad once that is done. I will continue to reduce the number of shards in the cluster.
Thanks!
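The capacity estimate in this reply can be checked with a rough Python sketch. It assumes a 30 GB target shard size (the midpoint of the 20-40 GB range suggested above), 1 T = 1000 GB, and that the 18T figure covers all data to be sharded; none of these assumptions are confirmed in the thread.

```python
import math

TOTAL_HEAP_GB = 105        # 2 x 30 + 3 x 15, from the answer above
SHARDS_PER_GB_HEAP = 20    # rule of thumb from the blog post
DATA_TB = 18               # total cluster size stated in the reply
GB_PER_TB = 1000           # assumption: decimal units

max_shards = TOTAL_HEAP_GB * SHARDS_PER_GB_HEAP

# Data the shard budget could hold at a 30 GB target shard size (assumed).
capacity_tb = max_shards * 30 / GB_PER_TB

# Shards needed to hold 18T at each end of the 20-40 GB range and at 30 GB.
data_gb = DATA_TB * GB_PER_TB
shards_needed = {size: math.ceil(data_gb / size) for size in (20, 30, 40)}

print(f"shard budget: {max_shards}")
print(f"capacity at 30 GB per shard: {capacity_tb} T")
for size, count in shards_needed.items():
    print(f"shards needed at {size} GB each: {count}")
```

Under these assumptions, 18T fits in somewhere between 450 and 900 shards, well inside the 2100-shard guideline.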
