A node "Received response for a request that has timed out, sent [24704ms] ago, timed out [9704ms] ago, action [cluster: monitor / nodes / stats [n]]," stuck entire cluster

Elasticsearch version:
2.4.4
Plugins installed: analysis-ik, graph, kopf, license, marvel-agent, repository-hdfs
JVM version:
1.8.0_66
OS version:
Debian 8
Description of the problem including expected versus actual behavior:
From time to time a single node brings the entire cluster to a halt, or drops out of the cluster and rejoins it after a few minutes.
Steps to reproduce:
1. The trigger varies: bulk deletion of indices, shard relocation, or sometimes no special operation at all.
2. Marvel shows the indexing rate dropping to zero.
3. System monitoring shows disk IOPS dropping to zero while load stays high.
4. Running _cat/indices against any node just hangs (see the probe sketch after this list).
5. After a few minutes everything returns to normal.
6. The problem is limited to a specific few nodes. It first appeared two weeks ago; before that the cluster was fine. One of the affected nodes went back to normal after being taken offline and having its RAID rebuilt.
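
For reference, a minimal probe sketch for step 4 (not part of the original report): it polls _cat/indices and the hot_threads API with a short client-side timeout so the probe itself never hangs. The endpoint http://localhost:9200 is an assumption; point it at any node in the cluster.

import requests

ES = "http://localhost:9200"       # assumed address of any cluster node
TIMEOUT = 10                       # seconds; keep short so the probe itself never hangs

def probe(path):
    # Hit one diagnostic endpoint and report whether it answers in time.
    try:
        r = requests.get(ES + path, timeout=TIMEOUT)
        print(path, "->", r.status_code, len(r.text), "bytes")
    except requests.exceptions.Timeout:
        print(path, "-> no answer within", TIMEOUT, "s (cluster looks stuck)")

probe("/_cat/indices?v")                 # hangs during the incident described in step 4
probe("/_nodes/hot_threads?threads=5")   # shows what the busy threads are doing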

Provide logs (if relevant):

Received response for a request that has timed out, sent [24704ms] ago, timed out [9704ms] ago, action [cluster:monitor/nodes/stats[n]], node [{elk-edata01-104_hot}{Jj7XccXoTsa8RgcTn63AOw}{1x245}{10.x1.245:9301}{zone=hot, group=small, master=false}], id [32139009]
failed to execute on node [Jj7XccXoTsa8RgcTn63AOw]
ReceiveTimeoutTransportException[[elk-edata01-104_hot][10x.245:9301][cluster:monitor/nodes/stats[n]] request_id [32139009] timed out after [15000ms]]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:698)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
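
A sketch of how the timed-out stats call above could be reproduced directly against the suspect node over HTTP. The host name is a placeholder (the 9301 address in the log is the transport port, not the HTTP port), so this is only an illustration of the check, not the exact call the master makes.

import time
import requests

# "suspect-node" is a placeholder for the HTTP address of the node named in the log.
NODE = "http://suspect-node:9200"

start = time.time()
try:
    # Roughly the same data the master asks for via cluster:monitor/nodes/stats[n],
    # fetched over HTTP from that node alone.
    r = requests.get(NODE + "/_nodes/_local/stats/os,fs,jvm", timeout=15)
    print("node stats answered in %.1fs" % (time.time() - start))
except requests.exceptions.Timeout:
    print("no answer within 15s, matching the transport timeout in the log above")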

cluster state update task [shard-failed ([websvr_cachelog-2017.x.1x4][0], node[XauWrlHjT7KmZUIUrHgRCg], relocating [MInCL79VR-CGIGrHEQMGlA], [P], v[10], s[INITIALIZING], a[id=MwKQ9776RMG1gt8L2d5Y_w, rId=uSaQpLHDTRu62oXqm7TUgg], expected_shard_size[16399523701]), message [failed recovery]] took 30s above the warn threshold of 30s
 
 cluster state update task [zen-disco-receive(from master [{elk-edata02-104_master}{IMTZEFOWQk6s11vj2X5wog}{10.x.246}{10.17x46:9300}{data=false, zone=master, master=true}])] took 1.4m above the warn threshold of 30s
 
timed out waiting for all nodes to process published state [63897] (timeout [30s], pending nodes: [{elk-edata01-104_hot}{Jj7XccXoTsa8RgcTn63AOw}{10.1x1.245}{10.x31.245:9301}{zone=hot, group=small, master=false}])
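
A small sketch for watching the backlog behind the publish timeout above: the pending cluster tasks and cluster health APIs show how many state updates are queued and whether nodes have dropped out. localhost:9200 is again an assumed endpoint.

import requests

ES = "http://localhost:9200"   # assumed; any node can answer, the master serves the data

# Cluster state update tasks queue up when a node fails to ack published states.
pending = requests.get(ES + "/_cluster/pending_tasks", timeout=10).json()
for task in pending.get("tasks", []):
    print(task["time_in_queue_millis"], "ms", task["priority"], task["source"])

# Health shows node count and the number of queued cluster state tasks.
health = requests.get(ES + "/_cluster/health", timeout=10).json()
print("status:", health["status"],
      "nodes:", health["number_of_nodes"],
      "pending tasks:", health["number_of_pending_tasks"])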

[ngrtc_applog-2017.04.08][0] received shard failed for target shard [[ngrtc_applog-2017.04.08][0], node[935QVlnoQYa5yRc0iGNotg], relocating [MInCL79VR-CGIGrHEQMGlA], [P], v[37], s[INITIALIZING], a[id=vXcHYEuuSR23SGjPJxB7UA, rId=gPKCaNt7RT2KBlJVrwNXEw], expected_shard_size[19883651556]], indexUUID [H7M-a4tsSDiSAWQyYtayEA], message [failed recovery], failure [RecoveryFailedException[[ngrtc_applog-2017.04.08][0]: Recovery failed from {elk-edata03-104_hot}{MInCL79VR-CGIGrHEQMGlA}{10.170.31.247}{10.170.31.247:9300}{zone=hot, master=false} into {elk-edata05-104_hot}{935QVlnoQYa5yRc0iGNotg}{10.170.31.249}{10.170.31.249:9301}{zone=hot, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]
RecoveryFailedException[[ngrtc_applog-2017.04.08][0]: Recovery failed from {elk-edata03-104_hot}{MInCL79VR-CGIGrHEQMGlA}{10.17x.247}{10.17x.247:9300}{zone=hot, master=false} into {elk-edata05-104_hot}{935QVlnoQYa5yRc0iGNotg}{10.170.31.249}{10.170x49:9301}{zone=hot, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
	at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]

[[ngrtc_applog-2017.04.08][0]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[ngrtc_applog-2017.04.08][0]: Recovery failed from {elk-edata03-104_hot}{MInCL79VR-CGIGrHEQMGlA}{10.170.31.247}{10.170.31.247:9300}{zone=hot, master=false} into {elk-edata05-104_hot}{935QVlnoQYa5yRc0iGNotg}{10.1x1.249}{10.x1.249:9301}{zone=hot, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
	at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
	... 5 more
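
To watch relocations that stall like the one above, a sketch polling the indices recovery API for active recoveries only (assumed endpoint; run it twice a minute apart and compare the output):

import requests

ES = "http://localhost:9200"   # assumed

# Active recoveries only; a stalled relocation shows no progress between two polls.
recoveries = requests.get(ES + "/_recovery?active_only=true", timeout=10).json()
for index, data in recoveries.items():
    for shard in data["shards"]:
        print(index, "shard", shard["id"], shard["stage"],
              shard["source"].get("name"), "->", shard["target"].get("name"))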

The flame graph from the stuck period only came out later; by then the symptoms had already disappeared about three minutes earlier.

How many nodes in your cluster, how many indices and shards, what is the total amount of data in GB/TB?

15 data nodes, 912 indices, 1620 shards. It may be a hardware problem.
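
Since a hardware problem is suspected, a sketch comparing load and free disk across nodes via the nodes stats API (assumed endpoint); a node whose disk stalls while load climbs stands out here:

import requests

ES = "http://localhost:9200"   # assumed

# Compare OS load and free disk across nodes; a node whose disk has stopped
# responding while load climbs points at a disk/RAID problem on that host.
stats = requests.get(ES + "/_nodes/stats/os,fs", timeout=10).json()
for node_id, node in stats["nodes"].items():
    load = node.get("os", {}).get("load_average")
    free_gb = sum(d["available_in_bytes"] for d in node.get("fs", {}).get("data", [])) / 1e9
    print("%-25s load=%s free_disk=%.0fGB" % (node["name"], load, free_gb))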
