A node "Received response for a request that has timed out, sent [24704ms] ago, timed out [9704ms] ago, action [cluster: monitor / nodes / stats [n]]," stuck entire cluster

Elasticsearch version:
2.4.4
Plugins installed: analysis-ik, graph, kopf, license, marvel-agent, repository-hdfs
JVM version:
1.8.0_66
OS version:
Debian 8
Description of the problem including expected versus actual behavior:
From time to time a single node brings the entire cluster to a halt, or drops out of the cluster and rejoins it after a few minutes.
Steps to reproduce:
1. The trigger varies: bulk deletion of indices, shard relocation, or sometimes no special operation at all.
2. Marvel shows the indexing rate dropping to zero.
3. System monitoring shows disk IOPS dropping to zero while load stays high.
4. Running _cat/indices against any node just hangs (see the probe sketch after this list).
5. After a few minutes everything returns to normal.
6. The problem is limited to a specific few nodes. It first appeared two weeks ago; before that the cluster was fine. One of the affected nodes went back to normal after being taken offline and having its RAID rebuilt.
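
For reference, a minimal probe sketch for step 4 (not part of the original report): it polls _cat/indices and the hot_threads API with a short client-side timeout so the probe itself never hangs. The endpoint http://localhost:9200 is an assumption; point it at any node in the cluster.

import requests

ES = "http://localhost:9200"       # assumed address of any cluster node
TIMEOUT = 10                       # seconds; keep short so the probe itself never hangs

def probe(path):
    # Hit one diagnostic endpoint and report whether it answers in time.
    try:
        r = requests.get(ES + path, timeout=TIMEOUT)
        print(path, "->", r.status_code, len(r.text), "bytes")
    except requests.exceptions.Timeout:
        print(path, "-> no answer within", TIMEOUT, "s (cluster looks stuck)")

probe("/_cat/indices?v")                 # hangs during the incident described in step 4
probe("/_nodes/hot_threads?threads=5")   # shows what the busy threads are doing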

Provide logs (if relevant):

Received response for a request that has timed out, sent [24704ms] ago, timed out [9704ms] ago, action [cluster:monitor/nodes/stats[n]], node [{elk-edata01-104_hot}{Jj7XccXoTsa8RgcTn63AOw}{1x245}{10.x1.245:9301}{zone=hot, group=small, master=false}], id [32139009]
failed to execute on node [Jj7XccXoTsa8RgcTn63AOw]
ReceiveTimeoutTransportException[[elk-edata01-104_hot][10x.245:9301][cluster:monitor/nodes/stats[n]] request_id [32139009] timed out after [15000ms]]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:698)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
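
A sketch of how the timed-out stats call above could be reproduced directly against the suspect node over HTTP. The host name is a placeholder (the 9301 address in the log is the transport port, not the HTTP port), so this is only an illustration of the check, not the exact call the master makes.

import time
import requests

# "suspect-node" is a placeholder for the HTTP address of the node named in the log.
NODE = "http://suspect-node:9200"

start = time.time()
try:
    # Roughly the same data the master asks for via cluster:monitor/nodes/stats[n],
    # fetched over HTTP from that node alone.
    r = requests.get(NODE + "/_nodes/_local/stats/os,fs,jvm", timeout=15)
    print("node stats answered in %.1fs" % (time.time() - start))
except requests.exceptions.Timeout:
    print("no answer within 15s, matching the transport timeout in the log above")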

cluster state update task [shard-failed ([websvr_cachelog-2017.x.1x4][0], node[XauWrlHjT7KmZUIUrHgRCg], relocating [MInCL79VR-CGIGrHEQMGlA], [P], v[10], s[INITIALIZING], a[id=MwKQ9776RMG1gt8L2d5Y_w, rId=uSaQpLHDTRu62oXqm7TUgg], expected_shard_size[16399523701]), message [failed recovery]] took 30s above the warn threshold of 30s
 
 cluster state update task [zen-disco-receive(from master [{elk-edata02-104_master}{IMTZEFOWQk6s11vj2X5wog}{10.x.246}{10.17x46:9300}{data=false, zone=master, master=true}])] took 1.4m above the warn threshold of 30s
 
timed out waiting for all nodes to process published state [63897] (timeout [30s], pending nodes: [{elk-edata01-104_hot}{Jj7XccXoTsa8RgcTn63AOw}{10.1x1.245}{10.x31.245:9301}{zone=hot, group=small, master=false}])
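
A small sketch for watching the backlog behind the publish timeout above: the pending cluster tasks and cluster health APIs show how many state updates are queued and whether nodes have dropped out. localhost:9200 is again an assumed endpoint.

import requests

ES = "http://localhost:9200"   # assumed; any node can answer, the master serves the data

# Cluster state update tasks queue up when a node fails to ack published states.
pending = requests.get(ES + "/_cluster/pending_tasks", timeout=10).json()
for task in pending.get("tasks", []):
    print(task["time_in_queue_millis"], "ms", task["priority"], task["source"])

# Health shows node count and the number of queued cluster state tasks.
health = requests.get(ES + "/_cluster/health", timeout=10).json()
print("status:", health["status"],
      "nodes:", health["number_of_nodes"],
      "pending tasks:", health["number_of_pending_tasks"])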

[ngrtc_applog-2017.04.08][0] received shard failed for target shard [[ngrtc_applog-2017.04.08][0], node[935QVlnoQYa5yRc0iGNotg], relocating [MInCL79VR-CGIGrHEQMGlA], [P], v[37], s[INITIALIZING], a[id=vXcHYEuuSR23SGjPJxB7UA, rId=gPKCaNt7RT2KBlJVrwNXEw], expected_shard_size[19883651556]], indexUUID [H7M-a4tsSDiSAWQyYtayEA], message [failed recovery], failure [RecoveryFailedException[[ngrtc_applog-2017.04.08][0]: Recovery failed from {elk-edata03-104_hot}{MInCL79VR-CGIGrHEQMGlA}{10.170.31.247}{10.170.31.247:9300}{zone=hot, master=false} into {elk-edata05-104_hot}{935QVlnoQYa5yRc0iGNotg}{10.170.31.249}{10.170.31.249:9301}{zone=hot, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]
RecoveryFailedException[[ngrtc_applog-2017.04.08][0]: Recovery failed from {elk-edata03-104_hot}{MInCL79VR-CGIGrHEQMGlA}{10.17x.247}{10.17x.247:9300}{zone=hot, master=false} into {elk-edata05-104_hot}{935QVlnoQYa5yRc0iGNotg}{10.170.31.249}{10.170x49:9301}{zone=hot, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
	at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]

[[ngrtc_applog-2017.04.08][0]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[ngrtc_applog-2017.04.08][0]: Recovery failed from {elk-edata03-104_hot}{MInCL79VR-CGIGrHEQMGlA}{10.170.31.247}{10.170.31.247:9300}{zone=hot, master=false} into {elk-edata05-104_hot}{935QVlnoQYa5yRc0iGNotg}{10.1x1.249}{10.x1.249:9301}{zone=hot, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
	at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
	... 5 more
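
To watch relocations that stall like the one above, a sketch polling the indices recovery API for active recoveries only (assumed endpoint; run it twice a minute apart and compare the output):

import requests

ES = "http://localhost:9200"   # assumed

# Active recoveries only; a stalled relocation shows no progress between two polls.
recoveries = requests.get(ES + "/_recovery?active_only=true", timeout=10).json()
for index, data in recoveries.items():
    for shard in data["shards"]:
        print(index, "shard", shard["id"], shard["stage"],
              shard["source"].get("name"), "->", shard["target"].get("name"))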

The flame graph from the stuck period only came out later; by then the symptoms had already disappeared about three minutes earlier.

How many nodes in your cluster, how many indices and shards, what is the total amount of data in GB/TB?

15 data nodes, 912 indices, 1620 shards. It may be a hardware problem.
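
Since a hardware problem is suspected, a sketch comparing load and free disk across nodes via the nodes stats API (assumed endpoint); a node whose disk stalls while load climbs stands out here:

import requests

ES = "http://localhost:9200"   # assumed

# Compare OS load and free disk across nodes; a node whose disk has stopped
# responding while load climbs points at a disk/RAID problem on that host.
stats = requests.get(ES + "/_nodes/stats/os,fs", timeout=10).json()
for node_id, node in stats["nodes"].items():
    load = node.get("os", {}).get("load_average")
    free_gb = sum(d["available_in_bytes"] for d in node.get("fs", {}).get("data", [])) / 1e9
    print("%-25s load=%s free_disk=%.0fGB" % (node["name"], load, free_gb))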
