Elasticsearch version:
2.4.4
Plugins installed: [analysis-ik, graph, kopf, license, marvel-agent, repository-hdfs]
JVM version:
Oracle JDK 1.8.0_66
OS version:
Debian 8
Description of the problem including expected versus actual behavior:
From time to time a node drops out of the cluster and rejoins after a few minutes, triggering recovery across the whole cluster.
Steps to reproduce:
1. Triggered by operations such as bulk deletion of indices or relocating shards, and sometimes with no special operation at all
2. Marvel shows the index rate dropping to zero
3. System monitoring shows disk IOPS dropping to zero while load stays high
4. Executing _cat/indices on any node just hangs
5. After a few minutes everything returns to normal
6. The problem is limited to a few specific nodes. It first appeared two weeks ago; before that the cluster was normal. One of the affected nodes was taken offline and its RAID was rebuilt, after which it returned to normal
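While the cluster is in the stuck state, a few other APIs usually still respond even when _cat/indices hangs, which can help tell a transport problem from a disk problem. A minimal sketch of the calls I run (host/port are assumptions for a default setup; adjust to your cluster):

```shell
# Diagnostic calls for a hung Elasticsearch 2.x cluster.
# --max-time bounds each call so a stuck endpoint doesn't block the script.
ES=http://localhost:9200   # assumed HTTP endpoint, adjust as needed

# Does the cat API hang? (the symptom reported above)
curl -s --max-time 10 "$ES/_cat/indices?v"

# Hot threads often still answer and show what data nodes are busy with
curl -s --max-time 10 "$ES/_nodes/hot_threads?threads=5"

# A long pending-tasks queue points at a blocked cluster-state update
curl -s --max-time 10 "$ES/_cluster/pending_tasks?pretty"

# Ongoing recoveries, to correlate with the RecoveryFailedException below
curl -s --max-time 10 "$ES/_cat/recovery?v"
```

These endpoints all exist in 2.4; the pending-tasks output in particular is useful here, since the logs below show cluster state updates taking over a minute.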
Provide logs (if relevant):
Received response for a request that has timed out, sent [24704ms] ago, timed out [9704ms] ago, action [cluster:monitor/nodes/stats[n]], node [{elk-edata01-104_hot}{Jj7XccXoTsa8RgcTn63AOw}{1x245}{10.x1.245:9301}{zone=hot, group=small, master=false}], id [32139009]
failed to execute on node [Jj7XccXoTsa8RgcTn63AOw]
ReceiveTimeoutTransportException[[elk-edata01-104_hot][10x.245:9301][cluster:monitor/nodes/stats[n]] request_id [32139009] timed out after [15000ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:698)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
cluster state update task [shard-failed ([websvr_cachelog-2017.x.1x4][0], node[XauWrlHjT7KmZUIUrHgRCg], relocating [MInCL79VR-CGIGrHEQMGlA], [P], v[10], s[INITIALIZING], a[id=MwKQ9776RMG1gt8L2d5Y_w, rId=uSaQpLHDTRu62oXqm7TUgg], expected_shard_size[16399523701]), message [failed recovery]] took 30s above the warn threshold of 30s
cluster state update task [zen-disco-receive(from master [{elk-edata02-104_master}{IMTZEFOWQk6s11vj2X5wog}{10.x.246}{10.17x46:9300}{data=false, zone=master, master=true}])] took 1.4m above the warn threshold of 30s
timed out waiting for all nodes to process published state [63897] (timeout [30s], pending nodes: [{elk-edata01-104_hot}{Jj7XccXoTsa8RgcTn63AOw}{10.1x1.245}{10.x31.245:9301}{zone=hot, group=small, master=false}])
[ngrtc_applog-2017.04.08][0] received shard failed for target shard [[ngrtc_applog-2017.04.08][0], node[935QVlnoQYa5yRc0iGNotg], relocating [MInCL79VR-CGIGrHEQMGlA], [P], v[37], s[INITIALIZING], a[id=vXcHYEuuSR23SGjPJxB7UA, rId=gPKCaNt7RT2KBlJVrwNXEw], expected_shard_size[19883651556]], indexUUID [H7M-a4tsSDiSAWQyYtayEA], message [failed recovery], failure [RecoveryFailedException[[ngrtc_applog-2017.04.08][0]: Recovery failed from {elk-edata03-104_hot}{MInCL79VR-CGIGrHEQMGlA}{10.170.31.247}{10.170.31.247:9300}{zone=hot, master=false} into {elk-edata05-104_hot}{935QVlnoQYa5yRc0iGNotg}{10.170.31.249}{10.170.31.249:9301}{zone=hot, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]
RecoveryFailedException[[ngrtc_applog-2017.04.08][0]: Recovery failed from {elk-edata03-104_hot}{MInCL79VR-CGIGrHEQMGlA}{10.17x.247}{10.17x.247:9300}{zone=hot, master=false} into {elk-edata05-104_hot}{935QVlnoQYa5yRc0iGNotg}{10.170.31.249}{10.170x49:9301}{zone=hot, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
[[ngrtc_applog-2017.04.08][0]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[ngrtc_applog-2017.04.08][0]: Recovery failed from {elk-edata03-104_hot}{MInCL79VR-CGIGrHEQMGlA}{10.170.31.247}{10.170.31.247:9300}{zone=hot, master=false} into {elk-edata05-104_hot}{935QVlnoQYa5yRc0iGNotg}{10.1x1.249}{10.x1.249:9301}{zone=hot, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
... 5 more
The stuck state did come back later; the symptoms disappeared again after about three minutes.