ES nodes crashing: failed to send failed shard

dipathak · October 13, 2016, 6:38am

Hi, we are running ES 2.0 on a 4-node cluster. ES is crashing 1-5 times a day on the nodes, for different reasons. Can someone please suggest why this is happening and how to get out of this situation? The crash happens on all 4 nodes, every 6-12 hours. After a crash the cluster works normally: the state becomes green and all shards are in the STARTED state, until the next crash.
Pasting one exception here.
There are other exceptions too, such as ShardNotFoundException and java.nio.file.NoSuchFileException, which can be found here: http://pastebin.com/Y0fPULeL

Thanks in advance.

[2016-10-06 21:05:10,347][WARN ][action.bulk ] [130591932414] [cfileindex][5] failed to perform indices:data/write/bulk[s] on node {130591932198}{E4qFPw3-TvizCAeB6ai_lw}{10.2.34.115}{10.2.34.115:25800}{master=true}
TransportException[transport stopped, action: indices:data/write/bulk[s][r]]
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:198)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-10-06 21:05:10,348][WARN ][cluster.action.shard ] [130591932414] failed to send failed shard to {130591932198}{E4qFPw3-TvizCAeB6ai_lw}{10.2.34.115}{10.2.34.115:25800}{master=true}
SendRequestTransportException[[130591932198][10.2.34.115:25800][internal:cluster/shard/failure]]; nested: TransportException[TransportService is closed stopped can't send request];
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:282)
at org.elasticsearch.cluster.action.shard.ShardStateAction.innerShardFailed(ShardStateAction.java:98)
at org.elasticsearch.cluster.action.shard.ShardStateAction.shardFailed(ShardStateAction.java:88)
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicationPhase$1.handleException(TransportReplicationAction.java:895)
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:198)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: TransportException[TransportService is closed stopped can't send request]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:303)
... 8 more
[2016-10-06 21:05:10,348][WARN ][cluster.action.shard ] [130591932414] failed to send failed shard to {130591932198}{E4qFPw3-TvizCAeB6ai_lw}{10.2.34.115}{10.2.34.115:25800}{master=true}
SendRequestTransportException[[130591932198][10.2.34.115:25800][internal:cluster/shard/failure]]; nested: TransportException[TransportService is closed stopped can't send request];

Christian_Dahlqvist · October 13, 2016, 6:48am

Can you please provide some more details about your cluster? How are the nodes configured? How are they deployed? What is the specification of the nodes? How much data?

dipathak · October 13, 2016, 9:37am

Hi Christian, we deployed ES by downloading the zip file. The ES data directory uses around 50G on each node and there is ample free space. Each node is a 32-core physical machine, but it is a shared environment, so other processes are also running on the machine. The heap is given 3G of dedicated memory. We have specified our settings in the elasticsearch.yml file. We have 2 indexes, one with 5 shards and the other with 10. We create the indices only once and never delete them. Please let me know if any other information is required. Thanks.

====== elasticsearch.yml file =========
cluster.routing.allocation.disk.watermark.low: 20gb
cluster.routing.allocation.disk.watermark.high: 500mb
node.name: 130591932414
node.master: true
node.data: true
discovery.zen.ping.timeout: 5s
discovery.zen.minimum_master_nodes: 2
indices.recovery.max_bytes_per_sec: 80mb
index.refresh_interval: 10s
index.merge.scheduler.max_thread_count: 1
http.port: 25700
transport.tcp.port: 25800
network.bind_host: 0.0.0.0
network.publish_host: 10.2.34.119
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [10.2.34.121,10.2.34.115,10.2.34.117]

Christian_Dahlqvist · October 13, 2016, 9:44am

Can you please provide the output from the _cat/nodes API? Do all the nodes have the same configuration?

dipathak · October 14, 2016, 5:26am

Yes. All the nodes have the same configuration. Thanks !

host        ip          heap.percent ram.percent load  node.role master name
10.2.34.117 10.2.34.117 25           99          27.92 d         m      130591932210
10.2.34.119 10.2.34.119 16           99          53.95 d         m      130591932414
10.2.34.115 10.2.34.115 15           99          30.35 d         *      130591932198
10.2.34.121 10.2.34.121 22           99          39.68 d         m      130591952482
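
In case it helps anyone comparing nodes, here is a minimal Python sketch that turns the _cat/nodes text above into records and lists the nodes by load. The function name parse_cat_nodes is made up for illustration, and the sketch assumes the output includes the header row and that no value contains whitespace.

```python
# Hypothetical helper: parse whitespace-separated _cat/nodes output
# (with a header row) into a list of dicts keyed by column name.
def parse_cat_nodes(text):
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


# The output pasted in this post.
SAMPLE = """
host ip heap.percent ram.percent load node.role master name
10.2.34.117 10.2.34.117 25 99 27.92 d m 130591932210
10.2.34.119 10.2.34.119 16 99 53.95 d m 130591932414
10.2.34.115 10.2.34.115 15 99 30.35 d * 130591932198
10.2.34.121 10.2.34.121 22 99 39.68 d m 130591952482
"""

nodes = parse_cat_nodes(SAMPLE)

# Sort by load, highest first; "*" in the master column marks the elected master.
by_load = sorted(nodes, key=lambda n: float(n["load"]), reverse=True)
for n in by_load:
    role = "elected master" if n["master"] == "*" else "master-eligible"
    print(n["name"], n["ip"], "load=" + n["load"],
          "heap%=" + n["heap.percent"], "ram%=" + n["ram.percent"], role)
```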

Christian_Dahlqvist · October 14, 2016, 5:31am

It may have nothing to do with the errors you are seeing, but as you have 4 master-eligible nodes, minimum_master_nodes should be set to 3 in order to avoid split-brain scenarios. I would recommend eliminating this as a potential issue. Is there anything else in the logs around the time the nodes crash?
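
The value of 3 comes from the usual quorum rule for discovery.zen.minimum_master_nodes: (number of master-eligible nodes / 2) + 1, with the division rounded down. A small Python sketch of that arithmetic follows; the function name is only illustrative and not part of any Elasticsearch API.

```python
# Illustrative helper (not an Elasticsearch API): quorum of master-eligible
# nodes, i.e. floor(n / 2) + 1, as used for discovery.zen.minimum_master_nodes.
def minimum_master_nodes(master_eligible):
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# This cluster has 4 master-eligible nodes (node.master: true on every node).
quorum = minimum_master_nodes(4)
print("discovery.zen.minimum_master_nodes:", quorum)
```

With the current setting of 2, two halves of a 4-node cluster could each reach the configured minimum at the same time, which is the split-brain case this setting is meant to prevent.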
