ES nodes crashing: failed to send failed shard


(Dinesh) #1

Hi, we are running ES 2.0 on a 4-node cluster. ES crashes on the nodes 1-5 times a day, for different reasons. Can someone please suggest why this is happening and how to get out of this situation? The crashes happen on all 4 nodes, every 6-12 hours. After a crash the cluster works normally: the state becomes green and all shards are in the STARTED state until the next crash.
I am pasting one exception here.
There are other exceptions as well, such as ShardNotFoundException and java.nio.file.NoSuchFileException, which can be found here: http://pastebin.com/Y0fPULeL.

Thanks in advance.

[2016-10-06 21:05:10,347][WARN ][action.bulk ] [130591932414] [cfileindex][5] failed to perform indices:data/write/bulk[s] on node {130591932198}{E4qFPw3-TvizCAeB6ai_lw}{10.2.34.115}{10.2.34.115:25800}{master=true}
TransportException[transport stopped, action: indices:data/write/bulk[s][r]]
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:198)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-10-06 21:05:10,348][WARN ][cluster.action.shard ] [130591932414] failed to send failed shard to {130591932198}{E4qFPw3-TvizCAeB6ai_lw}{10.2.34.115}{10.2.34.115:25800}{master=true}
SendRequestTransportException[[130591932198][10.2.34.115:25800][internal:cluster/shard/failure]]; nested: TransportException[TransportService is closed stopped can't send request];
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:282)
at org.elasticsearch.cluster.action.shard.ShardStateAction.innerShardFailed(ShardStateAction.java:98)
at org.elasticsearch.cluster.action.shard.ShardStateAction.shardFailed(ShardStateAction.java:88)
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicationPhase$1.handleException(TransportReplicationAction.java:895)
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:198)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: TransportException[TransportService is closed stopped can't send request]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:303)
... 8 more
[2016-10-06 21:05:10,348][WARN ][cluster.action.shard ] [130591932414] failed to send failed shard to {130591932198}{E4qFPw3-TvizCAeB6ai_lw}{10.2.34.115}{10.2.34.115:25800}{master=true}
SendRequestTransportException[[130591932198][10.2.34.115:25800][internal:cluster/shard/failure]]; nested: TransportException[TransportService is closed stopped can't send request];


(Christian Dahlqvist) #2

Can you please provide some more details about your cluster? How are the nodes configured? How are they deployed? What is the specification of the nodes? How much data?


(Dinesh) #3

Hi Christian, we deployed ES by downloading the zip file. The ES data directory uses around 50G on each node and there is ample free space. Each node is a 32-core physical machine, but ours is a shared environment, so other processes also run on the machine. The heap has 3G of dedicated memory. We specify our settings in the elasticsearch.yml file. We have 2 indexes, one with 5 shards and the other with 10. We create the indices only once and never delete them. Please let me know if any other information is required. Thanks.

====== elasticsearch.yml file =========
cluster.routing.allocation.disk.watermark.low: 20gb
cluster.routing.allocation.disk.watermark.high: 500mb
node.name: 130591932414
node.master: true
node.data: true
discovery.zen.ping.timeout: 5s
discovery.zen.minimum_master_nodes: 2
indices.recovery.max_bytes_per_sec: 80mb
index.refresh_interval: 10s
index.merge.scheduler.max_thread_count: 1
http.port: 25700
transport.tcp.port: 25800
network.bind_host: 0.0.0.0
network.publish_host: 10.2.34.119
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [10.2.34.121,10.2.34.115,10.2.34.117]


(Christian Dahlqvist) #4

Can you please provide the output from the _cat/nodes API? Do all the nodes have the same configuration?
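
For reference, a minimal sketch of fetching that output from a script. It assumes the HTTP port 25700 from the posted elasticsearch.yml and that the node is reachable without authentication; the function names `cat_nodes_url` and `fetch_cat_nodes` are made up for illustration. The `v` query parameter asks the cat API to include the header row.

```python
from urllib.request import urlopen


def cat_nodes_url(host: str, http_port: int = 25700) -> str:
    """Build the _cat/nodes URL; 25700 is the http.port from the posted config."""
    return f"http://{host}:{http_port}/_cat/nodes?v"


def fetch_cat_nodes(host: str, http_port: int = 25700, timeout: float = 10.0) -> str:
    """Return the plain-text _cat/nodes table from one node (hypothetical helper)."""
    with urlopen(cat_nodes_url(host, http_port), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # 10.2.34.119 is the publish_host of the node whose config was posted.
    print(fetch_cat_nodes("10.2.34.119"))
```

Running it against any one node should be enough, since each node reports the whole cluster as it currently sees it.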


(Dinesh) #5

Yes, all the nodes have the same configuration. Thanks!

host        ip          heap.percent ram.percent load  node.role master name
10.2.34.117 10.2.34.117 25           99          27.92 d         m      130591932210
10.2.34.119 10.2.34.119 16           99          53.95 d         m      130591932414
10.2.34.115 10.2.34.115 15           99          30.35 d         *      130591932198
10.2.34.121 10.2.34.121 22           99          39.68 d         m      130591952482


(Christian Dahlqvist) #6

It may have nothing to do with the errors you are seeing, but as you have 4 master-eligible nodes, discovery.zen.minimum_master_nodes should be set to 3 (a majority of the 4) in order to avoid split-brain scenarios. I would recommend eliminating this as a potential issue. Is there anything else in the logs around the time the nodes crash?
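
To make the arithmetic behind that number explicit, here is a small sketch of the usual majority rule, (master-eligible nodes / 2) + 1 with integer division. The helper names are made up for illustration. The JSON body is what one would send to the cluster update settings API (`PUT _cluster/settings`) to change the value on a live cluster; the sketch only builds it and sends nothing.

```python
import json


def minimum_master_nodes(master_eligible: int) -> int:
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def settings_body(master_eligible: int) -> str:
    """JSON body for PUT _cluster/settings (built here, not sent)."""
    quorum = minimum_master_nodes(master_eligible)
    return json.dumps(
        {"persistent": {"discovery.zen.minimum_master_nodes": quorum}}
    )


if __name__ == "__main__":
    # The cluster in this thread has 4 master-eligible nodes.
    print(minimum_master_nodes(4))
    print(settings_body(4))
```

With the current value of 2, two pairs of nodes that lose contact with each other could each elect their own master; with 3, at most one side of any partition can form a majority.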

