We have an Elasticsearch 2.3.3 cluster with 5 data nodes and the following ES configs:
index.number_of_shards: 1
index.number_of_replicas: 4
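With 1 shard and 4 replicas, that should work out to one copy of the shard on every data node. For reference, the layout can be confirmed with something like the following (curl against any node, assuming the default HTTP port 9200; the host and index name are just examples taken from the output further down):

curl -s "http://192.168.104.110:9200/_cat/shards/qa-hsbcuk1?v"
# should list 5 rows for shard 0: one p (primary) and four r (replica), one copy per data node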
The rest is pretty much the defaults. Everything is typically fine, but a couple of our indices, when under heavy reads, will exhibit the following stack trace in the ES log:
[2017-05-12 04:33:55,745][DEBUG][action.search ] [qa13-ost-1020x-h-ds01] All shards failed for phase: [query_fetch]
RemoteTransportException[[qa13-ost-1020x-h-as01][192.168.104.110:9300][indices:data/read/search[phase/query+fetch]]]; nested: ShardNotFoundException[no such shard];
Caused by: [qa-hsbcuk1][[qa-hsbcuk1][0]] ShardNotFoundException[no such shard]
    at org.elasticsearch.index.IndexService.shardSafe(IndexService.java:197)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:639)
    at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:620)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:463)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:392)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:389)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
...
These failures end up returning 503s to our application, which makes its calls to ES via the REST API. The episodes are brief, and afterwards the shards return to green.
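The same thing should be visible outside the application too; during one of these episodes a plain curl search against the index (same host/port assumptions as above, default match-all search just for illustration) returns the 503 as well:

curl -s -o /dev/null -w "%{http_code}\n" "http://192.168.104.110:9200/qa-hsbcuk1/_search"
# prints 200 normally, 503 while the shard failure is occurring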
As we attempt to debug this, we've also noticed the recoveries below, which look to correspond to the same timeframes where we're seeing the errors above (the command we pull these with is sketched after the output). They start with a STORE entry from the node that appears to hold the primary shard:
"qa-hsbcuk1" : {
"shards" : [ {
"id" : 0,
"type" : "STORE",
"stage" : "DONE",
"primary" : true,
"start_time" : "2017-05-12T08:33:55.817Z",
"start_time_in_millis" : 1494578035817,
"stop_time" : "2017-05-12T08:33:55.827Z",
"stop_time_in_millis" : 1494578035827,
"total_time" : "10ms",
"total_time_in_millis" : 10,
"source" : {
"id" : "QZdQAM-oQ_e__vUeAzNOsw",
"host" : "192.168.104.110",
"transport_address" : "192.168.104.110:9300",
"ip" : "192.168.104.110",
"name" : "qa13-ost-1020x-h-as01"
},
"target" : {
"id" : "QZdQAM-oQ_e__vUeAzNOsw",
"host" : "192.168.104.110",
"transport_address" : "192.168.104.110:9300",
"ip" : "192.168.104.110",
"name" : "qa13-ost-1020x-h-as01"
},
"index" : {
"size" : {
"total" : "0b",
"total_in_bytes" : 0,
"reused" : "0b",
"reused_in_bytes" : 0,
"recovered" : "0b",
"recovered_in_bytes" : 0,
"percent" : "0.0%"
},
"files" : {
"total" : 0,
"reused" : 0,
"recovered" : 0,
"percent" : "0.0%"
},
"total_time" : "0s",
"total_time_in_millis" : 0,
"source_throttle_time" : "-1",
"source_throttle_time_in_millis" : 0,
"target_throttle_time" : "-1",
"target_throttle_time_in_millis" : 0
},
"translog" : {
"recovered" : 0,
"total" : 0,
"percent" : "100.0%",
"total_on_start" : 0,
"total_time" : "9ms",
"total_time_in_millis" : 9
},
"verify_index" : {
"check_index_time" : "0s",
"check_index_time_in_millis" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
}:
Followed by 4 REPLICA entries (first shown here):
  }, {
    "id" : 0,
    "type" : "REPLICA",
    "stage" : "DONE",
    "primary" : false,
    "start_time" : "2017-05-12T08:33:55.881Z",
    "start_time_in_millis" : 1494578035881,
    "stop_time" : "2017-05-12T08:33:55.925Z",
    "stop_time_in_millis" : 1494578035925,
    "total_time" : "43ms",
    "total_time_in_millis" : 43,
    "source" : {
      "id" : "QZdQAM-oQ_e__vUeAzNOsw",
      "host" : "192.168.104.110",
      "transport_address" : "192.168.104.110:9300",
      "ip" : "192.168.104.110",
      "name" : "qa13-ost-1020x-h-as01"
    },
    "target" : {
      "id" : "v25bTq0sQcadYs-ORzisJg",
      "host" : "192.168.104.109",
      "transport_address" : "192.168.104.109:9300",
      "ip" : "192.168.104.109",
      "name" : "qa13-ost-1020x-h-ds01"
    },
    "index" : {
      "size" : {
        "total" : "130b",
        "total_in_bytes" : 130,
        "reused" : "0b",
        "reused_in_bytes" : 0,
        "recovered" : "130b",
        "recovered_in_bytes" : 130,
        "percent" : "100.0%"
      },
      "files" : {
        "total" : 1,
        "reused" : 0,
        "recovered" : 1,
        "percent" : "100.0%"
      },
      "total_time" : "30ms",
      "total_time_in_millis" : 30,
      "source_throttle_time" : "0s",
      "source_throttle_time_in_millis" : 0,
      "target_throttle_time" : "-1",
      "target_throttle_time_in_millis" : 0
    },
    "translog" : {
      "recovered" : 0,
      "total" : 0,
      "percent" : "100.0%",
      "total_on_start" : 0,
      "total_time" : "9ms",
      "total_time_in_millis" : 9
    },
    "verify_index" : {
      "check_index_time" : "0s",
      "check_index_time_in_millis" : 0,
      "total_time" : "0s",
      "total_time_in_millis" : 0
    }
....
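For reference, the recovery output above comes from the recovery API; we pull it with something along these lines (same host/port assumptions as before, ?human to get the readable time/size strings):

curl -s "http://192.168.104.110:9200/qa-hsbcuk1/_recovery?human"
# or, for a compact per-shard summary:
curl -s "http://192.168.104.110:9200/_cat/recovery/qa-hsbcuk1?v"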
It's unclear to us why this is happening.