ES node under heavy reads throws stacktraces and triggers recoveries, unclear why?

We have an Elasticsearch 2.3.3 cluster with 5 data nodes and the following ES config:

index.number_of_shards: 1
index.number_of_replicas: 4
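
For reference, a minimal sketch of checking those settings per index (Python with the requests library; the host URL and index name are placeholders for our environment):

import requests

ES = "http://localhost:9200"  # placeholder host

resp = requests.get(ES + "/qa-hsbcuk1/_settings")
resp.raise_for_status()
# ES 2.x returns {"<index>": {"settings": {"index": {...}}}}
index_settings = resp.json()["qa-hsbcuk1"]["settings"]["index"]
print(index_settings["number_of_shards"], index_settings["number_of_replicas"])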

The rest of the config is pretty much the defaults. Everything is typically fine, but under heavy reads a couple of our indices exhibit the following stacktrace in the ES log:

[2017-05-12 04:33:55,745][DEBUG][action.search            ] [qa13-ost-1020x-h-ds01] All shards failed for phase: [query_fetch]
RemoteTransportException[[qa13-ost-1020x-h-as01][192.168.104.110:9300][indices:data/read/search[phase/query+fetch]]]; nested: ShardNotFoundException[no such shard];
Caused by: [qa-hsbcuk1][[qa-hsbcuk1][0]] ShardNotFoundException[no such shard]
	at org.elasticsearch.index.IndexService.shardSafe(IndexService.java:197)
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:639)
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:620)
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:463)
	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:392)
	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:389)
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
	at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
...

These failures end up returning 503s to our application, which calls ES via the REST API. They're brief, and afterwards the shards return to green.
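
To illustrate what the application sees, here is a rough sketch of a search call that retries briefly on 503 (Python with the requests library; the host, index name, and retry policy are illustrative placeholders, not our actual client code):

import time
import requests

ES = "http://localhost:9200"  # placeholder host

def search_with_retry(index, body, attempts=3, backoff=0.5):
    # Retry briefly on 503, since the "all shards failed" window above is short
    # and the shards come back green on their own.
    for attempt in range(attempts):
        resp = requests.post("%s/%s/_search" % (ES, index), json=body)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        time.sleep(backoff * (2 ** attempt))
    resp.raise_for_status()  # still 503 after all attempts

# e.g. search_with_retry("qa-hsbcuk1", {"query": {"match_all": {}}})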

As we attempt to debug this, we've also noticed recoveries that appear to correspond to the same timeframes as the failures above (a sketch of how we poll for them follows the recovery listing below). They start with a STORE recovery on the node that appears to hold the primary shard:

 "qa-hsbcuk1" : {
    "shards" : [ {
      "id" : 0,
      "type" : "STORE",
      "stage" : "DONE",
      "primary" : true,
      "start_time" : "2017-05-12T08:33:55.817Z",
      "start_time_in_millis" : 1494578035817,
      "stop_time" : "2017-05-12T08:33:55.827Z",
      "stop_time_in_millis" : 1494578035827,
      "total_time" : "10ms",
      "total_time_in_millis" : 10,
      "source" : {
        "id" : "QZdQAM-oQ_e__vUeAzNOsw",
        "host" : "192.168.104.110",
        "transport_address" : "192.168.104.110:9300",
        "ip" : "192.168.104.110",
        "name" : "qa13-ost-1020x-h-as01"
      },
      "target" : {
        "id" : "QZdQAM-oQ_e__vUeAzNOsw",
        "host" : "192.168.104.110",
        "transport_address" : "192.168.104.110:9300",
        "ip" : "192.168.104.110",
        "name" : "qa13-ost-1020x-h-as01"
      },
      "index" : {
        "size" : {
          "total" : "0b",
          "total_in_bytes" : 0,
          "reused" : "0b",
          "reused_in_bytes" : 0,
          "recovered" : "0b",
          "recovered_in_bytes" : 0,
          "percent" : "0.0%"
        },
        "files" : {
          "total" : 0,
          "reused" : 0,
          "recovered" : 0,
          "percent" : "0.0%"
        },
        "total_time" : "0s",
        "total_time_in_millis" : 0,
        "source_throttle_time" : "-1",
        "source_throttle_time_in_millis" : 0,
        "target_throttle_time" : "-1",
        "target_throttle_time_in_millis" : 0
      },
      "translog" : {
        "recovered" : 0,
        "total" : 0,
        "percent" : "100.0%",
        "total_on_start" : 0,
        "total_time" : "9ms",
        "total_time_in_millis" : 9
      },
      "verify_index" : {
        "check_index_time" : "0s",
        "check_index_time_in_millis" : 0,
        "total_time" : "0s",
        "total_time_in_millis" : 0
      }

Followed by 4 REPLICA entries:

  }, {
      "id" : 0,
      "type" : "REPLICA",
      "stage" : "DONE",
      "primary" : false,
      "start_time" : "2017-05-12T08:33:55.881Z",
      "start_time_in_millis" : 1494578035881,
      "stop_time" : "2017-05-12T08:33:55.925Z",
      "stop_time_in_millis" : 1494578035925,
      "total_time" : "43ms",
      "total_time_in_millis" : 43,
      "source" : {
        "id" : "QZdQAM-oQ_e__vUeAzNOsw",
        "host" : "192.168.104.110",
        "transport_address" : "192.168.104.110:9300",
        "ip" : "192.168.104.110",
        "name" : "qa13-ost-1020x-h-as01"
      },
      "target" : {
        "id" : "v25bTq0sQcadYs-ORzisJg",
        "host" : "192.168.104.109",
        "transport_address" : "192.168.104.109:9300",
        "ip" : "192.168.104.109",
        "name" : "qa13-ost-1020x-h-ds01"
      },
      "index" : {
        "size" : {
          "total" : "130b",
          "total_in_bytes" : 130,
          "reused" : "0b",
          "reused_in_bytes" : 0,
          "recovered" : "130b",
          "recovered_in_bytes" : 130,
          "percent" : "100.0%"
        },
        "files" : {
          "total" : 1,
          "reused" : 0,
          "recovered" : 1,
          "percent" : "100.0%"
        },
        "total_time" : "30ms",
        "total_time_in_millis" : 30,
        "source_throttle_time" : "0s",
        "source_throttle_time_in_millis" : 0,
        "target_throttle_time" : "-1",
        "target_throttle_time_in_millis" : 0
      },
      "translog" : {
        "recovered" : 0,
        "total" : 0,
        "percent" : "100.0%",
        "total_on_start" : 0,
        "total_time" : "9ms",
        "total_time_in_millis" : 9
      },
      "verify_index" : {
        "check_index_time" : "0s",
        "check_index_time_in_millis" : 0,
        "total_time" : "0s",
        "total_time_in_millis" : 0
      }
....
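
A minimal sketch of polling the recovery API that produced the output above (Python with the requests library; the host URL is a placeholder, and the human flag is set so the readable timestamps appear):

import requests

ES = "http://localhost:9200"  # placeholder host

resp = requests.get(ES + "/qa-hsbcuk1/_recovery", params={"human": "true"})
resp.raise_for_status()
for shard in resp.json().get("qa-hsbcuk1", {}).get("shards", []):
    # Note the type (STORE vs REPLICA), stage, and start time of each recovery
    print(shard["type"], shard["stage"], "primary=%s" % shard["primary"],
          shard["start_time"],
          shard["source"].get("name"), "->", shard["target"].get("name"))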

It's unclear to us why this is happening.

You need to check your logs to find out why those nodes or shards are failing. The log you provided is too late; by that point the shard has already disappeared.
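
For example, one way to capture more detail before the next occurrence is to transiently raise the relevant log levels via the cluster settings API. A hedged sketch (Python with the requests library; the host URL is a placeholder and the logger names are a best guess at the relevant 2.x loggers):

import requests

ES = "http://localhost:9200"  # placeholder host

# Transiently raise logging for recovery and cluster-state handling so the
# events leading up to the shard failure land in the log; adjust as needed.
resp = requests.put(ES + "/_cluster/settings", json={
    "transient": {
        "logger.indices.recovery": "DEBUG",
        "logger.cluster.service": "DEBUG"
    }
})
resp.raise_for_status()
print(resp.json())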
