We have an Elasticsearch 2.3.3 cluster with 5 data nodes and the following ES configs:
index.number_of_shards: 1
index.number_of_replicas: 4
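With 1 shard and 4 replicas, that should work out to one copy of the shard on every data node. For reference, the layout can be confirmed with something like the following (curl against any node, assuming the default HTTP port 9200; the host and index name are just examples taken from the output further down):

curl -s "http://192.168.104.110:9200/_cat/shards/qa-hsbcuk1?v"
# should list 5 rows for shard 0: one p (primary) and four r (replica), one copy per data node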
The rest is pretty much the defaults. Everything is typically fine, but a couple of our indices, when under heavy reads, will exhibit the following stack trace in the ES log:
[2017-05-12 04:33:55,745][DEBUG][action.search ] [qa13-ost-1020x-h-ds01] All shards failed for phase: [query_fetch]
RemoteTransportException[[qa13-ost-1020x-h-as01][192.168.104.110:9300][indices:data/read/search[phase/query+fetch]]]; nested: ShardNotFoundException[no such shard];
Caused by: [qa-hsbcuk1][[qa-hsbcuk1][0]] ShardNotFoundException[no such shard]
    at org.elasticsearch.index.IndexService.shardSafe(IndexService.java:197)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:639)
    at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:620)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:463)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:392)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:389)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
...
These failures end up returning 503s to our application, which makes its calls to ES via the REST API. The episodes are brief, and afterwards the shards return to green.
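The same thing should be visible outside the application too; during one of these episodes a plain curl search against the index (same host/port assumptions as above, default match-all search just for illustration) returns the 503 as well:

curl -s -o /dev/null -w "%{http_code}\n" "http://192.168.104.110:9200/qa-hsbcuk1/_search"
# prints 200 normally, 503 while the shard failure is occurring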
As we attempt to debug this, we've also noticed the recoveries below, which look to correspond to the same timeframes where we're seeing the errors above (the command we pull these with is sketched after the output). They start with a STORE entry from the node that appears to hold the primary shard:
"qa-hsbcuk1" : {
"shards" : [ {
"id" : 0,
"type" : "STORE",
"stage" : "DONE",
"primary" : true,
"start_time" : "2017-05-12T08:33:55.817Z",
"start_time_in_millis" : 1494578035817,
"stop_time" : "2017-05-12T08:33:55.827Z",
"stop_time_in_millis" : 1494578035827,
"total_time" : "10ms",
"total_time_in_millis" : 10,
"source" : {
"id" : "QZdQAM-oQ_e__vUeAzNOsw",
"host" : "192.168.104.110",
"transport_address" : "192.168.104.110:9300",
"ip" : "192.168.104.110",
"name" : "qa13-ost-1020x-h-as01"
},
"target" : {
"id" : "QZdQAM-oQ_e__vUeAzNOsw",
"host" : "192.168.104.110",
"transport_address" : "192.168.104.110:9300",
"ip" : "192.168.104.110",
"name" : "qa13-ost-1020x-h-as01"
},
"index" : {
"size" : {
"total" : "0b",
"total_in_bytes" : 0,
"reused" : "0b",
"reused_in_bytes" : 0,
"recovered" : "0b",
"recovered_in_bytes" : 0,
"percent" : "0.0%"
},
"files" : {
"total" : 0,
"reused" : 0,
"recovered" : 0,
"percent" : "0.0%"
},
"total_time" : "0s",
"total_time_in_millis" : 0,
"source_throttle_time" : "-1",
"source_throttle_time_in_millis" : 0,
"target_throttle_time" : "-1",
"target_throttle_time_in_millis" : 0
},
"translog" : {
"recovered" : 0,
"total" : 0,
"percent" : "100.0%",
"total_on_start" : 0,
"total_time" : "9ms",
"total_time_in_millis" : 9
},
"verify_index" : {
"check_index_time" : "0s",
"check_index_time_in_millis" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
}:
Followed by 4 REPLICA entries (first shown here):
  }, {
    "id" : 0,
    "type" : "REPLICA",
    "stage" : "DONE",
    "primary" : false,
    "start_time" : "2017-05-12T08:33:55.881Z",
    "start_time_in_millis" : 1494578035881,
    "stop_time" : "2017-05-12T08:33:55.925Z",
    "stop_time_in_millis" : 1494578035925,
    "total_time" : "43ms",
    "total_time_in_millis" : 43,
    "source" : {
      "id" : "QZdQAM-oQ_e__vUeAzNOsw",
      "host" : "192.168.104.110",
      "transport_address" : "192.168.104.110:9300",
      "ip" : "192.168.104.110",
      "name" : "qa13-ost-1020x-h-as01"
    },
    "target" : {
      "id" : "v25bTq0sQcadYs-ORzisJg",
      "host" : "192.168.104.109",
      "transport_address" : "192.168.104.109:9300",
      "ip" : "192.168.104.109",
      "name" : "qa13-ost-1020x-h-ds01"
    },
    "index" : {
      "size" : {
        "total" : "130b",
        "total_in_bytes" : 130,
        "reused" : "0b",
        "reused_in_bytes" : 0,
        "recovered" : "130b",
        "recovered_in_bytes" : 130,
        "percent" : "100.0%"
      },
      "files" : {
        "total" : 1,
        "reused" : 0,
        "recovered" : 1,
        "percent" : "100.0%"
      },
      "total_time" : "30ms",
      "total_time_in_millis" : 30,
      "source_throttle_time" : "0s",
      "source_throttle_time_in_millis" : 0,
      "target_throttle_time" : "-1",
      "target_throttle_time_in_millis" : 0
    },
    "translog" : {
      "recovered" : 0,
      "total" : 0,
      "percent" : "100.0%",
      "total_on_start" : 0,
      "total_time" : "9ms",
      "total_time_in_millis" : 9
    },
    "verify_index" : {
      "check_index_time" : "0s",
      "check_index_time_in_millis" : 0,
      "total_time" : "0s",
      "total_time_in_millis" : 0
    }
....
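For reference, the recovery output above comes from the recovery API; we pull it with something along these lines (same host/port assumptions as before, ?human to get the readable time/size strings):

curl -s "http://192.168.104.110:9200/qa-hsbcuk1/_recovery?human"
# or, for a compact per-shard summary:
curl -s "http://192.168.104.110:9200/_cat/recovery/qa-hsbcuk1?v"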
It's unclear to us why this is happening.