Search rejected execution on all indexes

We've been having an ongoing issue with our Elasticsearch server. Roughly once a week it becomes unresponsive to search queries (other request types, like _mapping, continue to work fine). Below is a snippet of one of the entries that keep being added; it seems to be logged for every index we have.

What puzzles me is that the logs report on indexes that are not even being used, so why would a query fail to execute? (stld_song_search_dev is a development index.)

Any help is greatly appreciated as this is our production server!

[2015-06-11 13:17:08,198][DEBUG][action.search.type       ] [search1] [stld_song_search_dev][2], node[oY5aYQSLQWOpVyS_DqeOkQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@561fc007] lastShard [true]
    org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@398541f8
    	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
    	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
    	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
    	at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:551)
    	at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:228)
    	at org.elasticsearch.action.search.type.TransportSearchCountAction$AsyncAction.sendExecuteFirstPhase(TransportSearchCountAction.java:71)
    	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:176)
    	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:158)
    	at org.elasticsearch.action.search.type.TransportSearchCountAction.doExecute(TransportSearchCountAction.java:55)
    	at org.elasticsearch.action.search.type.TransportSearchCountAction.doExecute(TransportSearchCountAction.java:45)
    	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
    	at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:108)
    	at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
    	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
    	at org.elasticsearch.action.search.TransportMultiSearchAction.doExecute(TransportMultiSearchAction.java:62)
    	at org.elasticsearch.action.search.TransportMultiSearchAction.doExecute(TransportMultiSearchAction.java:39)
    	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
    	at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:98)
    	at org.elasticsearch.client.support.AbstractClient.multiSearch(AbstractClient.java:364)
    	at org.elasticsearch.action.search.MultiSearchRequestBuilder.doExecute(MultiSearchRequestBuilder.java:78)
    	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
    	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
    	at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:73)
    	at org.elasticsearch.search.suggest.phrase.PhraseSuggester.fetchMatchingDocCountResponses(PhraseSuggester.java:184)
    	at org.elasticsearch.search.suggest.phrase.PhraseSuggester.collate(PhraseSuggester.java:147)
    	at org.elasticsearch.search.suggest.phrase.PhraseSuggester.innerExecute(PhraseSuggester.java:108)
    	at org.elasticsearch.search.suggest.phrase.PhraseSuggester.innerExecute(PhraseSuggester.java:53)
    	at org.elasticsearch.search.suggest.Suggester.execute(Suggester.java:42)
    	at org.elasticsearch.search.suggest.SuggestPhase.execute(SuggestPhase.java:85)
    	at org.elasticsearch.search.suggest.SuggestPhase.execute(SuggestPhase.java:74)
    	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:170)
    	at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:272)
    	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:283)
    	at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:231)
    	at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:228)
    	at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    	at java.lang.Thread.run(Thread.java:745)

I believe the issue is caused by the search thread pool queue filling up, but I'm just not able to see why this is happening, how to debug it, or how to fix it.
http://imgur.com/OyRRaFz
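For reference, the rejections can be confirmed per node with the cat thread pool API. A minimal sketch, assuming Elasticsearch 1.x and a node reachable on localhost:9200:

    # Show active, queued, and rejected counts for the search thread pool
    curl 'localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected'

If search.rejected keeps climbing while CPU and memory stay low, the queue itself is the bottleneck rather than raw machine load.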

Full thread pools mean your cluster is overloaded. What sort of monitoring do you have in place?

We're monitoring the system with Graphite, so we have full CPU, memory, disk, etc. metrics. All of those look very low and steady.

We also recently installed Marvel to help debug this issue, and it shows the same low load except for a sudden spike in the search thread pool.
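While the spike is in progress, the hot threads API should show what the search threads are actually busy with. A sketch, again assuming a node on localhost:9200:

    # Dump stack samples of the busiest threads on each node
    curl 'localhost:9200/_nodes/hot_threads?threads=5'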

I'm assuming that spike is caused by some kind of application request, but is there any way to prevent the Elasticsearch node from becoming completely unresponsive during the spike? It never seems to recover.
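For reference, one stopgap is enlarging the search queue so short bursts don't trigger rejections. In 1.x the thread pool settings are dynamic, so this can be tried at runtime via the cluster settings API; a sketch (2000 is an arbitrary example value, and this only buys headroom rather than fixing whatever floods the queue):

    curl -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": {
        "threadpool.search.queue_size": 2000
      }
    }'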

Thanks for any help!

You really need to find out what is generating these searches.
Try turning the slowlog thresholds down to catch what is running and trace it back from there.

Thanks for your help, this was our next step. We currently have the search slowlog configured as follows, but it's not catching anything.

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

I'll try lowering the trace threshold to 200ms and see if we can narrow down the queries causing the issues.
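Since the slowlog thresholds are dynamic index settings, they can be lowered without a restart. A sketch, assuming 1.x (_all can be swapped for a specific index name):

    curl -XPUT 'localhost:9200/_all/_settings' -d '{
      "index.search.slowlog.threshold.query.trace": "200ms"
    }'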

On a side note, we did make some application changes to prevent certain suggest queries from running all the time (a rough sketch of that kind of guard is included below, after the query). We're using the following to do "did you mean?"-style suggestions.

{
  "size": 0,
  "suggest": {
    "text": "search keyword not found",
    "simple_phrase": {
      "phrase": {
        "field": "title",
        "size": 1,
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        },
        "collate": {
          "query": {
            "match": {
              "{{field_name}}": "{{suggestion}}"
            }
          },
          "params": {
            "field_name": "title"
          },
          "preference": "_primary"
        }
      }
    }
  }
}

These suggest queries seem to be pretty expensive, since they take 250ms on average to return. And from what I've seen with Elasticsearch, that's a long time (queries usually return in under 30ms).
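An application-side guard like the one mentioned above could look roughly like this: only send the expensive suggest request when the main search comes back empty. This is a hypothetical sketch; the index name stld_song_search, the suggest.json body file, and the jq dependency are all assumptions:

    # Run the cheap main search first (count only)
    hits=$(curl -s 'localhost:9200/stld_song_search/_search?search_type=count' \
      -d '{"query": {"match": {"title": "search keyword not found"}}}' | jq '.hits.total')

    # Only fall back to the "did you mean?" suggestion when nothing matched
    if [ "$hits" -eq 0 ]; then
      # suggest.json holds the phrase-suggest request body shown above
      curl -s 'localhost:9200/stld_song_search/_search' -d @suggest.json
    fi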

250ms is still pretty quick :stuck_out_tongue:

What index pattern are you using? And are documents added continuously?