All shards failed exception and Elasticsearch service stopped

Hi,
The Elasticsearch 5.3 service installed on our production server stopped by itself, and after looking at the log files I found the information below. Since the query is requesting 85,580 records, did all shards fail and Elasticsearch shut itself down? If that's the reason, how can I fix this issue? I looked at "max_result_window"; will setting its size to 10k help?

[2017-11-30T13:34:09,019][WARN ][r.suppressed             ] path: /elasticsearchlive/searchentry/_search, params: {index=elasticsearchlive, type=searchentry}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onInitialPhaseResult(AbstractSearchAsyncAction.java:223) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:58) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:148) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:51) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1032) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1134) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1112) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$7.onFailure(TransportService.java:629) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:598) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) [elasticsearch-5.3.0.jar:5.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: org.elasticsearch.transport.RemoteTransportException: [web1][10.100.6.2:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: Result window is too large, from + size must be less than or equal to: [10000] but was [85580]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.
	at org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:202) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:90) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:480) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:444) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:331) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:328) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:618) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:613) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.3.0.jar:5.3.0]
	... 3 more
[2017-11-30T15:31:03,379][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [web1] fatal error in thread [elasticsearch[web1][search][T#25]], exiting
java.lang.StackOverflowError: null
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]

Hi @iluvcode,

I think we have two events here. The first one happened at [2017-11-30T13:34:09,019] and is just a WARN. This warning means that you are trying to paginate too deep: it is raised when the from + size parameters exceed 10,000 (the default index.max_result_window). If you want to retrieve 85,580 documents, you may be interested in the scroll API or search_after; see the sketch below.
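
A minimal sketch of the scroll approach, assuming the node is reachable at http://localhost:9200 and using Python's requests library (the index and type names are taken from the log above, and the match_all query is only a placeholder):

```python
import requests

ES = "http://localhost:9200"  # assumption: node address
URL = ES + "/elasticsearchlive/searchentry/_search"  # index/type from the log

# Open a scroll context and fetch the first page of 1,000 hits.
resp = requests.post(
    URL,
    params={"scroll": "1m"},                          # keep the context alive for 1 minute
    json={"size": 1000, "query": {"match_all": {}}},  # placeholder query
).json()

scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

# Keep pulling pages until the scroll runs out of hits.
while hits:
    # ... process hits here ...
    resp = requests.post(
        ES + "/_search/scroll",
        json={"scroll": "1m", "scroll_id": scroll_id},
    ).json()
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
```

If you really do need from + size pagination beyond 10,000, the limit can be raised per index with PUT /elasticsearchlive/_settings and a body of {"index": {"max_result_window": 90000}}, but each shard then has to collect that many hits in memory for a single request, so scroll or search_after is usually the better choice.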

The second event happened about two hours later, at [2017-11-30T15:31:03,379], and caused the crash. It's not clear to me what might have caused this fatal error. If you have more evidence, please feel free to post it here and maybe we can determine the root cause.

Cheers,
LG

@luiz.santos: apart from the events I posted, I don't have any other evidence. The same issue happened on another production server this week. Is there a way to find out why the Elasticsearch service is crashing?

Thanks

Hi @iluvcode,

Some important pieces of information are (the sketch after this list shows one way to pull most of them from the cluster):

  • Elasticsearch version
  • how you are using Elasticsearch
  • number of indices
  • total number of shards
  • most frequent operation
  • any suspect query?
  • is your cluster suffering from out-of-memory errors?
  • etc.
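
A minimal sketch for gathering most of this, assuming the node is reachable at http://localhost:9200 and using Python's requests library:

```python
import requests

ES = "http://localhost:9200"  # assumption: node address

# Elasticsearch version reported by the node.
print(requests.get(ES).json()["version"]["number"])

# One row per index: health, doc count, store size, primaries/replicas.
print(requests.get(ES + "/_cat/indices?v").text)

# Cluster health: status and the total number of active shards.
health = requests.get(ES + "/_cluster/health").json()
print(health["status"], health["active_shards"])
```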

Did you follow the production guides?

Cheers,
LG

  • Elasticsearch version
    Elasticsearch 5.3.0
  • how you are using Elasticsearch
    We are using Elasticsearch for our public-facing ecommerce search.
  • number of indices (Index information)
    "elasticsearchlive": {
    "primaries": {
    "docs": {
    "count": 1378271,
    "deleted": 637080
    },
    "store": {
    "size_in_bytes": 20084703077,
    "throttle_time_in_millis": 0
    },
    "indexing": {
    "index_total": 9691855,
    "index_time_in_millis": 20032575,
    "index_current": 0,
    "index_failed": 0,
    "delete_total": 1601911,
    "delete_time_in_millis": 102978,
    "delete_current": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0
    },
    "get": {
    "total": 0,
    "time_in_millis": 0,
    "exists_total": 0,
    "exists_time_in_millis": 0,
    "missing_total": 0,
    "missing_time_in_millis": 0,
    "current": 0
    },
    "search": {
    "open_contexts": 0,
    "query_total": 157221949,
    "query_time_in_millis": 94346276,
    "query_current": 0,
    "fetch_total": 51648965,
    "fetch_time_in_millis": 30139609,
    "fetch_current": 0,
    "scroll_total": 8284560,
    "scroll_time_in_millis": 7317305,
    "scroll_current": 0,
    "suggest_total": 19167516,
    "suggest_time_in_millis": 18501423,
    "suggest_current": 0
    },
    "merges": {
    "current": 0,
    "current_docs": 0,
    "current_size_in_bytes": 0,
    "total": 42711,
    "total_time_in_millis": 42241303,
    "total_docs": 108222518,
    "total_size_in_bytes": 297588814453,
    "total_stopped_time_in_millis": 0,
    "total_throttled_time_in_millis": 15174382,
    "total_auto_throttle_in_bytes": 26214400
    },
    "refresh": {
    "total": 390096,
    "total_time_in_millis": 46504445,
    "listeners": 0
    },
    "flush": {
    "total": 2022,
    "total_time_in_millis": 444978
    },
    "warmer": {
    "current": 0,
    "total": 391854,
    "total_time_in_millis": 174148
    },
    "query_cache": {
    "memory_size_in_bytes": 238823384,
    "total_count": 1615358046,
    "hit_count": 267504494,
    "miss_count": 1347853552,
    "cache_size": 74725,
    "cache_count": 3414241,
    "evictions": 3339516
    },
    "fielddata": {
    "memory_size_in_bytes": 0,
    "evictions": 0
    },
    "completion": {
    "size_in_bytes": 102963592
    },
    "segments": {
    "count": 137,
    "memory_in_bytes": 118387711,
    "terms_memory_in_bytes": 112327761,
    "stored_fields_memory_in_bytes": 623184,
    "term_vectors_memory_in_bytes": 0,
    "norms_memory_in_bytes": 368832,
    "points_memory_in_bytes": 1301218,
    "doc_values_memory_in_bytes": 3766716,
    "index_writer_memory_in_bytes": 0,
    "version_map_memory_in_bytes": 99111,
    "fixed_bit_set_memory_in_bytes": 257968,
    "max_unsafe_auto_id_timestamp": -1,
    "file_sizes": {}
    },
    "translog": {
    "operations": 103654,
    "size_in_bytes": 380121790
    },
    "request_cache": {
    "memory_size_in_bytes": 0,
    "evictions": 0,
    "hit_count": 30359,
    "miss_count": 8242
    },
    "recovery": {
    "current_as_source": 0,
    "current_as_target": 0,
    "throttle_time_in_millis": 0
    }
    },
    "total": {
    "docs": {
    "count": 1378271,
    "deleted": 637080
    },
    "store": {
    "size_in_bytes": 20084703077,
    "throttle_time_in_millis": 0
    },
    "indexing": {
    "index_total": 9691855,
    "index_time_in_millis": 20032575,
    "index_current": 0,
    "index_failed": 0,
    "delete_total": 1601911,
    "delete_time_in_millis": 102978,
    "delete_current": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0
    },
    "get": {
    "total": 0,
    "time_in_millis": 0,
    "exists_total": 0,
    "exists_time_in_millis": 0,
    "missing_total": 0,
    "missing_time_in_millis": 0,
    "current": 0
    },
    "search": {
    "open_contexts": 0,
    "query_total": 157221949,
    "query_time_in_millis": 94346276,
    "query_current": 0,
    "fetch_total": 51648965,
    "fetch_time_in_millis": 30139609,
    "fetch_current": 0,
    "scroll_total": 8284560,
    "scroll_time_in_millis": 7317305,
    "scroll_current": 0,
    "suggest_total": 19167516,
    "suggest_time_in_millis": 18501423,
    "suggest_current": 0
    },
    "merges": {
    "current": 0,
    "current_docs": 0,
    "current_size_in_bytes": 0,
    "total": 42711,
    "total_time_in_millis": 42241303,
    "total_docs": 108222518,
    "total_size_in_bytes": 297588814453,
    "total_stopped_time_in_millis": 0,
    "total_throttled_time_in_millis": 15174382,
    "total_auto_throttle_in_bytes": 26214400
    },
    "refresh": {
    "total": 390096,
    "total_time_in_millis": 46504445,
    "listeners": 0
    },
    "flush": {
    "total": 2022,
    "total_time_in_millis": 444978
    },
    "warmer": {
    "current": 0,
    "total": 391854,
    "total_time_in_millis": 174148
    },
    "query_cache": {
    "memory_size_in_bytes": 238823384,
    "total_count": 1615358046,
    "hit_count": 267504494,
    "miss_count": 1347853552,
    "cache_size": 74725,
    "cache_count": 3414241,
    "evictions": 3339516
    },
    "fielddata": {
    "memory_size_in_bytes": 0,
    "evictions": 0
    },
    "completion": {
    "size_in_bytes": 102963592
    },
    "segments": {
    "count": 137,
    "memory_in_bytes": 118387711,
    "terms_memory_in_bytes": 112327761,
    "stored_fields_memory_in_bytes": 623184,
    "term_vectors_memory_in_bytes": 0,
    "norms_memory_in_bytes": 368832,
    "points_memory_in_bytes": 1301218,
    "doc_values_memory_in_bytes": 3766716,
    "index_writer_memory_in_bytes": 0,
    "version_map_memory_in_bytes": 99111,
    "fixed_bit_set_memory_in_bytes": 257968,
    "max_unsafe_auto_id_timestamp": -1,
    "file_sizes": {}
    },
    "translog": {
    "operations": 103654,
    "size_in_bytes": 380121790
    },
    "request_cache": {
    "memory_size_in_bytes": 0,
    "evictions": 0,
    "hit_count": 30359,
    "miss_count": 8242
    },
    "recovery": {
    "current_as_source": 0,
    "current_as_target": 0,
    "throttle_time_in_millis": 0
    }
    }
    }
  • most frequent operation
    Search query
  • any suspect query?
    None
  • is your cluster suffering from out-of-memory errors?
    Our production server uses SSD drives and has 120 GB of memory, of which we allocated 30 GB to Elasticsearch. A sketch for checking actual heap usage follows this list.
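
For completeness, this is a minimal sketch of how we could verify whether the node is under heap pressure, assuming the node is reachable at http://localhost:9200 and using Python's requests library:

```python
import requests

ES = "http://localhost:9200"  # assumption: node address

# JVM heap usage per node; values that stay close to 100% would suggest memory pressure.
stats = requests.get(ES + "/_nodes/stats/jvm").json()
for node in stats["nodes"].values():
    mem = node["jvm"]["mem"]
    print(node["name"], str(mem["heap_used_percent"]) + "% of", mem["heap_max_in_bytes"], "bytes")
```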

Yes, I did follow the production guides. We never had this issue when using Elasticsearch 1.7.1; after upgrading to 5.3, I started seeing it.
Please let me know if you need more information.

Thanks

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.