All shards failed exception and Elasticsearch service stopped

Hi,
The Elasticsearch 5.3 service installed on our production server stopped by itself, and after looking at the log files I found the information below. Since the query is requesting 85,580 records, did all shards fail and Elasticsearch shut itself down? If that's the reason, how can I fix this issue? I looked at "max_result_window"; will setting its size to 10k help?

[2017-11-30T13:34:09,019][WARN ][r.suppressed             ] path: /elasticsearchlive/searchentry/_search, params: {index=elasticsearchlive, type=searchentry}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onInitialPhaseResult(AbstractSearchAsyncAction.java:223) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:58) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:148) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:51) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1032) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1134) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1112) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$7.onFailure(TransportService.java:629) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:598) [elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) [elasticsearch-5.3.0.jar:5.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: org.elasticsearch.transport.RemoteTransportException: [web1][10.100.6.2:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: Result window is too large, from + size must be less than or equal to: [10000] but was [85580]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.
	at org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:202) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:90) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:480) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:444) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:331) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:328) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:618) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:613) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.3.0.jar:5.3.0]
	... 3 more
[2017-11-30T15:31:03,379][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [web1] fatal error in thread [elasticsearch[web1][search][T#25]], exiting
java.lang.StackOverflowError: null
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
	at org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1309) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]

Hi @iluvcode,

I think we have two events here. The first one happened at [2017-11-30T13:34:09,019] and is just a WARN. This warning means that you are trying to paginate too deep: it is raised when the from + size parameters exceed 10,000 (the default index.max_result_window). If you want to retrieve 85,580 documents, you may be interested in the scroll API or search_after; see the sketch below.
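
A minimal sketch of the scroll approach, assuming the node is reachable at http://localhost:9200 and using Python's requests library (the index and type names are taken from the log above, and the match_all query is only a placeholder):

```python
import requests

ES = "http://localhost:9200"  # assumption: node address
URL = ES + "/elasticsearchlive/searchentry/_search"  # index/type from the log

# Open a scroll context and fetch the first page of 1,000 hits.
resp = requests.post(
    URL,
    params={"scroll": "1m"},                          # keep the context alive for 1 minute
    json={"size": 1000, "query": {"match_all": {}}},  # placeholder query
).json()

scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

# Keep pulling pages until the scroll runs out of hits.
while hits:
    # ... process hits here ...
    resp = requests.post(
        ES + "/_search/scroll",
        json={"scroll": "1m", "scroll_id": scroll_id},
    ).json()
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
```

If you really do need from + size pagination beyond 10,000, the limit can be raised per index with PUT /elasticsearchlive/_settings and a body of {"index": {"max_result_window": 90000}}, but each shard then has to collect that many hits in memory for a single request, so scroll or search_after is usually the better choice.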

The second event happened about two hours later, at [2017-11-30T15:31:03,379], and caused the crash. It's not clear to me what might have caused this fatal error. If you have more evidence, please feel free to post it here and maybe we can determine the root cause.

Cheers,
LG

@luiz.santos: apart from the events I posted, I don't have any other evidence. The same issue happened on another production server this week. Is there a way to find out why the Elasticsearch service is crashing?

Thanks

Hi @iluvcode,

Some important pieces of information are (the sketch after this list shows one way to pull most of them from the cluster):

  • Elasticsearch version
  • how you are using Elasticsearch
  • number of indices
  • total number of shards
  • most frequent operation
  • any suspect query?
  • is your cluster suffering from out-of-memory errors?
  • etc.
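
A minimal sketch for gathering most of this, assuming the node is reachable at http://localhost:9200 and using Python's requests library:

```python
import requests

ES = "http://localhost:9200"  # assumption: node address

# Elasticsearch version reported by the node.
print(requests.get(ES).json()["version"]["number"])

# One row per index: health, doc count, store size, primaries/replicas.
print(requests.get(ES + "/_cat/indices?v").text)

# Cluster health: status and the total number of active shards.
health = requests.get(ES + "/_cluster/health").json()
print(health["status"], health["active_shards"])
```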

Did you follow the production guides?

Cheers,
LG

  • Elasticsearch version
    Elasticsearch 5.3.0
  • how you are using Elasticsearch
    We are using Elasticsearch for our public-facing ecommerce search.
  • number of indices (Index information)
    "elasticsearchlive": {
    "primaries": {
    "docs": {
    "count": 1378271,
    "deleted": 637080
    },
    "store": {
    "size_in_bytes": 20084703077,
    "throttle_time_in_millis": 0
    },
    "indexing": {
    "index_total": 9691855,
    "index_time_in_millis": 20032575,
    "index_current": 0,
    "index_failed": 0,
    "delete_total": 1601911,
    "delete_time_in_millis": 102978,
    "delete_current": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0
    },
    "get": {
    "total": 0,
    "time_in_millis": 0,
    "exists_total": 0,
    "exists_time_in_millis": 0,
    "missing_total": 0,
    "missing_time_in_millis": 0,
    "current": 0
    },
    "search": {
    "open_contexts": 0,
    "query_total": 157221949,
    "query_time_in_millis": 94346276,
    "query_current": 0,
    "fetch_total": 51648965,
    "fetch_time_in_millis": 30139609,
    "fetch_current": 0,
    "scroll_total": 8284560,
    "scroll_time_in_millis": 7317305,
    "scroll_current": 0,
    "suggest_total": 19167516,
    "suggest_time_in_millis": 18501423,
    "suggest_current": 0
    },
    "merges": {
    "current": 0,
    "current_docs": 0,
    "current_size_in_bytes": 0,
    "total": 42711,
    "total_time_in_millis": 42241303,
    "total_docs": 108222518,
    "total_size_in_bytes": 297588814453,
    "total_stopped_time_in_millis": 0,
    "total_throttled_time_in_millis": 15174382,
    "total_auto_throttle_in_bytes": 26214400
    },
    "refresh": {
    "total": 390096,
    "total_time_in_millis": 46504445,
    "listeners": 0
    },
    "flush": {
    "total": 2022,
    "total_time_in_millis": 444978
    },
    "warmer": {
    "current": 0,
    "total": 391854,
    "total_time_in_millis": 174148
    },
    "query_cache": {
    "memory_size_in_bytes": 238823384,
    "total_count": 1615358046,
    "hit_count": 267504494,
    "miss_count": 1347853552,
    "cache_size": 74725,
    "cache_count": 3414241,
    "evictions": 3339516
    },
    "fielddata": {
    "memory_size_in_bytes": 0,
    "evictions": 0
    },
    "completion": {
    "size_in_bytes": 102963592
    },
    "segments": {
    "count": 137,
    "memory_in_bytes": 118387711,
    "terms_memory_in_bytes": 112327761,
    "stored_fields_memory_in_bytes": 623184,
    "term_vectors_memory_in_bytes": 0,
    "norms_memory_in_bytes": 368832,
    "points_memory_in_bytes": 1301218,
    "doc_values_memory_in_bytes": 3766716,
    "index_writer_memory_in_bytes": 0,
    "version_map_memory_in_bytes": 99111,
    "fixed_bit_set_memory_in_bytes": 257968,
    "max_unsafe_auto_id_timestamp": -1,
    "file_sizes": {}
    },
    "translog": {
    "operations": 103654,
    "size_in_bytes": 380121790
    },
    "request_cache": {
    "memory_size_in_bytes": 0,
    "evictions": 0,
    "hit_count": 30359,
    "miss_count": 8242
    },
    "recovery": {
    "current_as_source": 0,
    "current_as_target": 0,
    "throttle_time_in_millis": 0
    }
    },
    "total": {
    "docs": {
    "count": 1378271,
    "deleted": 637080
    },
    "store": {
    "size_in_bytes": 20084703077,
    "throttle_time_in_millis": 0
    },
    "indexing": {
    "index_total": 9691855,
    "index_time_in_millis": 20032575,
    "index_current": 0,
    "index_failed": 0,
    "delete_total": 1601911,
    "delete_time_in_millis": 102978,
    "delete_current": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0
    },
    "get": {
    "total": 0,
    "time_in_millis": 0,
    "exists_total": 0,
    "exists_time_in_millis": 0,
    "missing_total": 0,
    "missing_time_in_millis": 0,
    "current": 0
    },
    "search": {
    "open_contexts": 0,
    "query_total": 157221949,
    "query_time_in_millis": 94346276,
    "query_current": 0,
    "fetch_total": 51648965,
    "fetch_time_in_millis": 30139609,
    "fetch_current": 0,
    "scroll_total": 8284560,
    "scroll_time_in_millis": 7317305,
    "scroll_current": 0,
    "suggest_total": 19167516,
    "suggest_time_in_millis": 18501423,
    "suggest_current": 0
    },
    "merges": {
    "current": 0,
    "current_docs": 0,
    "current_size_in_bytes": 0,
    "total": 42711,
    "total_time_in_millis": 42241303,
    "total_docs": 108222518,
    "total_size_in_bytes": 297588814453,
    "total_stopped_time_in_millis": 0,
    "total_throttled_time_in_millis": 15174382,
    "total_auto_throttle_in_bytes": 26214400
    },
    "refresh": {
    "total": 390096,
    "total_time_in_millis": 46504445,
    "listeners": 0
    },
    "flush": {
    "total": 2022,
    "total_time_in_millis": 444978
    },
    "warmer": {
    "current": 0,
    "total": 391854,
    "total_time_in_millis": 174148
    },
    "query_cache": {
    "memory_size_in_bytes": 238823384,
    "total_count": 1615358046,
    "hit_count": 267504494,
    "miss_count": 1347853552,
    "cache_size": 74725,
    "cache_count": 3414241,
    "evictions": 3339516
    },
    "fielddata": {
    "memory_size_in_bytes": 0,
    "evictions": 0
    },
    "completion": {
    "size_in_bytes": 102963592
    },
    "segments": {
    "count": 137,
    "memory_in_bytes": 118387711,
    "terms_memory_in_bytes": 112327761,
    "stored_fields_memory_in_bytes": 623184,
    "term_vectors_memory_in_bytes": 0,
    "norms_memory_in_bytes": 368832,
    "points_memory_in_bytes": 1301218,
    "doc_values_memory_in_bytes": 3766716,
    "index_writer_memory_in_bytes": 0,
    "version_map_memory_in_bytes": 99111,
    "fixed_bit_set_memory_in_bytes": 257968,
    "max_unsafe_auto_id_timestamp": -1,
    "file_sizes": {}
    },
    "translog": {
    "operations": 103654,
    "size_in_bytes": 380121790
    },
    "request_cache": {
    "memory_size_in_bytes": 0,
    "evictions": 0,
    "hit_count": 30359,
    "miss_count": 8242
    },
    "recovery": {
    "current_as_source": 0,
    "current_as_target": 0,
    "throttle_time_in_millis": 0
    }
    }
    }
  • most frequent operation
    Search query
  • any suspect query?
    None
  • is your cluster suffering from out-of-memory errors?
    Our production server uses SSD drives and has 120 GB of memory, of which we allocated 30 GB to Elasticsearch. A sketch for checking actual heap usage follows this list.
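
For completeness, this is a minimal sketch of how we could verify whether the node is under heap pressure, assuming the node is reachable at http://localhost:9200 and using Python's requests library:

```python
import requests

ES = "http://localhost:9200"  # assumption: node address

# JVM heap usage per node; values that stay close to 100% would suggest memory pressure.
stats = requests.get(ES + "/_nodes/stats/jvm").json()
for node in stats["nodes"].values():
    mem = node["jvm"]["mem"]
    print(node["name"], str(mem["heap_used_percent"]) + "% of", mem["heap_max_in_bytes"], "bytes")
```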

Yes, I did follow the production guides. We never had this issue when using Elasticsearch 1.7.1; after upgrading to 5.3, I started seeing it.
Please let me know if you need more information.

Thanks

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.