ES 5.1.1 node stuck in endless loop halting the whole cluster

Hi everybody,

We are seeing cluster hangs caused by single nodes being stuck in a 100% CPU thread loop. The offending node has timeouts for various collectors, yet it is not removed from the cluster, and all nodes start logging failed search requests like crazy.

The only way to fix the cluster is to restart the offending node. Any idea how we could avoid this issue?
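Before restarting, something like the following could be used to grab a hot threads dump from the spinning node, so there is at least a record of which thread is looping (host and port are just placeholders, and this assumes the node's HTTP port still responds and no authentication is in place):

# grab a hot threads dump from the node that is stuck at 100% CPU
import urllib.request

NODE = "http://192.168.16.4:9200"   # placeholder: address of the stuck node

# /_nodes/_local/hot_threads returns plain text describing the busiest threads
# on the node that receives the request
with urllib.request.urlopen(NODE + "/_nodes/_local/hot_threads?threads=5", timeout=30) as resp:
    print(resp.read().decode("utf-8"))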

I've attached the log of the master node leading up to the issue (pastebin).

And one specimen of the log spam after the cluster hangs:

[2017-01-12T16:44:28,205][DEBUG][o.e.a.s.TransportSearchAction] [es-big-03] [22889914] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [es-big-14][192.168.16.4:9300][indices:data/read/search[phase/fetch/id]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [22889914]
at org.elasticsearch.search.SearchService.findContext(SearchService.java:523) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:490) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.search.SearchTransportService$13.messageReceived(SearchTransportService.java:356) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.search.SearchTransportService$13.messageReceived(SearchTransportService.java:353) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1385) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

Those timeouts don't really tell us much about why this happened.
What is in the logs on the node?

Sorry for the delayed reply; it's a little hectic over here with the cluster instability. I haven't waded through the 9 GB of logs on each node, but I upgraded the cluster to ES 5.1.2 hoping it would fix the situation.

We are still seeing the issue: the heap slowly fills up to 99%, each GC run collects less memory, and then the nodes start looping at 100% CPU.
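Something along these lines shows the per-node heap usage and old-gen GC totals we are watching (the host is just an example of any reachable node):

# per-node heap usage and old-gen GC totals, to watch the trend over time
import json
import urllib.request

CLUSTER = "http://localhost:9200"   # placeholder: any reachable node

with urllib.request.urlopen(CLUSTER + "/_nodes/stats/jvm", timeout=30) as resp:
    stats = json.loads(resp.read().decode("utf-8"))

for node in stats["nodes"].values():
    jvm = node["jvm"]
    old_gc = jvm["gc"]["collectors"]["old"]
    print(f'{node["name"]}: heap {jvm["mem"]["heap_used_percent"]}%, '
          f'old-gen GCs {old_gc["collection_count"]} '
          f'({old_gc["collection_time_in_millis"]} ms total)')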

Maybe I'm missing a config setting that results in >100% heap consumption, although I would expect an OOM node stop in that case. Our node settings:

# set various heap sizes
indices.memory.index_buffer_size: "4%"
indices.queries.cache.size: "17%"
indices.fielddata.cache.size: "58%"
indices.breaker.fielddata.limit: "61%"

# increase bulk limit
thread_pool.bulk.queue_size: 100000
thread_pool.search.queue_size: 100000
thread_pool.index.queue_size: 100000
thread_pool.get.queue_size: 100000

# higher defaults for http settings
http.max_content_length: "250mb"
http.max_initial_line_length: "16kb"
http.max_header_size: "16kb"

# allow external traffic compression
http.compression: true
http.compression_level: 1
# compress internal traffic
transport.tcp.compress: true

# set threads manually (as we can't unset cluster settings)
processors: 20
thread_pool.bulk.size: 20
thread_pool.search.size: 31
thread_pool.index.size: 20


# raise max shard query limit as we have somewhat more than 1000 shards (defaults to 1000)
action.search.shard_count.limit: 2000
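Just adding up the cache-related caps above (and assuming they can all fill completely), the index buffer, query cache and fielddata cache alone could pin a large share of the heap:

# rough back-of-the-envelope: how much heap the configured caps alone can pin
caps = {
    "indices.memory.index_buffer_size": 4,  # % of heap
    "indices.queries.cache.size": 17,       # % of heap
    "indices.fielddata.cache.size": 58,     # % of heap
}
total = sum(caps.values())
print(f"caches/buffers may hold up to {total}% of the heap")   # 79%
print(f"leaving roughly {100 - total}% for everything else "
      "(segments, aggregations, queued requests, ...)")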

Those queue sizes are way too large. What do you think is going to happen if these queues fill up?

For a better understanding of why this is hurting you, please read my previous thoughts on this subject.
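To see what those queues are actually doing, something like this dumps per-node queue depth and rejection counts (assuming a node reachable on localhost:9200 without auth):

# dump per-node thread pool activity, queue depth and rejections
import urllib.request

URL = ("http://localhost:9200/_cat/thread_pool"
       "?v&h=node_name,name,active,queue,rejected,completed")

with urllib.request.urlopen(URL, timeout=30) as resp:
    print(resp.read().decode("utf-8"))

If the rejected counter climbs, or the queue regularly sits in the tens of thousands, every one of those queued requests is a live object on the heap.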


We have quite a number of workers; the queue size is just a safeguard to handle temporary spikes when a lot of workers start up at the same time (e.g. 1000 hourly workers starting to consume fresh hourly tasks at the top of the hour).

We hardly see any queue usage at all: just a few spikes here and there between 2k and 20k for search requests, and mostly <1k for bulk requests.

In the meantime we resort to search-and-destroy scripts that hunt down nodes above a heap usage threshold.
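Roughly, the idea is just to poll per-node heap usage and flag anything above a threshold; a minimal sketch (not the actual script, and the host and threshold are placeholders; the real thing then restarts the flagged node):

# minimal sketch of the heap watchdog: list nodes above a heap threshold
import json
import urllib.request

CLUSTER = "http://localhost:9200"   # placeholder: any reachable node
HEAP_LIMIT = 90                     # percent; example threshold

# /_nodes/stats/jvm reports heap_used_percent per node
with urllib.request.urlopen(CLUSTER + "/_nodes/stats/jvm", timeout=30) as resp:
    stats = json.loads(resp.read().decode("utf-8"))

for node in stats["nodes"].values():
    heap = node["jvm"]["mem"]["heap_used_percent"]
    if heap >= HEAP_LIMIT:
        print(f'{node["name"]}: heap at {heap}% - candidate for restart')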

Here are the monitoring stats of one of our misbehaving nodes:

I've disabled compression for internal and external traffic to rule out a compression memory leak, and will also test a lowered query cache size tonight.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.