ES 5.1.1 node stuck in endless loop halting the whole cluster

Hi everybody,

We are seeing cluster hangs caused by single nodes being stuck in a 100% CPU thread loop. The offending node has timeouts for various collectors, yet it is not removed from the cluster, and all nodes start logging failed search requests like crazy.

The only way to fix the cluster is to restart the offending node. Any idea how we could avoid this issue?
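Before restarting, something like the following could be used to grab a hot threads dump from the spinning node, so there is at least a record of which thread is looping (host and port are just placeholders, and this assumes the node's HTTP port still responds and no authentication is in place):

# grab a hot threads dump from the node that is stuck at 100% CPU
import urllib.request

NODE = "http://192.168.16.4:9200"   # placeholder: address of the stuck node

# /_nodes/_local/hot_threads returns plain text describing the busiest threads
# on the node that receives the request
with urllib.request.urlopen(NODE + "/_nodes/_local/hot_threads?threads=5", timeout=30) as resp:
    print(resp.read().decode("utf-8"))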

I've attached the log of the master node leading up to the issue (pastebin).

And one specimen of the log spam after the cluster hangs:

[2017-01-12T16:44:28,205][DEBUG][o.e.a.s.TransportSearchAction] [es-big-03] [22889914] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [es-big-14][192.168.16.4:9300][indices:data/read/search[phase/fetch/id]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [22889914]
at org.elasticsearch.search.SearchService.findContext(SearchService.java:523) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:490) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.search.SearchTransportService$13.messageReceived(SearchTransportService.java:356) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.search.SearchTransportService$13.messageReceived(SearchTransportService.java:353) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1385) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

Those timeouts don't really tell us much about why this happened.
What is in the logs on the node?

Sorry for the delayed reply; it's a little hectic over here with the cluster instability. I haven't waded through the 9 GB of logs on each node, but I upgraded the cluster to ES 5.1.2 hoping it would fix the situation.

We are still seeing the issue: the heap slowly fills up to 99%, each GC run collects less memory, and then the nodes start looping at 100% CPU.
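Something along these lines shows the per-node heap usage and old-gen GC totals we are watching (the host is just an example of any reachable node):

# per-node heap usage and old-gen GC totals, to watch the trend over time
import json
import urllib.request

CLUSTER = "http://localhost:9200"   # placeholder: any reachable node

with urllib.request.urlopen(CLUSTER + "/_nodes/stats/jvm", timeout=30) as resp:
    stats = json.loads(resp.read().decode("utf-8"))

for node in stats["nodes"].values():
    jvm = node["jvm"]
    old_gc = jvm["gc"]["collectors"]["old"]
    print(f'{node["name"]}: heap {jvm["mem"]["heap_used_percent"]}%, '
          f'old-gen GCs {old_gc["collection_count"]} '
          f'({old_gc["collection_time_in_millis"]} ms total)')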

Maybe I'm missing a config setting that results in >100% heap consumption, although I would expect an OOM node stop in that case. Our node settings:

# set various heap sizes
indices.memory.index_buffer_size: "4%"
indices.queries.cache.size: "17%"
indices.fielddata.cache.size: "58%"
indices.breaker.fielddata.limit: "61%"

# increase bulk limit
thread_pool.bulk.queue_size: 100000
thread_pool.search.queue_size: 100000
thread_pool.index.queue_size: 100000
thread_pool.get.queue_size: 100000

# higher defaults for http settings
http.max_content_length: "250mb"
http.max_initial_line_length: "16kb"
http.max_header_size: "16kb"

# allow external traffic compression
http.compression: true
http.compression_level: 1
# compress internal traffic
transport.tcp.compress: true

# set threads manually (as we can't unset cluster settings)
processors: 20
thread_pool.bulk.size: 20
thread_pool.search.size: 31
thread_pool.index.size: 20


# raise max shard query limit as we have somewhat more than 1000 shards (defaults to 1000)
action.search.shard_count.limit: 2000
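Just adding up the cache-related caps above (and assuming they can all fill completely), the index buffer, query cache and fielddata cache alone could pin a large share of the heap:

# rough back-of-the-envelope: how much heap the configured caps alone can pin
caps = {
    "indices.memory.index_buffer_size": 4,  # % of heap
    "indices.queries.cache.size": 17,       # % of heap
    "indices.fielddata.cache.size": 58,     # % of heap
}
total = sum(caps.values())
print(f"caches/buffers may hold up to {total}% of the heap")   # 79%
print(f"leaving roughly {100 - total}% for everything else "
      "(segments, aggregations, queued requests, ...)")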

Those queue sizes are way too large. What do you think is going to happen if these queues fill up?

For a better understanding of why this is hurting you, please read my previous thoughts on this subject.
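To see what those queues are actually doing, something like this dumps per-node queue depth and rejection counts (assuming a node reachable on localhost:9200 without auth):

# dump per-node thread pool activity, queue depth and rejections
import urllib.request

URL = ("http://localhost:9200/_cat/thread_pool"
       "?v&h=node_name,name,active,queue,rejected,completed")

with urllib.request.urlopen(URL, timeout=30) as resp:
    print(resp.read().decode("utf-8"))

If the rejected counter climbs, or the queue regularly sits in the tens of thousands, every one of those queued requests is a live object on the heap.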


We have quite a number of workers; the queue size is just a safeguard to handle temporary spikes when a lot of workers start up at the same time (e.g. 1000 hourly workers starting to consume fresh hourly tasks at the top of the hour).

We hardly see any queue usage at all: just a few spikes here and there between 2k and 20k for search requests, and mostly <1k for bulk requests.

In the meantime we resort to search-and-destroy scripts that hunt down nodes above a heap usage threshold.
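Roughly, the idea is just to poll per-node heap usage and flag anything above a threshold; a minimal sketch (not the actual script, and the host and threshold are placeholders; the real thing then restarts the flagged node):

# minimal sketch of the heap watchdog: list nodes above a heap threshold
import json
import urllib.request

CLUSTER = "http://localhost:9200"   # placeholder: any reachable node
HEAP_LIMIT = 90                     # percent; example threshold

# /_nodes/stats/jvm reports heap_used_percent per node
with urllib.request.urlopen(CLUSTER + "/_nodes/stats/jvm", timeout=30) as resp:
    stats = json.loads(resp.read().decode("utf-8"))

for node in stats["nodes"].values():
    heap = node["jvm"]["mem"]["heap_used_percent"]
    if heap >= HEAP_LIMIT:
        print(f'{node["name"]}: heap at {heap}% - candidate for restart')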

Here are the monitoring stats of one of our misbehaving nodes:

I've disabled compression for internal and external traffic to rule out a compression memory leak, and will also test a lowered query cache size tonight.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.