Kibana 5.6.9 Advanced Node Tab Throwing a 503


(Struve) #1

My problem is pretty similar to this one. We have a dedicated monitoring cluster with 2 nodes that we upgraded from 5.4 to 5.6.9 a month ago. Recently, we have not been able to load the "Advanced" tab for any individual node. Every time we try, we get a 503 search_phase_execution_exception. After looking into the error logs, I found that this is because we are blowing through the search thread pool queue limit.

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.action.search.FetchSearchPhase$1@6fed4e8b on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@39018d64[Running, pool size = 7, active threads = 7, queued tasks = 1003, completed tasks = 111115001]]
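The numbers embedded in that rejection message explain the 503 on their own: the search queue capacity is 1000, all 7 pool threads are active, and 1003 tasks are queued, so new search tasks get rejected. A small sketch (the helper name is mine, stdlib only) that pulls those figures out of such a log line:

```python
import re

def parse_rejection(log_line):
    """Extract thread-pool stats from an EsRejectedExecutionException message."""
    pairs = re.findall(
        r"(queue capacity|pool size|active threads|queued tasks)\s*=\s*(\d+)",
        log_line,
    )
    return {key: int(value) for key, value in pairs}

line = ("rejected execution of ... on EsThreadPoolExecutor[search, "
        "queue capacity = 1000, ...[Running, pool size = 7, active threads = 7, "
        "queued tasks = 1003, completed tasks = 111115001]]")

stats = parse_rejection(line)
# The rejection fires because queued tasks exceed the queue capacity:
assert stats["queued tasks"] > stats["queue capacity"]
```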

This monitoring cluster is not heavily used, and when I pull up its search stats you can see it really is not doing anything until we try to load one of the monitoring pages. Both nodes' search stats look similar to this one.

Other Info:

  • Timeframe I am looking at is 1 hour
  • All indexes on the cluster are green
  • We have tried to close old indexes in an effort to solve this issue but it has not worked. The cluster currently has 280 open .monitoring-es and .monitoring-kibana indexes

Unlike our main cluster, where I control how we handle searching, I don't know much about how Kibana executes its searches, so I am not sure of the best way to go about fixing this.

Thanks in advance for any help!

Molly


(Tim Sullivan) #2

Hi, Molly,

Sorry to hear your team is having this frustrating issue.

You can take a look at the real-time search queue size in Kibana by making a line chart on the monitoring data:

  • Y-Axis: Max
  • Field: node_stats.thread_pool.search.queue
  • X-Axis: Date histogram
  • Field: timestamp
  • Interval: Custom / 10s
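For reference, the same numbers can be pulled outside Kibana with a plain Elasticsearch aggregation against the monitoring indices. This is a sketch of the request body only; the index pattern and endpoint you POST it to are assumptions about a standard X-Pack monitoring setup:

```python
import json

# Aggregation mirroring the visualization above: the max search-queue
# size per 10-second bucket, taken from the monitoring documents.
body = {
    "size": 0,
    "aggs": {
        "per_10s": {
            "date_histogram": {"field": "timestamp", "interval": "10s"},
            "aggs": {
                "search_queue": {
                    "max": {"field": "node_stats.thread_pool.search.queue"}
                }
            }
        }
    }
}

# POST this body to /.monitoring-es-*/_search on the monitoring cluster.
print(json.dumps(body, indent=2))
```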

That will give you a chart that looks kind of like this:

As you can see, I tried to put some search load on my cluster, and I did that by clicking the pause/refresh button in the Advanced Node page repeatedly.

If you close all other browser pages searching against the monitoring data and watch that chart for a while, you should be able to see the queue go down. I would wait quite a bit and see if you can get it down close to zero. Once it is, you should be able to open the Advanced Node page in the monitoring application.


(Struve) #3

Thanks for the response @tsullivan!

My visualization looks very similar to yours. It never really increases past two, and when I try to load the advanced node page, the page throws a 503 with no change in this graph. If I instead chart the max of node_stats.thread_pool.search.rejected, it is a flat line, which does not seem right given the error I am seeing.


(Struve) #4

We seem to have fixed the problem by closing all our indexes up until March 1st. My guess is that the queuing has to do with the number of indexes it tries to search for each request. I noticed in the logs that it does not limit the indexes by date, despite the time window requested for a search.
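A sketch of the date-based selection logic described above, assuming the standard daily monitoring index naming (a `YYYY.MM.DD` suffix, e.g. `.monitoring-es-6-2018.02.15`); the index names and cutoff date here are illustrative:

```python
from datetime import date, datetime

def indices_to_close(index_names, cutoff):
    """Return daily monitoring indices dated strictly before `cutoff`.

    Assumes each name ends in a YYYY.MM.DD date after the last hyphen;
    names without a parseable date suffix are left open.
    """
    to_close = []
    for name in index_names:
        suffix = name.rsplit("-", 1)[-1]
        try:
            day = datetime.strptime(suffix, "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index, skip it
        if day < cutoff:
            to_close.append(name)
    return to_close

names = [".monitoring-es-6-2018.02.15",
         ".monitoring-es-6-2018.03.02",
         ".monitoring-kibana-6-2018.02.28"]
print(indices_to_close(names, date(2018, 3, 1)))
```

Each returned index can then be closed with `POST /<index>/_close`. Depending on your version, the `xpack.monitoring.history.duration` setting may also prune old monitoring indices automatically, avoiding the manual cleanup.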


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.