EsRejectedExecutionException when just refreshing Kibana

(Zt Zeng) #1

Recently Kibana has been failing to show any content. It either shows an empty page or gives us an error. Checking the Elasticsearch log, we found a very long EsRejectedExecutionException for the search queue:

rejected execution of org.elasticsearch.transport.TransportService$7@16130b66 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6c9b1b8d[Running, pool size = 25, active threads = 25, queued tasks = 1832, completed tasks = 200517]]
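For anyone comparing several of these messages, the counters can be pulled out of the text with a regular expression. This is only a minimal Python sketch: the names `parse_rejection` and `pool_name` are made up for illustration, and it assumes the message format is exactly the one shown above.

```python
import re

# The rejection message quoted above, split across lines for readability.
LOG_LINE = (
    "rejected execution of org.elasticsearch.transport.TransportService$7@16130b66 "
    "on EsThreadPoolExecutor[search, queue capacity = 1000, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6c9b1b8d"
    "[Running, pool size = 25, active threads = 25, queued tasks = 1832, "
    "completed tasks = 200517]]"
)

# Matches "<counter name> = <integer>" for the counters seen in the message.
FIELD_RE = re.compile(
    r"(queue capacity|pool size|active threads|queued tasks|completed tasks) = (\d+)"
)


def parse_rejection(line):
    """Return the numeric counters of a rejection message as a dict (hypothetical helper)."""
    return {name.replace(" ", "_"): int(value) for name, value in FIELD_RE.findall(line)}


def pool_name(line):
    """Return the thread pool name, e.g. 'search', or None if it is not found."""
    match = re.search(r"EsThreadPoolExecutor\[(\w+),", line)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(pool_name(LOG_LINE), parse_rejection(LOG_LINE))
```

In this message all 25 threads are busy and the queue is full, which is the moment at which new tasks get rejected.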

Following the suggestions from this post, we checked for slow queries, GC and I/O problems but found nothing.

    "gc" : {
      "collectors" : {
        "young" : {
          "collection_count" : 1662,
          "collection_time" : "53.1s",
          "collection_time_in_millis" : 53169
        },
        "old" : {
          "collection_count" : 1,
          "collection_time" : "229ms",
          "collection_time_in_millis" : 229

At the same time, querying the node stats gives us:

    "search" : {
      "threads" : 25,
      "queue" : 0,
      "active" : 0,
      "rejected" : 37364,
      "largest" : 25,
      "completed" : 204375

This seems to conflict with the log content: the number of queued tasks is 0?

By the way, just a simple refresh of Kibana or a change of filter causes a very long EsRejectedExecutionException.

(Bernt Rostad) #2

I've also had issues with Kibana slowing down and failing in a percolator cluster we're running, and from what you show it seems to be the same reason: too many tasks for the search queue (which has a capacity of 1000 by default), causing Elasticsearch to reject new tasks. Your node stats show you've already had 37,364 tasks rejected and 204,375 completed since the node last started, so on that node there is roughly one rejection for every 5 to 6 completed tasks (about 18% of the completed count, or 15% of all submitted tasks). Many of those tasks are likely to be from Kibana, which is why it struggles.
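The rejection share can be worked out directly from the two counters in the node stats. A minimal Python sketch, assuming that rejected tasks are not also counted under `completed` (the function name `rejection_rates` is just for illustration); it reports the ratio both against completed tasks and against all submitted tasks:

```python
def rejection_rates(rejected, completed):
    """Return rejection ratios for a thread pool.

    Assumption: `rejected` and `completed` are disjoint counters, so the
    number of submitted tasks is their sum.
    """
    submitted = rejected + completed
    return {
        "rejected_vs_completed": rejected / completed,
        "rejected_vs_submitted": rejected / submitted,
    }


if __name__ == "__main__":
    # Counters taken from the node stats posted above.
    rates = rejection_rates(rejected=37364, completed=204375)
    for name, value in rates.items():
        print(f"{name}: {value:.1%}")
```

Both counters are cumulative, so a more useful check is to sample them twice and compare the difference over that interval.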

A growing queue of tasks could indicate that your cluster is slowing down for some reason (perhaps over-sharding causing a huge cluster state?) or that the search volume is too high for the given number of nodes. The solution in my company was to add two more nodes to the cluster, reducing the number of tasks per node to well below the threshold and reducing the number of rejections to virtually zero. Our Kibana instance has behaved well since then.

(Zt Zeng) #3

Can you share what a proper shard count would be for an 8G, 32-core server node?

(Bernt Rostad) #4

I'm afraid there is no simple answer to that question; it all depends on the size and form of the documents in your indices. In general, though, it's better to have few primary shards per index rather than many, to reduce the cluster state size.

In our case we aim for shard sizes of 20-50 GB, so for an index of 10-30 GB we go for 1 primary shard, while for an index of 100 GB we opt for 3 (or perhaps 4 if we expect the index to grow) primary shards. Of course, sometimes it's hard to tell ahead of time, so if an index turns out larger or smaller than we expected we simply reindex it into a new index with more or fewer primary shards to get the proper shard size.
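That rule of thumb can be written down as a small calculation. This is only a sketch of the sizing idea above, not an official formula: the function name and the 35 GB default target (the midpoint of the 20-50 range) are assumptions chosen so that the examples above come out the same.

```python
import math


def primary_shards(index_size_gb, target_shard_gb=35.0):
    """Suggest a primary shard count so each shard is at most target_shard_gb.

    target_shard_gb=35 is an assumed midpoint of the 20-50 range mentioned
    above; adjust it to your own data and hardware.
    """
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


if __name__ == "__main__":
    for size_gb in (10, 30, 100):
        print(size_gb, "GB ->", primary_shards(size_gb), "primary shard(s)")
```

For an index that is expected to grow, pass the expected future size rather than the current one.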

Good luck!

(Mark Walkom) #5

What does _cat/nodes?v show?

(Zt Zeng) #6

Thanks for your time, I solved my problem. The cause was that the log files were rolling over too often, which created too many indices and overwhelmed the search queue.
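A rough way to see why many small indices hurt: a search is executed once per shard it touches, so the number of shard-level tasks from a single Kibana refresh grows with the number of indices. The Python sketch below is a back-of-the-envelope estimate only. It assumes one search per dashboard panel, that every search hits every matching index, and that a single node holds all the shards; the panel, index and shard counts are hypothetical, while the 25 threads and queue capacity of 1000 come from the log message above.

```python
def shard_tasks_per_refresh(panels, indices, primary_shards_per_index):
    """Estimate shard-level search tasks generated by one dashboard refresh.

    Assumes one search per panel and that each search touches one copy of
    every shard in every matching index.
    """
    return panels * indices * primary_shards_per_index


def fits_search_pool(tasks, threads=25, queue_capacity=1000):
    """True if the tasks could all be running or queued at once on one node."""
    return tasks <= threads + queue_capacity


if __name__ == "__main__":
    # Hypothetical dashboard: 10 panels over 60 small indices with 5 shards each.
    many = shard_tasks_per_refresh(panels=10, indices=60, primary_shards_per_index=5)
    # The same data rolled into 7 larger indices.
    few = shard_tasks_per_refresh(panels=10, indices=7, primary_shards_per_index=5)
    print(many, fits_search_pool(many))
    print(few, fits_search_pool(few))
```

With fewer, larger indices the same refresh stays well inside the pool, which matches what fixing the log rolling achieved here.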
