Currently running tasks keep increasing on master node

Elasticsearch version is 2.3.3. The cluster contains only one data node and 3 client nodes.
Within 3.5 hours of starting the Elasticsearch service, the task count reported by http://localhost:9200/_tasks?nodes=<master-node-name> is 175,570. The task count only keeps growing, and eventually Elasticsearch runs out of heap space and throws an OutOfMemoryError.

Most of the tasks belong to the following four action types (see the sketch after this list for how I pulled these counts):

  1. internal:discovery/zen/fd/master_ping (task count = 38,391)
  2. internal:discovery/zen/fd/ping (task count = 37,193)
  3. indices:data/read/search[phase/scan] (task count = 4,053)
  4. indices:data/read/search[phase/scan/scroll] (task count = 55,198)
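
For reference, the per-action breakdown above came from calls along these lines; as far as I know the 2.3 task management API supports an actions filter, but treat the exact parameter names as an assumption on my part:

# list (and roughly count) the fault-detection ping tasks on the master node
curl -s 'http://localhost:9200/_tasks?nodes=<master-node-name>&actions=internal:discovery/zen/fd/*&pretty' | grep -c '"action"'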

After enabling logging for all queries, the total number of queries logged by the index-search-slowlog is 152,663 (in 3.5 hours). All of these are small queries, and none takes more than 1 ms.
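
By "enabling all query logging" I mean dropping the slowlog thresholds to zero, roughly like this (the trace level is an arbitrary choice on my part; these are standard 2.x index settings):

# log every query and fetch phase by setting the trace thresholds to 0ms on all indices
curl -XPUT 'http://localhost:9200/_all/_settings' -d '{
  "index.search.slowlog.threshold.query.trace": "0ms",
  "index.search.slowlog.threshold.fetch.trace": "0ms"
}'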

From analysing the thread pools, it seems the bulk, index, refresh and search thread pools are fully utilized (active threads = max size).
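
This is the kind of check I ran; the _cat/thread_pool column names below are from memory and may differ slightly in 2.3:

# show active/queue/rejected counts for the busiest thread pools
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected,index.active,index.queue,search.active,search.queue,refresh.active'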

The current max heap size is 4 GB. When I analysed a heap dump taken at 90% heap usage, most of the heap (80%) was occupied by org.elasticsearch.tasks.Task instances and their supporting objects.
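
The heap dump itself was taken with standard JDK tooling, along these lines (ES_PID is just a placeholder for the Elasticsearch process id), and then opened in a heap analyser such as Eclipse MAT:

# dump only live objects from the running Elasticsearch JVM
jmap -dump:live,format=b,file=es-heap.hprof $ES_PID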

You have a crazy amount of scroll searches going on. Is every search you execute a scroll search? If so, maybe you should change those to regular searches. Any other special configuration? How many indices, how many shards, etc.?
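
For what it's worth, the difference boils down to something like this (the index name and query below are just placeholders):

# scroll search: opens a scroll context on the data node that has to be iterated
# with /_search/scroll and cleaned up (or left to expire)
curl -XPOST 'http://localhost:9200/my-index/_search?scroll=1m' -d '{"size": 50, "query": {"match_all": {}}}'

# regular search: a single request/response, no scroll context to keep around
curl -XPOST 'http://localhost:9200/my-index/_search' -d '{"size": 50, "query": {"match_all": {}}}'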

Thanks Alexander. We are working on converting at least half of those scroll queries (which return fewer than 50 hits in total) to regular hits queries. Please see the shards configuration below:

{
  "_shards" : {
    "total" : 1649,
    "successful" : 846,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "docs" : {
        "count" : 43871088,
        "deleted" : 101
      }

The total number of indices is 204. At least half of our indices have 5 primary shards and 1 replica. We recently modified the index template to use only 1 primary shard and 0 replicas.
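
The template change was roughly the following (the template name and index pattern are made up; in 2.x the pattern goes into the "template" field):

curl -XPUT 'http://localhost:9200/_template/one_shard' -d '{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'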

Please help me with the following two questions:

  1. Why are there so many discovery/zen/fd/ping tasks? Are these supposed to be long-running tasks?
  2. Also, is it expected behaviour for tasks to pile up like this? I thought each task type has its own queue (with almost all queues having a size of less than 1000), and once those queues are full, any further tasks would be rejected.

Things got even weirder after replacing our scroll queries with hits queries. Just a few minutes after starting up the Elasticsearch service, the task list results are now as follows:

Total tasks = 235,869

  1. internal:discovery/zen/fd/ping = 167,873
  2. indices:* = 1,414
  3. cluster:monitor/nodes/info = 63,482

Running hot threads on the master node with ignore_idle=false, interval=5s and threads=200 also gives interesting output (the exact call is sketched after this list):

  1. Threads under state "EPollArrayWrapper.epollWait" = 99
  2. Threads under state "Unsafe.park" = 89
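
The hot threads call referred to above was along these lines:

curl -s 'http://localhost:9200/_nodes/<master-node-name>/hot_threads?ignore_idle=false&interval=5s&threads=200'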

Figured it out. The problem was with the search-guard-ssl plugin. I was using version 2.3.3.15, and this is apparently a bug in that version. I tried the latest version, 2.3.3.21, and I no longer see any outstanding tasks (except the "task list" task itself).

Basically, search-guard-ssl 2.3.3.15 was not calling the unregisterTask() method on tasks after consuming them, so the tasks remained in the task list and were shown as still running.
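
After upgrading, re-running the task list is an easy way to verify the fix; the only entry left should be the list request itself:

curl -s 'http://localhost:9200/_tasks?pretty'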
