Currently running tasks keep increasing on master node

Elasticsearch version is 2.3.3. The cluster contains only one data node and 3 client nodes.
Within 3.5 hours of starting the Elasticsearch service, the task count reported by http://localhost:9200/_tasks?nodes=<master-node-name> is 175,570. The task count only keeps growing, and eventually Elasticsearch runs out of heap space and throws an OutOfMemoryError.

Most of the tasks belong to the following four action types (see the sketch after this list for how I pulled these counts):

  1. internal:discovery/zen/fd/master_ping (task count = 38,391)
  2. internal:discovery/zen/fd/ping (task count = 37,193)
  3. indices:data/read/search[phase/scan] (task count = 4,053)
  4. indices:data/read/search[phase/scan/scroll] (task count = 55,198)
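
For reference, the per-action breakdown above came from calls along these lines; as far as I know the 2.3 task management API supports an actions filter, but treat the exact parameter names as an assumption on my part:

# list (and roughly count) the fault-detection ping tasks on the master node
curl -s 'http://localhost:9200/_tasks?nodes=<master-node-name>&actions=internal:discovery/zen/fd/*&pretty' | grep -c '"action"'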

After enabling logging for all queries, the total number of queries logged by the index-search-slowlog is 152,663 (in 3.5 hours). All of these are small queries, and none takes more than 1 ms.
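
By "enabling all query logging" I mean dropping the slowlog thresholds to zero, roughly like this (the trace level is an arbitrary choice on my part; these are standard 2.x index settings):

# log every query and fetch phase by setting the trace thresholds to 0ms on all indices
curl -XPUT 'http://localhost:9200/_all/_settings' -d '{
  "index.search.slowlog.threshold.query.trace": "0ms",
  "index.search.slowlog.threshold.fetch.trace": "0ms"
}'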

From analysing the thread pools, it seems the bulk, index, refresh and search thread pools are fully utilized (active threads = max size).
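
This is the kind of check I ran; the _cat/thread_pool column names below are from memory and may differ slightly in 2.3:

# show active/queue/rejected counts for the busiest thread pools
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected,index.active,index.queue,search.active,search.queue,refresh.active'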

The current max heap size is 4 GB. When I analysed a heap dump taken at 90% heap usage, most of the heap (80%) was occupied by org.elasticsearch.tasks.Task instances and their supporting objects.
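
The heap dump itself was taken with standard JDK tooling, along these lines (ES_PID is just a placeholder for the Elasticsearch process id), and then opened in a heap analyser such as Eclipse MAT:

# dump only live objects from the running Elasticsearch JVM
jmap -dump:live,format=b,file=es-heap.hprof $ES_PID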

You have a crazy amount of scroll searches going on. Is every search you execute a scroll search? If so, maybe you should change those to regular searches. Any other special configuration? How many indices, how many shards, etc.?
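
For what it's worth, the difference boils down to something like this (the index name and query below are just placeholders):

# scroll search: opens a scroll context on the data node that has to be iterated
# with /_search/scroll and cleaned up (or left to expire)
curl -XPOST 'http://localhost:9200/my-index/_search?scroll=1m' -d '{"size": 50, "query": {"match_all": {}}}'

# regular search: a single request/response, no scroll context to keep around
curl -XPOST 'http://localhost:9200/my-index/_search' -d '{"size": 50, "query": {"match_all": {}}}'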

Thanks Alexander. We are working on converting at least half of those scroll queries (which return fewer than 50 hits in total) to regular hits queries. Please see the shards configuration below:

{
  "_shards" : {
    "total" : 1649,
    "successful" : 846,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "docs" : {
        "count" : 43871088,
        "deleted" : 101
      }

The total number of indices is 204. At least half of our indices have 5 primary shards and 1 replica. We recently modified the index template to use only 1 primary shard and 0 replicas.
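
The template change was roughly the following (the template name and index pattern are made up; in 2.x the pattern goes into the "template" field):

curl -XPUT 'http://localhost:9200/_template/one_shard' -d '{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'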

Please help me with the following two questions:

  1. Why are there so many discovery/zen/fd/ping tasks? Are these supposed to be long-running tasks?
  2. Also, is it expected behaviour for tasks to pile up like this? I thought each task type has its own queue (with almost all queues having a size of less than 1000), and once those queues are full, any further tasks would be rejected.

Things got even weirder after replacing our scroll queries with hits queries. Just a few minutes after starting up the Elasticsearch service, the task list results are now as follows:

Total tasks = 235,869

  1. internal:discovery/zen/fd/ping = 167,873
  2. indices:* = 1,414
  3. cluster:monitor/nodes/info = 63,482

Running hot threads on the master node with ignore_idle=false, interval=5s and threads=200 also gives interesting output (the exact call is sketched after this list):

  1. Threads under state "EPollArrayWrapper.epollWait" = 99
  2. Threads under state "Unsafe.park" = 89
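
The hot threads call referred to above was along these lines:

curl -s 'http://localhost:9200/_nodes/<master-node-name>/hot_threads?ignore_idle=false&interval=5s&threads=200'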

Figured it out. The problem was with the search-guard-ssl plugin. I was using version 2.3.3.15, and this is apparently a bug in that version. I tried the latest version, 2.3.3.21, and I no longer see any outstanding tasks (except the "task list" task itself).

Basically, search-guard-ssl 2.3.3.15 was not calling the unregisterTask() method on tasks after consuming them, so the tasks remained in the task list and were shown as still running.
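
After upgrading, re-running the task list is an easy way to verify the fix; the only entry left should be the list request itself:

curl -s 'http://localhost:9200/_tasks?pretty'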
