Elasticsearch cluster has millions of pending tasks

I have an Elasticsearch cluster, version 7.4.

There are a lot of pending tasks around 8 a.m. every day. Pending task counts by source:

639479 "source": "ilm-execute-cluster-state-steps",
82666 "source": "ilm-move-to-step",
186 "source": "cluster_reroute(reroute after starting shards)",
12 "source": "ilm-set-step-info",
1 "source": "update-settings",

My cluster info:

My health info:
epoch: 1620352457
timestamp: 01:54:17
cluster: sre-elasticsearch
status: green
node.total: 34
node.data: 30
shards: 18001
pri: 11216
relo: 0
init: 0
unassign: 0
pending_tasks: 610953
max_task_wait_time: 46.8m
active_shards_percent: 100.0%

My node info:
master nodes: 3
hot nodes: 10
warm nodes: 14
cold nodes: 6
client nodes: 1
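The hot/warm/cold split presumably relies on a custom node attribute (the ILM policy below requires box_type). Assuming that, node roles and attributes can be checked with:

# node names, roles and which node is the elected master
GET _cat/nodes?v&h=name,node.role,master
# custom attributes such as box_type, per node
GET _cat/nodeattrs?v&h=node,attr,value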

My ILM policy:
{
  "hotwarm-norollover-15days-for-hot-index" : {
    "version" : 18,
    "modified_date" : "2021-04-25T11:13:59.657Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "5d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 1,
              "include" : { },
              "exclude" : { },
              "require" : {
                "box_type" : "warm"
              }
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "cold" : {
          "min_age" : "9d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "box_type" : "cold"
              }
            },
            "freeze" : { }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "delete" : {
          "min_age" : "15d",
          "actions" : {
            "delete" : { }
          }
        }
      }
    }
  }
}
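To see which ILM step indices are sitting on while the queue builds up, the ILM status and explain APIs help (logs-* here is a hypothetical index pattern; substitute your own):

# whether ILM is running at all
GET _ilm/status
# current phase, action and step for each matching index
GET logs-*/_ilm/explain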

I have already been creating the indices ahead of time, at 1:30 a.m., but there are still many pending tasks, and I don't know what is causing them.
Can anyone tell me why? Thanks.
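For context, the pre-creation at 1:30 a.m. presumably amounts to something like the following for each of the next day's indices (the index name and settings here are made up):

# hypothetical daily index created ahead of time
PUT logs-catA-2021.05.08
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}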

Anyone know?

FYI 7.4 just reached EOL, you will want to upgrade ASAP.

That is not a good idea; it's a waste of resources. Let Elasticsearch create them as needed.
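A minimal sketch of that approach, assuming a legacy index template (the only template flavour in 7.4) and hypothetical names: with a template in place, each daily index is created with the right settings on its first write, so nothing needs to be pre-created.

# hypothetical template; any new logs-* index picks these settings up automatically
PUT _template/logs-daily
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.lifecycle.name": "hotwarm-norollover-15days-for-hot-index"
  }
}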

What is the output from the _cluster/stats?pretty&human API?

@warkolm Thanks for your response. Are you suggesting an upgrade to fix the problem?

GET _cluster/stats?pretty&human output (I got this at 9:20 a.m.; at 8:00 a.m. it may have looked a little different):

{
"_nodes" : {
"total" : 34,
"successful" : 34,
"failed" : 0
},
"cluster_name" : "sre-elasticsearch",
"cluster_uuid" : "1WLLkjixT4WGg1TJ8li2zQ",
"timestamp" : 1620609728494,
"status" : "green",
"indices" : {
"count" : 10676,
"shards" : {
"total" : 18279,
"primaries" : 11312,
"replication" : 0.6158946251768034,
"index" : {
"shards" : {
"min" : 1,
"max" : 16,
"avg" : 1.7121581116523041
},
"primaries" : {
"min" : 1,
"max" : 8,
"avg" : 1.0595728737354815
},
"replication" : {
"min" : 0.0,
"max" : 1.0,
"avg" : 0.6214874484825778
}
}
},
"docs" : {
"count" : 46945133375,
"deleted" : 1689084
},
"store" : {
"size" : "32.6tb",
"size_in_bytes" : 35905473914951
},
"fielddata" : {
"memory_size" : "4.2mb",
"memory_size_in_bytes" : 4406328,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "290.7mb",
"memory_size_in_bytes" : 304876793,
"total_count" : 23108719,
"hit_count" : 4070526,
"miss_count" : 19038193,
"cache_size" : 14305,
"cache_count" : 30540,
"evictions" : 16235
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 138490,
"memory" : "28.6gb",
"memory_in_bytes" : 30721990535,
"terms_memory" : "15gb",
"terms_memory_in_bytes" : 16195784675,
"stored_fields_memory" : "12.1gb",
"stored_fields_memory_in_bytes" : 13012623640,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "164.4mb",
"norms_memory_in_bytes" : 172430208,
"points_memory" : "1.1gb",
"points_memory_in_bytes" : 1268757264,
"doc_values_memory" : "69mb",
"doc_values_memory_in_bytes" : 72394748,
"index_writer_memory" : "2.4gb",
"index_writer_memory_in_bytes" : 2610843890,
"version_map_memory" : "5.1mb",
"version_map_memory_in_bytes" : 5376215,
"fixed_bit_set" : "99.1mb",
"fixed_bit_set_memory_in_bytes" : 103998912,
"max_unsafe_auto_id_timestamp" : 1620604808932,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 34,
"coordinating_only" : 0,
"data" : 30,
"ingest" : 30,
"master" : 3,
"ml" : 1,
"voting_only" : 0
},
"versions" : [
"7.4.0"
],
"os" : {
"available_processors" : 744,
"allocated_processors" : 744,
"names" : [
{
"name" : "Linux",
"count" : 34
}
],
"pretty_names" : [
{
"pretty_name" : "CentOS Linux 7 (Core)",
"count" : 34
}
],
"mem" : {
"total" : "1.4tb",
"total_in_bytes" : 1555088515072,
"free" : "133.2gb",
"free_in_bytes" : 143066071040,
"used" : "1.2tb",
"used_in_bytes" : 1412022444032,
"free_percent" : 9,
"used_percent" : 91
}
},
"process" : {
"cpu" : {
"percent" : 300
},
"open_file_descriptors" : {
"min" : 1472,
"max" : 11777,
"avg" : 6738
}
},
"jvm" : {
"max_uptime" : "102.9d",
"max_uptime_in_millis" : 8898528373,
"versions" : [
{
"version" : "13",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "13+33",
"vm_vendor" : "AdoptOpenJDK",
"bundled_jdk" : true,
"using_bundled_jdk" : true,
"count" : 34
}
],
"mem" : {
"heap_used" : "266.6gb",
"heap_used_in_bytes" : 286297579840,
"heap_max" : "725.9gb",
"heap_max_in_bytes" : 779466833920
},
"threads" : 8241
},
"fs" : {
"total" : "196.9tb",
"total_in_bytes" : 216603026358272,
"free" : "164.1tb",
"free_in_bytes" : 180494084050944,
"available" : "155.7tb",
"available_in_bytes" : 171248387346432
},
"plugins" : ,
"network_types" : {
"transport_types" : {
"security4" : 34
},
"http_types" : {
"security4" : 34
}
},
"discovery_types" : {
"zen" : 34
},
"packaging_types" : [
{
"flavor" : "default",
"type" : "rpm",
"count" : 34
}
]
}
}

Please format your code/logs/config using the </> button, or markdown style back ticks. It helps to make things easy to read which helps us help you :slight_smile:

It's possible; there are always bug fixes and performance improvements, and it'd be a recommended step as part of troubleshooting.

You should definitely upgrade your JVM, that's pretty old these days.

It also looks like your average shard size is about 5GB, which is inefficient given you have that many shards. You should look to shrink some of your indices and adjust your index creation strategy. Look at using ILM as well.
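A rough sketch of the shrink step mentioned above, with hypothetical index and node names (shrink only helps indices that have more than one primary shard, and the target shard count must be a factor of the source's):

# 1) make the index read-only and pull a copy of every shard onto one node
PUT logs-catA-2021.05.01/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "warm-node-01"
}

# 2) shrink into a new single-shard index, clearing the write block and allocation pin on the target
POST logs-catA-2021.05.01/_shrink/logs-catA-2021.05.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.blocks.write": null,
    "index.routing.allocation.require._name": null
  }
}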

@warkolm
Please format your code/logs/config using the </> button, or markdown style back ticks. It helps to make things easy to read which helps us help you
About this, I'm sorry; now I know.
By the way, this version of the JDK is the one bundled with Elasticsearch, so I would need to upgrade Elasticsearch to change the JDK version.

Is there any other solution?
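For what it's worth, the JDK each node is actually running (bundled or otherwise) can be confirmed with:

GET _cat/nodes?v&h=name,version,jdk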

Please tell me, how is this value calculated?

This (the total store size, 32.6tb);

divided by this (the total shard count, 18279).

Forgive me, but I get this value:

echo "scale=2;(32.6*1024)/18279" | bc
1.82

That is about 1.82GB, not 5GB. I can't understand.

Yeah sorry, my math was bad. That's still not good though.
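Rather than working from the cluster-wide average, actual index and shard sizes can be listed directly, for example:

# largest indices first
GET _cat/indices?v&h=index,pri,rep,store.size&s=store.size:desc
# largest individual shards first
GET _cat/shards?v&h=index,shard,prirep,store&s=store:desc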

OK, I'm going to optimize the sharding problem.
Thanks.

Based on the stats it looks like you are generating over 300 daily indices that, as @warkolm pointed out, are very small. Given that you seem to be creating around 1 TB of indices (primary and replica) per day, that is quite inefficient and means that there are a lot of indices to move at specific times, as they are likely all created within a small time frame.

I would recommend consolidating the data into a much smaller number of daily indices and potentially increasing the number of primary shards if required. Aim for a shard size of 30GB to 50GB and I think the situation should improve.

The retention period on the cold nodes also seems quite short, so 1/15 of the data held there (1TB or so?) will be replaced every day. If you have very slow storage that might also slow things down and contribute to the problem.
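As a minimal sketch of a rollover-based policy in 7.4 (the policy name and thresholds are made up, and the indices would also need a write alias plus index.lifecycle.rollover_alias, which is not shown here):

# roll over at roughly 50gb, or daily as a backstop, then delete 15 days later
PUT _ilm/policy/logs-rollover
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "delete": {
        "min_age": "15d",
        "actions": {
          "delete": { }
        }
      }
    }
  }
}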

@Christian_Dahlqvist Thanks for your response.

Yes, we generate 600+ daily indices per day, at around 8:00 a.m.

Now, in order to solve this problem, I have already started creating the indices ahead of time, at 1:30 a.m.

The current scenario is that the business logs are stored in one cluster, but the business logs have many categories and an index is created for each category.

At the same time, most services produce a very small amount of logging.
The number of shards has already been set to the minimum of 1 in the index template.

There is no other solution now. :joy:

Look at using ILM as I mentioned.

Why does each category have its own index? Why can you not store all categories in a single index, or at least a considerably smaller number?

The current approach sounds very inefficient and is likely contributing to your problems.
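As an illustration of that suggestion (index name, field names and values are hypothetical, and "category" is assumed to be mapped as a keyword in the template): the category becomes a field in one shared daily index, and per-category views become a filter instead of a separate index.

# index every category into the same daily index, tagging each document
POST business-logs-2021.05.10/_doc
{
  "category": "payment-service",
  "message": "example log line"
}

# per-category reads filter on the field instead of targeting a dedicated index
GET business-logs-*/_search
{
  "query": {
    "term": { "category": "payment-service" }
  }
}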
