Elasticsearch cluster has millions of pending tasks

I have an Elasticsearch cluster, version 7.4.

There are a lot of pending tasks around 8 a.m. every day. Pending task counts by source:

639479 "source": "ilm-execute-cluster-state-steps",
82666 "source": "ilm-move-to-step",
186 "source": "cluster_reroute(reroute after starting shards)",
12 "source": "ilm-set-step-info",
1 "source": "update-settings",

My cluster info:

My health info:
epoch: 1620352457
timestamp: 01:54:17
cluster: sre-elasticsearch
status: green
node.total: 34
node.data: 30
shards: 18001
pri: 11216
relo: 0
init: 0
unassign: 0
pending_tasks: 610953
max_task_wait_time: 46.8m
active_shards_percent: 100.0%

My node info:
master nodes: 3
hot nodes: 10
warm nodes: 14
cold nodes: 6
client nodes: 1
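The hot/warm/cold split presumably relies on a custom node attribute (the ILM policy below requires box_type). Assuming that, node roles and attributes can be checked with:

# node names, roles and which node is the elected master
GET _cat/nodes?v&h=name,node.role,master
# custom attributes such as box_type, per node
GET _cat/nodeattrs?v&h=node,attr,value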

My ILM policy:
{
  "hotwarm-norollover-15days-for-hot-index" : {
    "version" : 18,
    "modified_date" : "2021-04-25T11:13:59.657Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "5d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 1,
              "include" : { },
              "exclude" : { },
              "require" : {
                "box_type" : "warm"
              }
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "cold" : {
          "min_age" : "9d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "box_type" : "cold"
              }
            },
            "freeze" : { }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "delete" : {
          "min_age" : "15d",
          "actions" : {
            "delete" : { }
          }
        }
      }
    }
  }
}
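To see which ILM step indices are sitting on while the queue builds up, the ILM status and explain APIs help (logs-* here is a hypothetical index pattern; substitute your own):

# whether ILM is running at all
GET _ilm/status
# current phase, action and step for each matching index
GET logs-*/_ilm/explain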

I have already been creating the indices ahead of time, at 1:30 a.m., but there are still many pending tasks, and I don't know what is causing them.
Can anyone tell me why? Thanks.
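For context, the pre-creation at 1:30 a.m. presumably amounts to something like the following for each of the next day's indices (the index name and settings here are made up):

# hypothetical daily index created ahead of time
PUT logs-catA-2021.05.08
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}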

Anyone know?

FYI 7.4 just reached EOL, you will want to upgrade ASAP.

That is not a good idea; it's a waste of resources. Let Elasticsearch create them as needed.
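A minimal sketch of that approach, assuming a legacy index template (the only template flavour in 7.4) and hypothetical names: with a template in place, each daily index is created with the right settings on its first write, so nothing needs to be pre-created.

# hypothetical template; any new logs-* index picks these settings up automatically
PUT _template/logs-daily
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.lifecycle.name": "hotwarm-norollover-15days-for-hot-index"
  }
}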

What is the output from the _cluster/stats?pretty&human API?

@warkolm Thanks for your response. Are you suggesting an upgrade to fix the problem?

GET _cluster/stats?pretty&human output (I got this at 9:20 a.m.; at 8:00 a.m. it may have looked a little different):

{
"_nodes" : {
"total" : 34,
"successful" : 34,
"failed" : 0
},
"cluster_name" : "sre-elasticsearch",
"cluster_uuid" : "1WLLkjixT4WGg1TJ8li2zQ",
"timestamp" : 1620609728494,
"status" : "green",
"indices" : {
"count" : 10676,
"shards" : {
"total" : 18279,
"primaries" : 11312,
"replication" : 0.6158946251768034,
"index" : {
"shards" : {
"min" : 1,
"max" : 16,
"avg" : 1.7121581116523041
},
"primaries" : {
"min" : 1,
"max" : 8,
"avg" : 1.0595728737354815
},
"replication" : {
"min" : 0.0,
"max" : 1.0,
"avg" : 0.6214874484825778
}
}
},
"docs" : {
"count" : 46945133375,
"deleted" : 1689084
},
"store" : {
"size" : "32.6tb",
"size_in_bytes" : 35905473914951
},
"fielddata" : {
"memory_size" : "4.2mb",
"memory_size_in_bytes" : 4406328,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "290.7mb",
"memory_size_in_bytes" : 304876793,
"total_count" : 23108719,
"hit_count" : 4070526,
"miss_count" : 19038193,
"cache_size" : 14305,
"cache_count" : 30540,
"evictions" : 16235
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 138490,
"memory" : "28.6gb",
"memory_in_bytes" : 30721990535,
"terms_memory" : "15gb",
"terms_memory_in_bytes" : 16195784675,
"stored_fields_memory" : "12.1gb",
"stored_fields_memory_in_bytes" : 13012623640,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "164.4mb",
"norms_memory_in_bytes" : 172430208,
"points_memory" : "1.1gb",
"points_memory_in_bytes" : 1268757264,
"doc_values_memory" : "69mb",
"doc_values_memory_in_bytes" : 72394748,
"index_writer_memory" : "2.4gb",
"index_writer_memory_in_bytes" : 2610843890,
"version_map_memory" : "5.1mb",
"version_map_memory_in_bytes" : 5376215,
"fixed_bit_set" : "99.1mb",
"fixed_bit_set_memory_in_bytes" : 103998912,
"max_unsafe_auto_id_timestamp" : 1620604808932,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 34,
"coordinating_only" : 0,
"data" : 30,
"ingest" : 30,
"master" : 3,
"ml" : 1,
"voting_only" : 0
},
"versions" : [
"7.4.0"
],
"os" : {
"available_processors" : 744,
"allocated_processors" : 744,
"names" : [
{
"name" : "Linux",
"count" : 34
}
],
"pretty_names" : [
{
"pretty_name" : "CentOS Linux 7 (Core)",
"count" : 34
}
],
"mem" : {
"total" : "1.4tb",
"total_in_bytes" : 1555088515072,
"free" : "133.2gb",
"free_in_bytes" : 143066071040,
"used" : "1.2tb",
"used_in_bytes" : 1412022444032,
"free_percent" : 9,
"used_percent" : 91
}
},
"process" : {
"cpu" : {
"percent" : 300
},
"open_file_descriptors" : {
"min" : 1472,
"max" : 11777,
"avg" : 6738
}
},
"jvm" : {
"max_uptime" : "102.9d",
"max_uptime_in_millis" : 8898528373,
"versions" : [
{
"version" : "13",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "13+33",
"vm_vendor" : "AdoptOpenJDK",
"bundled_jdk" : true,
"using_bundled_jdk" : true,
"count" : 34
}
],
"mem" : {
"heap_used" : "266.6gb",
"heap_used_in_bytes" : 286297579840,
"heap_max" : "725.9gb",
"heap_max_in_bytes" : 779466833920
},
"threads" : 8241
},
"fs" : {
"total" : "196.9tb",
"total_in_bytes" : 216603026358272,
"free" : "164.1tb",
"free_in_bytes" : 180494084050944,
"available" : "155.7tb",
"available_in_bytes" : 171248387346432
},
"plugins" : ,
"network_types" : {
"transport_types" : {
"security4" : 34
},
"http_types" : {
"security4" : 34
}
},
"discovery_types" : {
"zen" : 34
},
"packaging_types" : [
{
"flavor" : "default",
"type" : "rpm",
"count" : 34
}
]
}
}

Please format your code/logs/config using the </> button, or markdown style back ticks. It helps to make things easy to read which helps us help you :slight_smile:

It's possible; there are always bug fixes and performance improvements, and it'd be a recommended step as part of troubleshooting.

You should definitely upgrade your JVM, that's pretty old these days.

It also looks like your average shard size is about 5GB, which is inefficient given you have that many shards. You should look to shrink some of your indices and adjust your index creation strategy. Look at using ILM as well.
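A rough sketch of the shrink step mentioned above, with hypothetical index and node names (shrink only helps indices that have more than one primary shard, and the target shard count must be a factor of the source's):

# 1) make the index read-only and pull a copy of every shard onto one node
PUT logs-catA-2021.05.01/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "warm-node-01"
}

# 2) shrink into a new single-shard index, clearing the write block and allocation pin on the target
POST logs-catA-2021.05.01/_shrink/logs-catA-2021.05.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.blocks.write": null,
    "index.routing.allocation.require._name": null
  }
}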

@warkolm
Please format your code/logs/config using the </> button, or markdown style back ticks. It helps to make things easy to read which helps us help you
About this, I'm sorry; now I know.
By the way, this version of the JDK is the one bundled with Elasticsearch, so I would need to upgrade Elasticsearch to change the JDK version.

Is there any other solution?
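For what it's worth, the JDK each node is actually running (bundled or otherwise) can be confirmed with:

GET _cat/nodes?v&h=name,version,jdk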

Please tell me, how is this value calculated?

This (the total store size, 32.6tb);

divided by this (the total shard count, 18279).

Forgive me, but I get this value:

echo "scale=2;(32.6*1024)/18279" | bc
1.82

That is about 1.82GB, not 5GB. I can't understand.

Yeah sorry, my math was bad. That's still not good though.
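Rather than working from the cluster-wide average, actual index and shard sizes can be listed directly, for example:

# largest indices first
GET _cat/indices?v&h=index,pri,rep,store.size&s=store.size:desc
# largest individual shards first
GET _cat/shards?v&h=index,shard,prirep,store&s=store:desc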

OK, I'm going to optimize the sharding problem.
Thanks.

Based on the stats it looks like you are generating over 300 daily indices that, as @warkolm pointed out, are very small. Given that you seem to be creating around 1 TB of indices (primary and replica) per day, that is quite inefficient and means that there are a lot of indices to move at specific times, as they are likely all created within a small time frame.

I would recommend consolidating the data into a much smaller number of daily indices and potentially increasing the number of primary shards if required. Aim for a shard size of 30GB to 50GB and I think the situation should improve.

The retention period on the cold nodes also seems quite short, so 1/15 of the data held there (1TB or so?) will be replaced every day. If you have very slow storage that might also slow things down and contribute to the problem.
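As a minimal sketch of a rollover-based policy in 7.4 (the policy name and thresholds are made up, and the indices would also need a write alias plus index.lifecycle.rollover_alias, which is not shown here):

# roll over at roughly 50gb, or daily as a backstop, then delete 15 days later
PUT _ilm/policy/logs-rollover
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "delete": {
        "min_age": "15d",
        "actions": {
          "delete": { }
        }
      }
    }
  }
}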

@Christian_Dahlqvist Thanks for your response.

Yes, we generate 600+ daily indices per day, at around 8:00 a.m.

Now, in order to solve this problem, I have already started creating the indices ahead of time, at 1:30 a.m.

The current scenario is that the business logs are stored in one cluster, but the business logs have many categories and an index is created for each category.

At the same time, most services produce a very small amount of logging.
The number of shards has already been set to the minimum of 1 in the index template.

There is no other solution now. :joy:

Look at using ILM as I mentioned.

Why does each category have its own index? Why can you not store all categories in a single index, or at least a considerably smaller number?

The current approach sounds very inefficient and is likely contributing to your problems.
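As an illustration of that suggestion (index name, field names and values are hypothetical, and "category" is assumed to be mapped as a keyword in the template): the category becomes a field in one shared daily index, and per-category views become a filter instead of a separate index.

# index every category into the same daily index, tagging each document
POST business-logs-2021.05.10/_doc
{
  "category": "payment-service",
  "message": "example log line"
}

# per-category reads filter on the field instead of targeting a dedicated index
GET business-logs-*/_search
{
  "query": {
    "term": { "category": "payment-service" }
  }
}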
