My cluster frequently has shards on initializing_shards

Lohanna_Sarah · January 18, 2023, 6:24pm

My cluster frequently has shards in the initializing_shards state. The cluster is running on k8s. Is there anything I can do to fix this? The data nodes are restarted at least once a day.

warkolm · January 18, 2023, 10:03pm

Welcome to our community!

Why are the nodes restarted?
What do the Elasticsearch logs show?

Christian_Dahlqvist · January 19, 2023, 6:35am

What is the full output of the cluster stats API?

Which version of Elasticsearch are you using?

Why are you restarting Elasticsearch so often?

Do you see shards initialize only after restarts or also at other times?

Lohanna_Sarah · January 19, 2023, 1:05pm

Sometimes the pod is deleted during node scale-down, or the node kills the pod because it is out of memory.

We are using version 7.14.1.

The cluster stats:

{
"_nodes": {
"total": 5,
"successful": 5,
"failed": 0
},
"cluster_name": "diario-alertas",
"cluster_uuid": "T3fKh3vLQJGMbf-OEFHJ-Q",
"timestamp": 1674133629765,
"status": "green",
"indices": {
"count": 4,
"shards": {
"total": 16,
"primaries": 8,
"replication": 1.0,
"index": {
"shards": {
"min": 2,
"max": 10,
"avg": 4.0
},
"primaries": {
"min": 1,
"max": 5,
"avg": 2.0
},
"replication": {
"min": 1.0,
"max": 1.0,
"avg": 1.0
}
}
},
"docs": {
"count": 426210025,
"deleted": 86704911
},
"store": {
"size_in_bytes": 4673224925138,
"total_data_set_size_in_bytes": 4673224925138,
"reserved_in_bytes": 0
},
"fielddata": {
"memory_size_in_bytes": 6686887732,
"evictions": 0
},
"query_cache": {
"memory_size_in_bytes": 0,
"total_count": 0,
"hit_count": 0,
"miss_count": 0,
"cache_size": 0,
"cache_count": 0,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 1187,
"memory_in_bytes": 4510592,
"terms_memory_in_bytes": 2283760,
"stored_fields_memory_in_bytes": 1329688,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 232192,
"points_memory_in_bytes": 0,
"doc_values_memory_in_bytes": 664952,
"index_writer_memory_in_bytes": 0,
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": 1654525198398,
"file_sizes": {}
},
"mappings": {
"field_types": [
{
"name": "boolean",
"count": 1,
"index_count": 1,
"script_count": 0
},
{
"name": "date",
"count": 2,
"index_count": 1,
"script_count": 0
},
{
"name": "float",
"count": 2,
"index_count": 1,
"script_count": 0
},
{
"name": "keyword",
"count": 17,
"index_count": 1,
"script_count": 0
},
{
"name": "long",
"count": 12,
"index_count": 1,
"script_count": 0
},
{
"name": "object",
"count": 8,
"index_count": 1,
"script_count": 0
},
{
"name": "percolator",
"count": 1,
"index_count": 1,
"script_count": 0
},
{
"name": "text",
"count": 16,
"index_count": 1,
"script_count": 0
}
],
"runtime_field_types": []
},
"analysis": {
"char_filter_types": [],
"tokenizer_types": [],
"filter_types": [],
"analyzer_types": [],
"built_in_char_filters": [],
"built_in_tokenizers": [],
"built_in_filters": [],
"built_in_analyzers": []
},
"versions": [
{
"version": "7.14.1",
"index_count": 4,
"primary_shard_count": 8,
"total_primary_bytes": 2336580588500
}
]
},
"nodes": {
"count": {
"total": 5,
"coordinating_only": 0,
"data": 3,
"data_cold": 0,
"data_content": 0,
"data_frozen": 0,
"data_hot": 0,
"data_warm": 0,
"ingest": 0,
"master": 2,
"ml": 0,
"remote_cluster_client": 0,
"transform": 0,
"voting_only": 0
},
"versions": [
"7.14.1"
],
"os": {
"available_processors": 5,
"allocated_processors": 5,
"names": [
{
"name": "Linux",
"count": 5
}
],
"pretty_names": [
{
"pretty_name": "CentOS Linux 8",
"count": 5
}
],
"architectures": [
{
"arch": "amd64",
"count": 5
}
],
"mem": {
"total_in_bytes": 81604378624,
"free_in_bytes": 984485888,
"used_in_bytes": 80619892736,
"free_percent": 1,
"used_percent": 99
}
},
"process": {
"cpu": {
"percent": 2
},
"open_file_descriptors": {
"min": 379,
"max": 415,
"avg": 393
}
},
"jvm": {
"max_uptime_in_millis": 420614266,
"versions": [
{
"version": "16.0.2",
"vm_name": "OpenJDK 64-Bit Server VM",
"vm_version": "16.0.2+7",
"vm_vendor": "Eclipse Foundation",
"bundled_jdk": true,
"using_bundled_jdk": true,
"count": 5
}
],
"mem": {
"heap_used_in_bytes": 19427339400,
"heap_max_in_bytes": 60557361152
},
"threads": 190
},
"fs": {
"total_in_bytes": 9740860465152,
"free_in_bytes": 5050001469440,
"available_in_bytes": 5049917583360
},
"plugins": [],
"network_types": {
"transport_types": {
"security4": 5
},
"http_types": {
"security4": 5
}
},
"discovery_types": {
"zen": 5
},
"packaging_types": [
{
"flavor": "default",
"type": "docker",
"count": 5
}
],
"ingest": {
"number_of_pipelines": 2,
"processor_stats": {
"gsub": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"script": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
}
}
}
}
}

Christian_Dahlqvist · January 19, 2023, 2:10pm

Adding or removing data nodes will always cause reallocation and rebalancing, so if this happens that is expected. I would generally recommend not autoscaling Elasticsearch for this reason.

Version 7.14.1 is quite old and I would recommend upgrading.

It looks like you have very, very large shards (average of 292GB?), which will take time and resources to relocate. This likely means rebalancing will be slow and shards will take a long time to initialize. I would recommend increasing the number of primary shards in order to bring the shard size down to around 50GB or so.
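For reference, the arithmetic behind that estimate can be sketched in a few lines of Python. The byte count and primary count are copied from the cluster stats posted above; the 50GB target is the figure suggested here, and the variable names are only illustrative. This is a cluster-wide average across all 4 indices, so the right primary count for each index depends on that index's own size.

```python
import math

# Figures copied from the "versions" section of the cluster stats above.
total_primary_bytes = 2_336_580_588_500
primary_shard_count = 8

GB = 1000 ** 3  # decimal gigabytes, which matches the ~292GB estimate

# Average size of one primary shard across the whole cluster.
avg_primary_gb = total_primary_bytes / primary_shard_count / GB

# Assumption: aim for roughly 50GB per shard, as suggested in this reply.
target_shard_gb = 50
primaries_needed = math.ceil(total_primary_bytes / GB / target_shard_gb)

print(f"average primary shard size: {avg_primary_gb:.0f} GB")
print(f"primaries needed for ~{target_shard_gb} GB shards: {primaries_needed}")
```

Under these assumptions it works out to roughly 47 primaries in total instead of 8. The number of primaries is fixed when an index is created, so changing it for existing data generally means reindexing or using the split index API.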

Having just 2 master-eligible nodes is very bad, as a minimum of 3 master-eligible nodes is required for the cluster to continue operating fully if one of the master-eligible nodes fails or becomes unavailable. You should look to increase this to 3.
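The reasoning is plain majority arithmetic. A minimal sketch in Python, assuming a simple majority-quorum model (the function names are illustrative, and Elasticsearch's actual voting configuration logic is more involved than this):

```python
def majority(master_eligible: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - majority(master_eligible)


for n in (2, 3):
    print(
        f"{n} master-eligible nodes: majority={majority(n)}, "
        f"tolerated failures={tolerated_failures(n)}"
    )
```

With 2 nodes the majority is 2, so losing either one leaves no majority. With 3 nodes the majority is still 2, so one node can fail.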

It looks like your heap is set to more than 50% of available RAM, which is not recommended. Elasticsearch uses off-heap memory and relies on the operating system cache for performance. Ensure you increase RAM to correct this. This could very well be why pods are running out of memory and getting killed.
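A quick way to see where the 50% concern comes from is to compare the JVM and OS memory totals in the cluster stats above. This is only a sketch: the totals cover all 5 nodes, including the small master nodes, so the ratio on an individual data node may differ somewhat, and the helper name is hypothetical.

```python
def heap_fraction(heap_max_bytes: int, ram_total_bytes: int) -> float:
    """Return the share of RAM reserved for JVM heap (0.0 to 1.0)."""
    return heap_max_bytes / ram_total_bytes


# Copied from nodes.jvm.mem and nodes.os.mem in the cluster stats above.
heap_max_in_bytes = 60_557_361_152
mem_total_in_bytes = 81_604_378_624

fraction = heap_fraction(heap_max_in_bytes, mem_total_in_bytes)
print(f"heap is {fraction:.0%} of RAM")

# The guideline in this reply: keep heap at or below half of RAM.
if fraction > 0.5:
    print("heap exceeds 50% of RAM")
```

Across the cluster this comes to roughly three quarters of RAM, which leaves little room for off-heap memory and the operating system cache.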

Lohanna_Sarah · January 19, 2023, 4:27pm

Is it possible to configure a limit to the heap memory?

Christian_Dahlqvist · January 19, 2023, 4:35pm

How are the nodes configured? How many resources are assigned?

Lohanna_Sarah · January 19, 2023, 5:05pm

24 GB of RAM, and the disk has 3 TB.

Christian_Dahlqvist · January 19, 2023, 6:44pm

Are all nodes the same specification? What is the heap size set to?

Lohanna_Sarah · January 19, 2023, 7:44pm

Yes, using the cat API I got the following values:

heap.current heap.max name
6.5gb        18gb     diario-alertas-es-data-nodes-2
257.5mb      1.1gb    diario-alertas-es-master-nodes-0
509.4mb      1.1gb    diario-alertas-es-master-nodes-1
8.4gb        18gb     diario-alertas-es-data-nodes-0
966.6mb      18gb     diario-alertas-es-data-nodes-1

Christian_Dahlqvist · January 19, 2023, 8:03pm

So the nodes have an 18GB heap on 24GB of RAM? Heap should be no more than 50% of RAM, so I would recommend increasing RAM or reducing the heap size (assuming this does not lead to issues with GC).

warkolm · January 19, 2023, 9:54pm

@Lohanna_Sarah just a note to please format your code/logs/config using the </> button or markdown-style backticks. It makes things easier to read, which helps us help you.

Lohanna_Sarah · January 19, 2023, 9:57pm

Thanks, I will try to reduce the heap size.

system · February 16, 2023, 9:58pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
