My cluster frequently has shards in initializing_shards

My cluster frequently has shards in the initializing_shards state. The cluster is running on k8s. Is there anything I could do to fix it? The data nodes are restarted at least once a day.

Welcome to our community! :smiley:

Why are the nodes restarted?
What do the Elasticsearch logs show?

What is the full output of the cluster stats API?

Which version of Elasticsearch are you using?

Why are you restarting Elasticsearch so often?

Do you see shards initialize only after restarts or also at other times?

Sometimes the pod is deleted during node scale-down, or the node kills the pod because it is out of memory.

We are using version 7.14.1.

The cluster stats:

{
"_nodes": {
"total": 5,
"successful": 5,
"failed": 0
},
"cluster_name": "diario-alertas",
"cluster_uuid": "T3fKh3vLQJGMbf-OEFHJ-Q",
"timestamp": 1674133629765,
"status": "green",
"indices": {
"count": 4,
"shards": {
"total": 16,
"primaries": 8,
"replication": 1.0,
"index": {
"shards": {
"min": 2,
"max": 10,
"avg": 4.0
},
"primaries": {
"min": 1,
"max": 5,
"avg": 2.0
},
"replication": {
"min": 1.0,
"max": 1.0,
"avg": 1.0
}
}
},
"docs": {
"count": 426210025,
"deleted": 86704911
},
"store": {
"size_in_bytes": 4673224925138,
"total_data_set_size_in_bytes": 4673224925138,
"reserved_in_bytes": 0
},
"fielddata": {
"memory_size_in_bytes": 6686887732,
"evictions": 0
},
"query_cache": {
"memory_size_in_bytes": 0,
"total_count": 0,
"hit_count": 0,
"miss_count": 0,
"cache_size": 0,
"cache_count": 0,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 1187,
"memory_in_bytes": 4510592,
"terms_memory_in_bytes": 2283760,
"stored_fields_memory_in_bytes": 1329688,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 232192,
"points_memory_in_bytes": 0,
"doc_values_memory_in_bytes": 664952,
"index_writer_memory_in_bytes": 0,
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": 1654525198398,
"file_sizes": {}
},
"mappings": {
"field_types": [
{
"name": "boolean",
"count": 1,
"index_count": 1,
"script_count": 0
},
{
"name": "date",
"count": 2,
"index_count": 1,
"script_count": 0
},
{
"name": "float",
"count": 2,
"index_count": 1,
"script_count": 0
},
{
"name": "keyword",
"count": 17,
"index_count": 1,
"script_count": 0
},
{
"name": "long",
"count": 12,
"index_count": 1,
"script_count": 0
},
{
"name": "object",
"count": 8,
"index_count": 1,
"script_count": 0
},
{
"name": "percolator",
"count": 1,
"index_count": 1,
"script_count": 0
},
{
"name": "text",
"count": 16,
"index_count": 1,
"script_count": 0
}
],
"runtime_field_types": []
},
"analysis": {
"char_filter_types": [],
"tokenizer_types": [],
"filter_types": [],
"analyzer_types": [],
"built_in_char_filters": [],
"built_in_tokenizers": [],
"built_in_filters": [],
"built_in_analyzers": []
},
"versions": [
{
"version": "7.14.1",
"index_count": 4,
"primary_shard_count": 8,
"total_primary_bytes": 2336580588500
}
]
},
"nodes": {
"count": {
"total": 5,
"coordinating_only": 0,
"data": 3,
"data_cold": 0,
"data_content": 0,
"data_frozen": 0,
"data_hot": 0,
"data_warm": 0,
"ingest": 0,
"master": 2,
"ml": 0,
"remote_cluster_client": 0,
"transform": 0,
"voting_only": 0
},
"versions": [
"7.14.1"
],
"os": {
"available_processors": 5,
"allocated_processors": 5,
"names": [
{
"name": "Linux",
"count": 5
}
],
"pretty_names": [
{
"pretty_name": "CentOS Linux 8",
"count": 5
}
],
"architectures": [
{
"arch": "amd64",
"count": 5
}
],
"mem": {
"total_in_bytes": 81604378624,
"free_in_bytes": 984485888,
"used_in_bytes": 80619892736,
"free_percent": 1,
"used_percent": 99
}
},
"process": {
"cpu": {
"percent": 2
},
"open_file_descriptors": {
"min": 379,
"max": 415,
"avg": 393
}
},
"jvm": {
"max_uptime_in_millis": 420614266,
"versions": [
{
"version": "16.0.2",
"vm_name": "OpenJDK 64-Bit Server VM",
"vm_version": "16.0.2+7",
"vm_vendor": "Eclipse Foundation",
"bundled_jdk": true,
"using_bundled_jdk": true,
"count": 5
}
],
"mem": {
"heap_used_in_bytes": 19427339400,
"heap_max_in_bytes": 60557361152
},
"threads": 190
},
"fs": {
"total_in_bytes": 9740860465152,
"free_in_bytes": 5050001469440,
"available_in_bytes": 5049917583360
},
"plugins": [],
"network_types": {
"transport_types": {
"security4": 5
},
"http_types": {
"security4": 5
}
},
"discovery_types": {
"zen": 5
},
"packaging_types": [
{
"flavor": "default",
"type": "docker",
"count": 5
}
],
"ingest": {
"number_of_pipelines": 2,
"processor_stats": {
"gsub": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"script": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
}
}
}
}
}

Adding or removing data nodes will always cause reallocation and rebalancing, so if this happens that is expected. I would generally recommend not autoscaling Elasticsearch for this reason.

Version 7.14.1 is quite old and I would recommend upgrading.

It looks like you have very, very large shards (average of 292GB?), which will take time and resources to relocate. This likely means rebalancing will be slow and shards will take a long time to initialize. I would recommend increasing the number of primary shards in order to bring the shard size down to around 50GB or so.
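As a rough check of that estimate, here is a minimal sketch that redoes the arithmetic from the cluster stats posted above. It assumes decimal gigabytes and treats the 50GB figure as a loose target, not a hard limit. The result is a cluster-wide total, so how those primaries are split across the 4 indices would still need to be decided per index.

```python
import math

# Figures copied from the "versions" section of the cluster stats above.
total_primary_bytes = 2_336_580_588_500
primary_shards = 8

GB = 10**9  # assumption: decimal gigabytes

# Average size of one primary shard across the whole cluster.
avg_primary_gb = total_primary_bytes / primary_shards / GB

# Total number of primaries needed to bring the average down to the target.
target_shard_gb = 50
primaries_needed = math.ceil(total_primary_bytes / (target_shard_gb * GB))

print(f"average primary shard size: {avg_primary_gb:.0f} GB")
print(f"primaries needed for ~{target_shard_gb} GB shards: {primaries_needed}")
```

With these numbers the average comes out at about 292 GB per primary, and roughly 47 primaries in total would be needed to reach the 50GB target.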

Having just 2 master-eligible nodes is very bad, as a minimum of 3 master-eligible nodes is required for the cluster to continue operating fully if one of them fails or becomes unavailable. You should look to increase this to 3.
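To illustrate why 3 is the usual minimum, here is a small sketch using a simplified strict-majority model. This is an assumption for illustration only: Elasticsearch manages its own voting configuration and the real behaviour has more detail than this. The function names are hypothetical.

```python
def quorum(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


for n in (2, 3):
    print(f"{n} master-eligible: quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under this model, 2 master-eligible nodes tolerate no failures, while 3 tolerate the loss of one.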

It looks like your heap is set to more than 50% of available RAM, which is not recommended. Elasticsearch uses off-heap memory and relies on the operating system cache for performance. Ensure you increase RAM to correct this. This could very well be why pods are running out of memory and getting killed.
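The ratio can be read straight from the cluster stats above. The sketch below uses the cluster-wide totals, so it mixes the data and master nodes together and is only an aggregate indication.

```python
# Cluster-wide totals copied from the cluster stats output above.
heap_max_in_bytes = 60_557_361_152      # nodes.jvm.mem.heap_max_in_bytes
os_mem_total_in_bytes = 81_604_378_624  # nodes.os.mem.total_in_bytes

heap_to_ram = heap_max_in_bytes / os_mem_total_in_bytes

print(f"configured heap is {heap_to_ram:.0%} of total RAM")
```

That works out to roughly 74% of RAM assigned to heap across the cluster, well above the 50% guideline.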

Is it possible to configure a limit to the heap memory?

How are the nodes configured? How many resources are assigned?

24 GB of RAM, and the disk is 3 TB.

Are all nodes the same specification? What is the heap size set to?

Yes, using the cat API I got the following values:
heap.current heap.max name
6.5gb 18gb diario-alertas-es-data-nodes-2
257.5mb 1.1gb diario-alertas-es-master-nodes-0
509.4mb 1.1gb diario-alertas-es-master-nodes-1
8.4gb 18gb diario-alertas-es-data-nodes-0
966.6mb 18gb diario-alertas-es-data-nodes-1

So the data nodes have an 18GB heap on 24GB of RAM? The heap should be no more than 50% of RAM, so I would recommend increasing RAM or reducing the heap size (assuming this does not lead to issues with GC).
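As a sketch of what that could look like for a 24GB node: the helper names below are hypothetical, and the 30GB upper cap is an assumption based on the common advice to keep the heap below the compressed object pointer threshold. `-Xms` and `-Xmx` are the standard JVM heap options. How they are passed to Elasticsearch (a `jvm.options` file, the `ES_JAVA_OPTS` environment variable, or the pod spec) depends on how the cluster is deployed, which this thread does not say.

```python
def recommended_heap_gb(ram_gb: float, cap_gb: float = 30) -> int:
    """Half of the node's RAM, capped at cap_gb (the cap is an assumption)."""
    return int(min(ram_gb / 2, cap_gb))


def jvm_heap_flags(heap_gb: int) -> str:
    """Set minimum and maximum heap to the same value."""
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"


heap_gb = recommended_heap_gb(24)  # data nodes in this thread have 24GB RAM
print(jvm_heap_flags(heap_gb))
```

For the data nodes here this gives `-Xms12g -Xmx12g`, down from the current 18GB.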

@Lohanna_Sarah just a note to please format your code/logs/config using the </> button, or markdown style back ticks. It helps to make things easy to read which helps us help you :slight_smile:


Thanks, I will try to reduce the heap size.