Hi Community,
May I ask for help?
We are running Elasticsearch v7.17.0 with 43 nodes. Each node has a 2 TB SSD, 8 cores, 32 GB RAM, and a 16 GB heap.
There are currently about 2,400 indices and 6,400 shards.
Some nodes hold fewer shards but have a full disk, which drives their CPU to 100% and impacts the whole cluster: all ingest pipeline requests are rejected with this error:
{
  "error": {
    "root_cause": [
      {
        "type": "es_rejected_execution_exception",
        "reason": "rejected execution of coordinating operation [coordinating_and_primary_bytes=750630256, replica_bytes=0, all_bytes=750630256, coordinating_operation_bytes=1477151, max_coordinating_and_primary_bytes=751619276]"
      }
    ],
    "type": "es_rejected_execution_exception",
    "reason": "rejected execution of coordinating operation [coordinating_and_primary_bytes=750630256, replica_bytes=0, all_bytes=750630256, coordinating_operation_bytes=1477151, max_coordinating_and_primary_bytes=751619276]"
  },
  "status": 429
}
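For context, my understanding is that this 429 is indexing-pressure back-pressure (the indexing_pressure.memory.limit setting, which defaults to 10% of the heap). To see which nodes are rejecting, I look at the per-node counters like this (the filter_path is only there to trim the output):

GET _nodes/stats/indexing_pressure?filter_path=nodes.*.name,nodes.*.indexing_pressure

The *_rejections counters under indexing_pressure.memory.total point at the overloaded nodes.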
Shards are distributed automatically by Elasticsearch across the 43 nodes. All indices are time-series indices with index templates and ILM, rolling over daily/weekly/monthly depending on the size of the data source.
There is a mix of very small indices (xx MB) and huge indices (xx GB). We keep the recommended default of a 50 GB maximum shard size, but some data sources are very small.
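For illustration, our ILM policies look roughly like this (a simplified sketch; "my-policy" and the 7d max_age are placeholders, the real values differ per data source):

PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      }
    }
  }
}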
The combination of small and large indices seems to be a problem for Elasticsearch: it happens that the small indices are allocated to some nodes while the huge ones end up on others, which results in some nodes having full storage while others are half empty, because Elasticsearch allocates shards to the nodes with the fewest shards rather than by disk usage.
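One idea I am considering is to cap how many shards of a large index can land on a single node via its index template (a sketch; "big-datasource" is a placeholder for one of our large data sources):

PUT _index_template/big-datasource
{
  "index_patterns": ["big-datasource-*"],
  "template": {
    "settings": {
      "index.routing.allocation.total_shards_per_node": 1
    }
  }
}

I am not sure this is the right approach, though, since a value that is too low can leave shards unassigned.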
My workaround is to remove the node from the cluster and add it back a few hours later.
The problem is that this situation recurs every few days and it breaks production.
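A less disruptive variant I am considering is draining the node with an allocation filter instead of removing it (a sketch; tela01prahkz is just the current problem node):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "tela01prahkz"
  }
}

and clearing the filter again once disk usage drops:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": null
  }
}

But that still does not fix the underlying imbalance.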
GET _cat/allocation?v&s=node
shards disk.indices disk.used disk.avail disk.total disk.percent node
136 1.6tb 1.6tb 232.2gb 1.9tb 88 tela01prahkz --> problem node: smallest number of shards (136) but disk full
134 1.6tb 1.7tb 201.8gb 1.9tb 89 tela02prahkz --> problem node: smallest number of shards (134) but disk full
179 643.8gb 736gb 1.2tb 1.9tb 37 tela03prahkz
179 1tb 1.1tb 822.2gb 1.9tb 58 tela04prahkz
179 738.2gb 836.4gb 1.1tb 1.9tb 42 tela05prahkz
....
Thank you for any advice.
I already reported this issue before, but it was not resolved: