Some stats about the cluster
3 master nodes
12 hot data+ingest nodes
16 vCPU 32GB mem (16GB heap) each
12 warm data+ingest nodes
8 vCPU 32GB mem (16GB heap) each
1 ingest node (added recently for testing)
Ingestion is done with a ALB that routes to all the non-master nodes (warm+hot+client)
data is sent from a flink job using the java ES client + filebeat pods
indices have no static mappings and dynamic mapping is used
Some indices have pretty complex structure but nothing too nested or exotic
Hot nodes hold data 3-4 days back and warm nodes hold all the rest of the data (depends on index that can be 30 days back)
Average number of shards per hot node is 40 with total size of 1.3TB (1.5B docs)
Since we reduces the amount of shards per index significantly we have no issues during the day ingesting, indexing and querying data.
However, every day at 00:00 when our filebeat pods write to the new indexes, the task queue completely fills up.
It takes about 5 hours to stop seeing rejections completely and i know (but did not measure) that it takes some time for the new indices to show up
How can this be improved? i guess that upgrading to a new version can help and we are planning to do so but we want to find a solution in the meantime.
I believe the amount of index/shards per node is completely within recommended limits. is it a CPU intensive operation? memory intensive?