We have a 15-node ES cluster: 10 data nodes, 2 client nodes, and 3 dedicated master nodes.
Each data node has 20GB RAM with a 10GB heap.
The cluster captures logs flowing in through Logstash.
Each index is created with 10 shards. We create indices once per hour, grouped by log category, so we end up with roughly 1,000 indices, which translates to about 10,000 shards. We close indices older than 7 days and delete indices older than 10 days.
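For context on the scale involved, here is a back-of-the-envelope comparison against Elastic's commonly cited rule of thumb of at most roughly 20 shards per GB of heap (a guideline, not a hard limit; the function name is my own):

```python
def shard_budget(data_nodes, heap_gb_per_node, shards_per_gb=20):
    """Rough ceiling on shards the cluster can comfortably hold,
    per the ~20-shards-per-GB-of-heap rule of thumb."""
    return data_nodes * heap_gb_per_node * shards_per_gb

total_shards = 1000 * 10        # ~1,000 indices x 10 shards each
budget = shard_budget(10, 10)   # 10 data nodes, 10 GB heap each

print(total_shards, budget)     # 10000 vs. 2000
```

Closed indices hold no heap, so only the open (under-7-day) subset counts against the budget, but even ~7,000 open shards would be several times over the guideline.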
With the current setup, after a few days we see heap utilization reach 90% on all data nodes, at which point they spend most of their time in GC and the whole cluster breaks down. Usually this forces us to restart the data nodes.
What are the possible reasons our heap climbs so high? Which metrics should we monitor, and what can we do to avoid this scenario?
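As one example of the kind of check we are after: per-node heap usage is exposed as `heap_used_percent` in the `_nodes/stats/jvm` response. A minimal helper (the function name and threshold are my own) to flag hot nodes from that parsed JSON might look like:

```python
def nodes_over_heap_threshold(stats, threshold=85):
    """Given a parsed _nodes/stats/jvm response, return the names
    of nodes whose heap usage exceeds `threshold` percent."""
    return [
        node["name"]
        for node in stats["nodes"].values()
        if node["jvm"]["mem"]["heap_used_percent"] > threshold
    ]

# Shape of the relevant fields in the real response:
sample = {
    "nodes": {
        "abc123": {"name": "data-1", "jvm": {"mem": {"heap_used_percent": 62}}},
        "def456": {"name": "data-3", "jvm": {"mem": {"heap_used_percent": 91}}},
    }
}
print(nodes_over_heap_threshold(sample))  # ['data-3']
```

Against a live cluster one would fetch `http://<host>:9200/_nodes/stats/jvm` and feed the parsed JSON to the same function.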
Any pointers are greatly appreciated.