Elasticsearch heap issues


(Anusha Dharmalingam) #1

Hi,

We have a 15 node ES cluster with 10 data + 2 client + 3 dedicated master nodes.
Data nodes are allocated 20GB RAM and 10GB heap.
This cluster is to capture logs flowing through logstash.

The setup is to create 10 shards per index . We create indices once per hour grouped by different categories of logs. Roughly we end up with 1000 indices translating to 10,000 shards. We close indices older than 7 days and delete indices older than 10 days.

With the current setup, after few days we see heap utilization reaching to 90% on all data nodes making them do only GC and hence break the entire cluster. Usually this forces us to restart our data nodes.

What are all the possible reasons on why we would have our heap going so high ? What are all the metrics that we should monitor to help us and what could we do to avoid this scenario.

Any pointers is greatly appreciated.


(Christian Dahlqvist) #2

Based on your description it sounds like you are generating far too many shards. Having a very large number of small shards can waste a lot of resources, e.g. heap and file handles, which has been discussed numerous times here in the forums. A reasonably common shard size for logging use cases is 5-50GB, so if your shards currently are a lot smaller than that I would recommend looking into reducing the number of shards per index, switch to daily indices and/or merge indices.


(Anusha Dharmalingam) #3

Hi Christian,

Thanks for your response. We are going to try reducing the shard size.

On the same point, currently when one of our data node (out of 10 ) goes down, around 8000 shards gets into UNASSIGNED state. Under this condition, no data gets indexed. The indexing requests gets timeouts. It appears the cluster gets busy in reallocating the shards rather than having new documents indexed. Is their any configuration that is recommended to balance the cluster load between redistributing the shards and indexing new documents?

Thanks,
Anusha.


(Ravi Shanker Reddy) #4

if you are bringing down the node forcefully then the "routing allocation" is available for you.


(system) #5