Hi,
Our hot nodes drop out of the cluster frequently (at least once a day) with the error below, and we have to restart the service manually to bring the node back into the cluster.
We have 8 hot nodes. Most of the time a node goes down with heap usage above 90% and long GC pauses (sometimes lasting minutes).
[gc][69753] overhead, spent [54s] collecting in the last [54.9s]
[gc][old][69754][152] duration [42s], collections [1]/[42.5s], total [42s]/[3.5m], memory [30.3gb]->[30.7gb]/[30.9gb], all_pools {[young] [38.9mb]->[389.8mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [30.3gb]->[30.3gb]/[30.3gb]}
ERROR Recovering from [gc][old][69754][152] duration [42s], collections [1]/[42.5s], total [42s]/[3.5m], memory [30.3gb]->[30.7gb]/[30.9gb], all_pools {[young] [38.9mb]->[389.8mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [30.3gb]->[30.3gb]/[30.3gb]}
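To make the [gc][old] line above easier to read, here is a minimal Python sketch that extracts the before/after/max size of each memory pool. The regex and the names `to_bytes` and `parse_pools` are my own assumptions based only on the format of the line above, not an official parser.

```python
import re

# Unit suffixes as they appear in the log line (binary multiples assumed).
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

# Matches one pool entry such as: {[old] [30.3gb]->[30.3gb]/[30.3gb]}
POOL_RE = re.compile(r"\{\[(\w+)\] \[([^\]]+)\]->\[([^\]]+)\]/\[([^\]]+)\]\}")


def to_bytes(text):
    """Convert a size such as '30.3gb' or '0b' to a number of bytes."""
    m = re.fullmatch(r"([\d.]+)(b|kb|mb|gb|tb)", text)
    if m is None:
        raise ValueError("unrecognised size: " + text)
    return float(m.group(1)) * UNITS[m.group(2)]


def parse_pools(line):
    """Return {pool: {'before', 'after', 'max'}} in bytes for one GC log line."""
    pools = {}
    for name, before, after, maximum in POOL_RE.findall(line):
        pools[name] = {
            "before": to_bytes(before),
            "after": to_bytes(after),
            "max": to_bytes(maximum),
        }
    return pools


LINE = (
    "[gc][old][69754][152] duration [42s], collections [1]/[42.5s], "
    "total [42s]/[3.5m], memory [30.3gb]->[30.7gb]/[30.9gb], all_pools "
    "{[young] [38.9mb]->[389.8mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}"
    "{[old] [30.3gb]->[30.3gb]/[30.3gb]}"
)

pools = parse_pools(LINE)
for name, p in pools.items():
    freed = p["before"] - p["after"]
    print(name, "freed bytes:", freed, "fill ratio after GC:", p["after"] / p["max"])
```

For the line above, the old pool is at its maximum both before and after the 42s collection, so that collection reclaimed nothing from the old generation.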
Around 76 TB of data is stored in our cluster, of which around 10 TB is on the hot nodes (1.7 TB of disk is allocated on each hot node, with around 300-400 GB free).
We have allocated 31 GB of heap on each hot node, and each hot node currently holds around 350 shards.
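For reference, the per-node figures implied by these numbers can be worked out with a short Python sketch. It assumes the hot data is spread evenly across the 8 hot nodes and that 1 TB = 1024 GB; both are assumptions on my part, and the variable names are only illustrative.

```python
# Figures taken from the description above.
HOT_NODES = 8
HOT_DATA_TB = 10        # approximate data held on the hot tier
HEAP_GB = 31            # heap per hot node
SHARDS_PER_NODE = 350   # approximate shard count per hot node

# Assumption: data is evenly distributed across the hot nodes.
data_per_node_tb = HOT_DATA_TB / HOT_NODES

# Shards that each GB of heap has to serve.
shards_per_gb_heap = SHARDS_PER_NODE / HEAP_GB

# Average shard size, assuming 1 TB = 1024 GB.
avg_shard_gb = data_per_node_tb * 1024 / SHARDS_PER_NODE

print("data per hot node (TB):", data_per_node_tb)
print("shards per GB of heap:", round(shards_per_gb_heap, 1))
print("average shard size (GB):", round(avg_shard_gb, 1))
```

This works out to roughly 1.25 TB of data, about 11 shards per GB of heap, and an average shard size of about 3.7 GB on each hot node.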
Is there a way to check what is causing the high memory consumption on the hot nodes? I suspect ingestion might be one of the causes.
Is there anything specific we can check to find the root cause?
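One check I could run is to compare heap usage per node against the memory held by segments and caches, using the nodes stats API (`GET _nodes/stats/jvm,indices`). Below is a minimal Python sketch of that idea. The helper names `fetch_node_stats` and `summarize_heap` are hypothetical, and the base URL in the usage comment assumes a node reachable on localhost without authentication, which may not match our setup.

```python
import json
from urllib.request import urlopen


def fetch_node_stats(base_url):
    """GET <base_url>/_nodes/stats/jvm,indices and return the parsed JSON."""
    with urlopen(base_url + "/_nodes/stats/jvm,indices") as resp:
        return json.load(resp)


def summarize_heap(stats):
    """Return one row per node, sorted by heap usage (highest first)."""
    rows = []
    for node in stats.get("nodes", {}).values():
        indices = node.get("indices", {})
        rows.append({
            "name": node.get("name"),
            "heap_used_percent": node.get("jvm", {}).get("mem", {}).get("heap_used_percent"),
            # Heap held by Lucene segment structures.
            "segments_bytes": indices.get("segments", {}).get("memory_in_bytes", 0),
            # Heap held by fielddata (e.g. aggregations on text fields).
            "fielddata_bytes": indices.get("fielddata", {}).get("memory_size_in_bytes", 0),
            "query_cache_bytes": indices.get("query_cache", {}).get("memory_size_in_bytes", 0),
            "request_cache_bytes": indices.get("request_cache", {}).get("memory_size_in_bytes", 0),
        })
    rows.sort(key=lambda r: r["heap_used_percent"] or 0, reverse=True)
    return rows


# Usage (assumed URL, adjust host, port and auth for the real cluster):
# for row in summarize_heap(fetch_node_stats("http://localhost:9200")):
#     print(row)
```

If segment memory or fielddata accounts for a large share of the 31 GB heap on the hot nodes, that would point away from ingestion alone as the cause.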
Most of the time we see this issue only on the hot nodes.
ES version: 6.2
Any help here is much appreciated.
Thanks,
Aravind