We have a very large cluster that holds 200 TB of data.
50 physical servers (150 data nodes): 3 instances run on each physical server with a 30 GB heap each (total physical RAM is 124 GB per server).
3 master nodes (16 GB total RAM, 12 GB assigned to heap).
3 client nodes.
The master nodes are garbage collecting almost continuously, roughly every 10-20 seconds.
At random, data nodes fail to connect to the master node, fail over to another master node, and then point back to the original master node.
Last night, all of a sudden, all the nodes started failing one by one.
Another thing we observed is that the data nodes are rejecting bulk requests for one index. Could that be causing all the heap memory issues?
As Elasticsearch uses a lot of off-heap memory, it has long been recommended to assign no more than 50% of available RAM to heap. Given that you have at least 90GB assigned to heap out of 124GB (assuming that the master and coordinating-only nodes run on other hardware), you have greatly exceeded this, which is neither recommended nor optimal.
I have never tested running with this much heap on a node, so I am not sure how issues around this would manifest themselves.
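The arithmetic behind the 50% guideline can be checked quickly. The following is only a sketch using the numbers quoted in this thread; the function names are hypothetical helpers, not part of any Elasticsearch tooling.

```python
def heap_fraction(heap_gb_per_instance: float, instances: int, ram_gb: float) -> float:
    """Share of a host's physical RAM claimed by JVM heaps."""
    return heap_gb_per_instance * instances / ram_gb


def max_instances(heap_gb_per_instance: float, ram_gb: float, limit: float = 0.5) -> int:
    """Largest number of equally sized instances whose combined heap stays at or under `limit`."""
    return int((ram_gb * limit) // heap_gb_per_instance)


if __name__ == "__main__":
    # Figures taken from the posts above: 3 x 30 GB heaps on a 124 GB server.
    current = heap_fraction(30, 3, 124)
    # The planned change: 2 instances per physical server.
    planned = heap_fraction(30, 2, 124)
    # Master nodes: 12 GB heap out of 16 GB RAM.
    master = heap_fraction(12, 1, 16)

    print(f"data host today:       {current:.0%} of RAM is heap")
    print(f"data host, 2 instances: {planned:.0%} of RAM is heap")
    print(f"master node:           {master:.0%} of RAM is heap")
    print(f"instances that fit under 50%: {max_instances(30, 124)}")
```

With these inputs, the current layout puts roughly 73% of RAM into heap, while two 30 GB instances per server would come in just under the 50% mark. The master nodes, at 12 GB of 16 GB, sit at 75% by the same measure.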
Yes, I have doubts about that too, but the cluster has been running with those settings for over a year (I plan to move to 2 instances per physical server).
It was opened up to other teams a month ago, and a lot of searches have started hitting it recently.
At this point, these are the things I suspect or plan to fix:
Heavy searches might be causing the issue, so I am planning to add slow search logging. I just found out that huge aggregation searches can break the cluster. If that is true, is there any preventive measure we can take so that users cannot break the cluster? Is the circuit breaker the setting I should look at?
Fix the format of the messages that Elasticsearch is rejecting for one index.
I am also planning to increase the heap on the master VMs (not just the heap: I will add more RAM to the VM and then increase the heap). Will this help, or will it make things worse because a larger heap takes longer to GC?
Please let me know if there is anything else I need to check.
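For the slow search logging and circuit breaker items above, the request bodies might look roughly like the sketch below. It only builds and prints JSON and sends nothing to a cluster. The setting names are the standard dynamic settings for the search slow log and the request/total circuit breakers, but whether they fit depends on the Elasticsearch version in use, and every threshold value and helper name here is an illustrative assumption rather than a recommendation.

```python
import json


def slowlog_settings(warn: str = "10s", info: str = "5s") -> dict:
    """Body for `PUT /<index>/_settings` enabling the search slow log.

    The thresholds are placeholder values; tune them to the workload.
    """
    return {
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
        "index.search.slowlog.threshold.fetch.warn": warn,
    }


def breaker_settings(request_limit: str = "40%", total_limit: str = "70%") -> dict:
    """Body for `PUT /_cluster/settings` tightening the circuit breakers.

    The request breaker covers per-request structures such as aggregation
    buckets; the total breaker caps all breakers combined. The percentages
    are assumptions chosen only for illustration.
    """
    return {
        "persistent": {
            "indices.breaker.request.limit": request_limit,
            "indices.breaker.total.limit": total_limit,
        }
    }


if __name__ == "__main__":
    print(json.dumps(slowlog_settings(), indent=2))
    print(json.dumps(breaker_settings(), indent=2))
```

Circuit breakers work from memory estimates and reject a request once the estimate crosses the limit, so they reduce the risk from oversized aggregations without fully removing it. The slow log then helps identify which queries to restrict.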