Continuous GC on Master Node

praveen.vemuri · September 5, 2018, 8:53pm

We have a huge cluster holding 200 TB of data:
50 physical servers (150 data nodes), with 3 instances running on each server at 30 GB heap each (total physical RAM 124 GB)
3 master nodes (total RAM 16 GB, heap assigned 12 GB)
3 client nodes

The master nodes are doing garbage collection continuously, roughly every 10-20 seconds.
Data nodes randomly fail to connect to the master node, fail over to another master node, and then point back to the original master node.

Last night, all of a sudden, all nodes started failing one by one.

Another thing we observed is that data nodes are rejecting bulk requests for one index. Could that be causing the heap memory issues?

Please let me know if you need extra info.

warkolm · September 6, 2018, 1:10am

What version? How many indices and shards?

praveen.vemuri · September 6, 2018, 1:31am

Hi Mark,

Elasticsearch version 6.3.1 (upgraded a couple of weeks back)
indices 1200
shards 12000
total size 220 TB

Thanks for your response.
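
For reference, the figures quoted in this thread work out roughly as follows. This is only a back-of-the-envelope sketch: it assumes the 12,000 shards are spread evenly over the 150 data nodes and that the 220 TB total covers all of those shards, neither of which is stated in the thread.

```python
# Rough per-node arithmetic from the numbers quoted in this thread.
# Assumption: shards are evenly distributed and 220 TB covers all 12,000 shards.
data_nodes = 150        # 50 physical servers x 3 instances each
total_indices = 1_200
total_shards = 12_000
total_size_tb = 220

shards_per_node = total_shards / data_nodes
shards_per_index = total_shards / total_indices
avg_shard_size_gb = total_size_tb * 1000 / total_shards  # decimal TB -> GB

print(f"shards per data node: {shards_per_node:.0f}")
print(f"shards per index:     {shards_per_index:.0f}")
print(f"average shard size:   {avg_shard_size_gb:.1f} GB")
```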

warkolm · September 6, 2018, 1:32am

Do you have Monitoring enabled?

praveen.vemuri · September 6, 2018, 1:32am

Yes, basic monitoring.

Christian_Dahlqvist · September 6, 2018, 5:21am

As Elasticsearch uses a lot of off-heap memory, it has long been recommended to assign no more than 50% of available RAM to heap. Given that you have at least 90GB assigned to heap out of 124GB (assuming that master and coordinating-only nodes run on other hardware), you have greatly exceeded this, which is clearly neither recommended nor optimal.

I have never tested running with this much heap on a node, so I am not sure how issues around this would manifest themselves.
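
The heap-to-RAM ratio above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted in this thread; the `heap_fraction` helper is a hypothetical name introduced for illustration, and the 50% threshold is the guideline mentioned above, not a hard limit enforced by Elasticsearch.

```python
# Compare assigned heap against the "no more than 50% of RAM" guideline.
RECOMMENDED_MAX = 0.5

def heap_fraction(heap_gb, ram_gb):
    """Fraction of physical RAM assigned to JVM heap (hypothetical helper)."""
    return heap_gb / ram_gb

hosts = {
    "data host (3 x 30 GB heap, 124 GB RAM)": heap_fraction(3 * 30, 124),
    "master node (12 GB heap, 16 GB RAM)": heap_fraction(12, 16),
}

for name, fraction in hosts.items():
    status = "over" if fraction > RECOMMENDED_MAX else "within"
    print(f"{name}: {fraction:.0%} of RAM is heap ({status} the guideline)")
```

By this measure both the data hosts and the master nodes are well over the guideline, which leaves comparatively little memory for off-heap use and the operating system's file cache.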

praveen.vemuri · September 6, 2018, 5:21pm

Thanks Christian.

Yes, I have doubts about that too, but it has been running with those settings for over a year (I plan to reduce it to 2 instances per physical node).

It was opened to other teams a month back, and a lot of searches started hitting it recently.

At this point, these are the things I suspect or plan to fix:

Heavy searches might be causing the issue, so I am planning on adding slow search logging. I just found out that huge aggregation searches can break the cluster. If that is true, is there any preventive measure we can take so that users will not break my cluster? Is the circuit breaker the config I should check?
Fix the format of the messages that Elasticsearch is rejecting for one index.
I am also planning on increasing the heap for the master VM (not just the heap: I will add more RAM to the VM and then increase the heap). Will this help, or will it make things worse because GC needs more time?
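
As a starting point for the slow-log and circuit-breaker questions above, the sketch below builds the JSON bodies for two settings requests. The setting names are, to my knowledge, the ones documented for Elasticsearch 6.x, but please verify them against the 6.3 documentation before use. The function names and every threshold value are illustrative assumptions, not recommendations from this thread, and nothing here is sent to a cluster.

```python
import json

def cluster_guardrails(request_breaker="40%", max_buckets=10000):
    """Body for PUT _cluster/settings. Values are illustrative assumptions."""
    return {
        "persistent": {
            # Limits memory used by per-request structures such as aggregations.
            "indices.breaker.request.limit": request_breaker,
            # Caps the number of aggregation buckets a single response may hold.
            "search.max_buckets": max_buckets,
        }
    }

def search_slowlog(query_warn="10s", query_info="5s", fetch_warn="1s"):
    """Body for PUT <index>/_settings. Thresholds are illustrative assumptions."""
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.query.info": query_info,
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
    }

# Print the bodies so they can be reviewed before being applied by hand.
print(json.dumps(cluster_guardrails(), indent=2))
print(json.dumps(search_slowlog(), indent=2))
```

Printing the bodies rather than sending them keeps the sketch safe to run anywhere; the output can be pasted into a console or a `curl` call once the values have been reviewed.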

Please let me know if there is anything else I need to check.

system · October 4, 2018, 5:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
