Continuous GC on Master Node

We have a large cluster that holds 200 TB of data:
50 physical servers (150 data nodes), with 3 instances running on each server, 30 GB heap each (124 GB total physical RAM per server)
3 master nodes (16 GB total RAM, 12 GB assigned to heap)
3 client nodes

The master nodes are doing garbage collection almost continuously, roughly every 10-20 seconds.
At random, data nodes fail to connect to the master node, fail over to another master node, and then point back to the original master node.
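
One way to put a number on "every 10-20 seconds" is to sample the nodes stats API twice and compare the collector counters. The sketch below is only an illustration: it assumes the `jvm` section returned by `GET _nodes/stats/jvm` (fields `gc.collectors.<name>.collection_count` and `collection_time_in_millis`), the helper name `gc_rate` is hypothetical, and the sample numbers are made up rather than taken from this cluster.

```python
def gc_rate(before, after, interval_s, collector="old"):
    """Compare two samples of one node's `jvm` stats section.

    Returns (collections per minute, fraction of wall time spent in GC).
    """
    b = before["gc"]["collectors"][collector]
    a = after["gc"]["collectors"][collector]
    runs = a["collection_count"] - b["collection_count"]
    millis = a["collection_time_in_millis"] - b["collection_time_in_millis"]
    per_minute = runs * 60.0 / interval_s
    gc_fraction = millis / (interval_s * 1000.0)
    return per_minute, gc_fraction


# Hypothetical samples taken 60 seconds apart (illustrative numbers only).
sample_1 = {"gc": {"collectors": {"old": {
    "collection_count": 100, "collection_time_in_millis": 50000}}}}
sample_2 = {"gc": {"collectors": {"old": {
    "collection_count": 104, "collection_time_in_millis": 56000}}}}

per_minute, gc_fraction = gc_rate(sample_1, sample_2, interval_s=60)
print(f"old GC runs/min: {per_minute:.1f}, time in GC: {gc_fraction:.0%}")
```

A steadily rising old-generation count together with a large share of wall time spent in GC would suggest the master heap is under real pressure rather than just collecting young objects frequently.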

Last night, all of a sudden, all nodes started failing one by one.

Another thing we observed is that data nodes are rejecting bulk requests for one index. Could that be causing the heap memory issues?

Please let me know if you need extra info.

What version? How many indices and shards?

Hi Mark,

Elasticsearch version: 6.3.1 (upgraded a couple of weeks back)
Indices: 1200
Shards: 12000
Total size: 220 TB

Thanks for your response.
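
For scale, the figures above can be turned into per-index and per-node averages. This is plain arithmetic on the numbers quoted in the thread; it assumes the 12000 shards are spread evenly over the 150 data nodes, and the thread does not say whether the shard count or the total size includes replicas.

```python
indices = 1200
shards = 12000
data_nodes = 150
total_tb = 220

shards_per_index = shards / indices
shards_per_node = shards / data_nodes      # assumes an even spread
avg_shard_gb = total_tb * 1000 / shards    # decimal units: 1 TB = 1000 GB

print(f"shards per index: {shards_per_index:.0f}")
print(f"shards per data node: {shards_per_node:.0f}")
print(f"average shard size: {avg_shard_gb:.1f} GB")
```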

Do you have Monitoring enabled?

Yes, basic monitoring.

As Elasticsearch uses a lot of off-heap memory, it has long been recommended to assign no more than 50% of available RAM to heap. Given that you have at least 90 GB assigned to heap out of 124 GB (assuming that the master and coordinating-only nodes run on other hardware), you have greatly exceeded this, which is neither recommended nor optimal.
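
The ratio behind this point can be checked with a few lines of arithmetic. The sketch assumes the 124 GB is per physical server and the 16 GB is per master node, as the original post suggests; `heap_share` is a hypothetical helper name.

```python
def heap_share(heap_gb_per_instance, instances, physical_ram_gb):
    """Fraction of a host's physical RAM claimed by JVM heaps."""
    return heap_gb_per_instance * instances / physical_ram_gb


data_host = heap_share(30, 3, 124)    # three 30 GB instances per 124 GB server
master_host = heap_share(12, 1, 16)   # one 12 GB heap on a 16 GB master

print(f"data host heap share:   {data_host:.0%}")
print(f"master host heap share: {master_host:.0%}")
# Total heap per data host that would match the 50% guideline.
print(f"50% of 124 GB: {0.5 * 124:.0f} GB")
```

Under those assumptions both the data hosts and the master nodes sit well above the 50% guideline.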

I have never tested running with this much heap on a node, so I am not sure how issues around this would manifest themselves.

Thanks Christian.

Yes, I have doubts about that too, but the cluster has been running with those settings for over a year. (I will plan on reducing it to 2 instances per physical node.)

The cluster was opened up to other teams a month ago, and a lot of searches have started hitting it recently.

At this point, these are the things I suspect or plan to fix:

  1. Heavy searches might be causing the issue. I am planning on adding slow search logging. I just found out that huge aggregation searches can break a cluster. If that is true, is there any preventive measure we can take so that users cannot break my cluster? Is the circuit breaker the configuration I should check?
  2. Fix the message format of the documents that Elasticsearch is rejecting for one index.
  3. I am also planning on increasing the heap for the master VMs (not just the heap: I will add more RAM to the VM and then increase the heap). Will this help, or will it make things worse because GC will need more time?

Please let me know if there is anything else I need to check.
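
For item 1, a minimal sketch of the two pieces mentioned: the request body that turns on the search slow log for an index, and a check of whether any circuit breaker has already tripped. It assumes the dynamic index settings `index.search.slowlog.threshold.*` (applied with `PUT /<index>/_settings`) and the response shape of `GET _nodes/stats/breaker`. The helper names, thresholds, node names, and counts are all hypothetical, and no request is actually sent.

```python
import json


def slowlog_settings(warn="10s", info="5s"):
    """Body for PUT /<index>/_settings enabling the search slow log.

    The thresholds are placeholders, not recommendations.
    """
    return {
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
        "index.search.slowlog.threshold.fetch.warn": warn,
    }


def tripped_breakers(nodes_stats):
    """List (node name, breaker, tripped count) for breakers that have tripped.

    Expects the parsed JSON of GET _nodes/stats/breaker.
    """
    hits = []
    for node in nodes_stats["nodes"].values():
        for breaker, stats in node["breakers"].items():
            if stats["tripped"] > 0:
                hits.append((node["name"], breaker, stats["tripped"]))
    return sorted(hits)


# Hypothetical response fragment (node names and counts are made up).
sample = {
    "nodes": {
        "id1": {"name": "data-1", "breakers": {
            "request": {"tripped": 3}, "fielddata": {"tripped": 0}}},
        "id2": {"name": "data-2", "breakers": {
            "request": {"tripped": 0}, "parent": {"tripped": 1}}},
    }
}

print(json.dumps(slowlog_settings(), indent=2))
print(tripped_breakers(sample))
```

The breaker limits themselves are controlled by settings such as `indices.breaker.request.limit` and `indices.breaker.total.limit`. A tripped breaker rejects the offending request instead of letting the node run out of heap, but the breakers work from memory estimates, so they may not catch every expensive aggregation.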
