GC not working

I've been having a lot of problems with the GC on my cluster. It always reaches a point where the heap fills up, the GC can't reclaim any memory, and my nodes become unresponsive. I've tried many different things to no avail. I'm currently running 20 shards, 2 replicas, and 16GB of RAM (8GB for heap) per node, with mlockall set to true and "sudo swapoff -a" run on all the machines (a sketch of these settings follows the log excerpt below). However, I still get unresponsiveness after a while because the GC isn't doing its job:

[2016-06-28 11:44:45,302][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50439][1134] duration [24.7s], collections [1]/[24.8s], total [24.7s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [42.5mb]->[38.1mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:45:19,942][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50440][1135] duration [34.5s], collections [1]/[34.6s], total [34.5s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [38.1mb]->[37mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:45:44,822][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50441][1136] duration [24.7s], collections [1]/[24.8s], total [24.7s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [37mb]->[36.3mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:46:18,494][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50442][1137] duration [33.5s], collections [1]/[33.6s], total [33.5s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [36.3mb]->[36.4mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:46:43,184][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50443][1138] duration [24.5s], collections [1]/[24.6s], total [24.5s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [36.4mb]->[35.3mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:47:16,802][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50444][1139] duration [33.5s], collections [1]/[33.6s], total [33.5s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [35.3mb]->[34.9mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:47:43,330][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50445][1140] duration [26.4s], collections [1]/[26.5s], total [26.4s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [34.9mb]->[32.3mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
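
For reference, the settings described above look roughly like this (ES 2.x-style; the file locations and exact values here are a sketch, not copied verbatim from my nodes):

    # /etc/default/elasticsearch (or equivalent environment file): give half the RAM to the heap
    ES_HEAP_SIZE=8g

    # elasticsearch.yml: lock the heap into memory
    bootstrap.mlockall: true

    # run once on every host so the heap can't be swapped out
    sudo swapoff -a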

Does anyone have an idea of what I can do to try to debug this problem?

Please send me the output of http://localhost:9200/_nodes/stats?pretty&human
If my guess is correct, most of the heap is occupied by the field data cache.
If that is the case, there are ways to reduce the field data cache.
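
For example, you can check how much heap field data is using and put a cap on the cache (a sketch; the 30% value is only illustrative, not a recommendation for your cluster):

    # per-node field data usage
    curl 'http://localhost:9200/_nodes/stats/indices/fielddata?pretty&human'

    # elasticsearch.yml: bound the field data cache so entries get evicted instead of filling the heap
    indices.fielddata.cache.size: 30%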

You need to dump the node stats API output, and possibly the heap, and find out what's sitting on the heap that cannot be freed (field data, suggesters, etc.). Then you'll have to address it accordingly.
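
Roughly along these lines (the file names and <es_pid> are placeholders):

    # dump the node stats to a file
    curl 'http://localhost:9200/_nodes/stats?pretty&human' > nodes_stats.json

    # dump the Elasticsearch JVM heap for offline analysis (e.g. in a heap analyzer); <es_pid> is the Elasticsearch process id
    jmap -dump:format=b,file=es-heap.hprof <es_pid>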

I ran that and the output was really long. I made a gist on GitHub for it:
gist

Thanks for helping me look into it! I'm still not proficient enough to debug this type of problem myself.

ursa-es-data-node-18 is one of the problematic nodes at the moment.

It seems the output you sent is not from the problematic period.
Based on that output, the cluster was started around 50 minutes ago, and there are no frequent GCs happening on any of the machines.
Can you please send me the output from when your cluster is in an unresponsive state?
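
If it's hard to catch it live, something like this (just a sketch) can snapshot the stats every minute so you have output from the bad period:

    # take a node stats snapshot every 60 seconds; keep the files from when the node went unresponsive
    while true; do
      curl -s 'http://localhost:9200/_nodes/stats?pretty&human' > "nodes_stats-$(date +%s).json"
      sleep 60
    done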

Also, some other observations:
22 node cluster
only 1 master-eligible node.
1 client node
20 data nodes.
In my opinion, having only one master-eligible node in this cluster is a very bad choice. You need at least 3 master-eligible nodes to make the cluster highly available (a sketch of the relevant settings is below).
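
For example, on three of the nodes (ES 2.x-style settings; the values are a sketch to adjust to your topology):

    # elasticsearch.yml on the three master-eligible nodes
    node.master: true
    node.data: false

    # quorum of master-eligible nodes: (3 / 2) + 1 = 2, to avoid split brain
    discovery.zen.minimum_master_nodes: 2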