Hi,
We have a cluster of 6 nodes across 3 servers, all with master, data, and ingest (mdi) roles, each using a ~32 GB heap with G1GC, on version 6.4.2. The fielddata circuit-breaker limit is the same (60%) on all nodes.
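For reference, we verify the node roles and heap sizes with the cat nodes API (just a sanity check on our side):
curl -XGET 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.max,heap.percent'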
We have been periodically running into out-of-memory issues due to increasing heap usage. While investigating the cause, we noticed that the fielddata size on one specific node, which is not the master, is consistently 1.5x-2x that of every other node, as shown below.
curl -XGET 'localhost:9200/_cat/fielddata?v&h=host,node,field,size&fields=text'
host node field size
host1 node1_1 text 3.6gb
host1 node1_2 text 3.6gb
host2 node2_1 text 3.6gb
host2 node2_2 text 5.4gb
host3 node3_1 text 3.6gb
host3 node3_2 text 3.6gb
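We also listed fielddata without the fields filter to check whether the imbalance is limited to the text field:
curl -XGET 'localhost:9200/_cat/fielddata?v&h=host,node,field,size'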
And here, the fielddata limits are the same on node2_2 as on every other node.
Node2_2
curl -XGET 'localhost:9200/_nodes/node2_2/stats?pretty' 2>/dev/null | grep fielddata -A2
"fielddata" : {
"memory_size_in_bytes" : 5899379216,
"evictions" : 0
--
"fielddata" : {
"limit_size_in_bytes" : 20615843020,
"limit_size" : "19.1gb",
Node2_1
curl -XGET 'localhost:9200/_nodes/node2_1/stats?pretty' 2>/dev/null | grep fielddata -A2
"fielddata" : {
"memory_size_in_bytes" : 3953272232,
"evictions" : 0
--
"fielddata" : {
"limit_size_in_bytes" : 20615843020,
"limit_size" : "19.1gb",
Also, there have been no G1GC marking cycle logs on any node for a long time, except on Node2_2, where a new entry is printed roughly every 24 hours.
[2019-06-28T09:02:30,676][INFO ][o.e.m.j.JvmGcMonitorService] [Node2_2] [gc][311694] overhead, spent [328ms] collecting in the last [1s]
[2019-06-28T09:02:41,831][INFO ][o.e.m.j.JvmGcMonitorService] [Node2_2] [gc][old][311696][3] duration [9.9s], collections [1]/[10.1s], total [9.9s]/[30.3s], memory [28.6gb]->[21.9gb]/[32gb], all_pools {[young] [384mb]->[48mb]/[0b]}{[survivor] [0b]->[0b]/[0b]}{[old] [28.2gb]->[21.9gb]/[32gb]}
[2019-06-28T09:02:41,831][WARN ][o.e.m.j.JvmGcMonitorService] [Node2_2] [gc][311696] overhead, spent [10s] collecting in the last [10.1s]
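For context, the GC collection counts and times we compare between nodes come from the JVM stats endpoint:
curl -XGET 'localhost:9200/_nodes/node2_2/stats/jvm?pretty' 2>/dev/null | grep -A4 collectors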
What could possibly cause this difference?
And a second question:
To prevent the long and frequent garbage collections and OOMs, or at least reduce their frequency, we clear all query, request, and fielddata caches daily with the request below.
curl -X POST "localhost:9200/*/_cache/clear"
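An alternative we have considered (assuming the documented query parameters of the clear cache API) is clearing only the fielddata cache instead of everything:
curl -X POST 'localhost:9200/*/_cache/clear?fielddata=true'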
Is it normal to do so, or does it go against the logic of caching and should it be avoided entirely?
I appreciate any help.
Thanks.