Hi,
I am running a 3-node Elasticsearch 2.3.2 cluster on Linux (2.6.32-400.37.1.el6uek.x86_64). Every few hours a node goes into long GC pauses lasting from 2 to 5 minutes; the node then gets disconnected, the cluster becomes unstable, and indexing and search activity fails.
Since I read that disabling swapping helps old-gen GC complete in milliseconds rather than seconds, I set bootstrap.mlockall: true, and when I verify it via the _nodes API I see it applied. Yet I still face the long GCs (both old and young). Can you give me some pointers?
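For reference, this is roughly how I set and verified it (a minimal sketch; the filter_path query is just one way to check, and the limits.conf lines assume the process runs as the elasticsearch user):

# elasticsearch.yml (2.x setting; renamed to bootstrap.memory_lock in 5.x)
bootstrap.mlockall: true

# /etc/security/limits.conf - the JVM also needs permission to lock memory
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

# after a restart, every node should report mlockall: true
curl -s 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'

Here are the relevant logs from node2: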
[2017-08-24 17:10:31,748][WARN ][monitor.jvm ] [node2] [gc][young][96270][321] duration [5.3s], collections [1]/[6.3s], total [5.3s]/[7.3m], memory [15.4gb]->[13.2gb]/[19.5gb], all_pools {[young] [3.3gb]->[68.4mb]/[3.4gb]}{[survivor] [440.9mb]->[440.9mb]/[440.9mb]}{[old] [11.6gb]->[12.7gb]/[15.6gb]}
[2017-08-24 17:10:33,667][INFO ][cluster.service ] [node2] detected_master {node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300}, added {{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300},}, reason: zen-disco-receive(from master [{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300}])
[2017-08-24 17:12:50,033][WARN ][monitor.jvm ] [node2] [gc][young][96275][322] duration [1.8m], collections [1]/[2.2m], total [1.8m]/[9.2m], memory [16.3gb]->[5.2gb]/[19.5gb], all_pools {[young] [3.1gb]->[55.1mb]/[3.4gb]}{[survivor] [440.9mb]->[0b]/[440.9mb]}{[old] [12.7gb]->[5.2gb]/[15.6gb]}
[2017-08-24 17:12:50,033][WARN ][monitor.jvm ] [node2] [gc][old][96275][13] duration [22.2s], collections [1]/[2.2m], total [22.2s]/[1.6m], memory [16.3gb]->[5.2gb]/[19.5gb], all_pools {[young] [3.1gb]->[55.1mb]/[3.4gb]}{[survivor] [440.9mb]->[0b]/[440.9mb]}{[old] [12.7gb]->[5.2gb]/[15.6gb]}
[2017-08-24 17:12:50,269][INFO ][discovery.zen ] [node2] master_left [{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-08-24 17:12:50,269][WARN ][discovery.zen ] [node2] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{node2}{yetMu9irR4q25WCyI265lw}{17x.xx.xx.222}{node2/17x.xx.xx.222:9300},{node3}{i4Up59UlQzOqY4q-i4-ZAg}{17x.xx.xx.223}{17x.xx.xx.223:9300},}
[2017-08-24 17:12:50,270][INFO ][cluster.service ] [node2] removed {{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300},}, reason: zen-disco-master_failed ({node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300})
[2017-08-24 17:12:53,493][INFO ][cluster.service ] [node2] detected_master {node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300}, added {{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300},}, reason: zen-disco-receive(from master [{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300}])
[2017-08-24 17:12:55,917][WARN ][monitor.jvm ] [node2] [gc][young][96279][323] duration [2.3s], collections [1]/[2.8s], total [2.3s]/[9.2m], memory [8gb]->[6.6gb]/[19.5gb], all_pools {[young] [2.8gb]->[54.3mb]/[3.4gb]}{[survivor] [0b]->[440.9mb]/[440.9mb]}{[old] [5.2gb]->[6.1gb]/[15.6gb]}
[2017-08-24 17:13:08,406][WARN ][monitor.jvm ] [node2] [gc][young][96284][324] duration [8.2s], collections [1]/[8.4s], total [8.2s]/[9.3m], memory [10gb]->[8.8gb]/[19.5gb], all_pools {[young] [3.4gb]->[401.1mb]/[3.4gb]}{[survivor] [440.9mb]->[440.9mb]/[440.9mb]}{[old] [6.1gb]->[7.9gb]/[15.6gb]}
[2017-08-24 17:13:20,165][WARN ][monitor.jvm ] [node2] [gc][young][96288][325] duration [8s], collections [1]/[8.5s], total [8s]/[9.5m], memory [10.8gb]->[10.8gb]/[19.5gb], all_pools {[young] [2.4gb]->[903.2mb]/[3.4gb]}{[survivor] [440.9mb]->[440.9mb]/[440.9mb]}{[old] [7.9gb]->[9.5gb]/[15.6gb]}
[2017-08-24 17:13:36,853][WARN ][monitor.jvm ] [node2] [gc][young][96292][326] duration [13.6s], collections [1]/[13.6s], total [13.6s]/[9.7m], memory [13.4gb]->[12.8gb]/[19.5gb], all_pools {[young] [3.4gb]->[142.5mb]/[3.4gb]}{[survivor] [440.9mb]->[440.9mb]/[440.9mb]}{[old] [9.5gb]->[12.2gb]/[15.6gb]}
[2017-08-24 17:13:53,700][WARN ][transport ] [node2] Received response for a request that has timed out, sent [57780ms] ago, timed out [16846ms] ago, action [internal:discovery/zen/fd/master_ping], node [{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300}], id [99933]
[2017-08-24 17:17:06,826][WARN ][monitor.jvm ] [node2] [gc][old][96310][14] duration [3.2m], collections [1]/[3.2m], total [3.2m]/[4.9m], memory [16.1gb]->[10.9gb]/[19.5gb], all_pools {[young] [3.4gb]->[86.6mb]/[3.4gb]}{[survivor] [440.9mb]->[0b]/[440.9mb]}{[old] [12.2gb]->[10.8gb]/[15.6gb]}
[2017-08-24 17:17:06,846][INFO ][discovery.zen ] [node2] master_left [{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-08-24 17:19:39,078][WARN ][cluster.service ] [node2] cluster state update task [zen-disco-receive(from master [{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300}])] took 6.7m above the warn threshold of 30s
[2017-08-24 17:19:39,080][WARN ][discovery.zen ] [node2] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{node2}{yetMu9irR4q25WCyI265lw}{17x.xx.xx.222}{node2/17x.xx.xx.222:9300},{node3}{i4Up59UlQzOqY4q-i4-ZAg}{17x.xx.xx.223}{17x.xx.xx.223:9300},}
[2017-08-24 17:19:39,080][INFO ][cluster.service ] [node2] removed {{node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300},}, reason: zen-disco-master_failed ({node1}{mDHkISCtTNikmybnFKWAYg}{17x.xx.xx.221}{17x.xx.xx.221:9300})
[2017-08-24 17:19:39,086][INFO ][discovery.zen ] [node2] master_left [null], reason [failed to perform initial connect [null]]