Hi,
We are experiencing long GC pauses:
[2015-10-30 11:07:06,647][INFO ][monitor.jvm ] [SERVER1] [gc][young][107581][13144] duration [726ms], collections [1]/[1s], total [726ms]/[12.4m], memory [15.6gb]->[15.4gb]/[29.8gb], all_pools {[young] [1.3gb]->[687.9mb]/[1.4gb]}{[survivor] [117.1mb]->[191.3mb]/[191.3mb]}{[old] [14.1gb]->[14.5gb]/[28.1gb]}
[2015-10-30 11:07:07,699][INFO ][monitor.jvm ] [SERVER1] [gc][young][107582][13145] duration [740ms], collections [1]/[1s], total [740ms]/[12.4m], memory [15.4gb]->[15.6gb]/[29.8gb], all_pools {[young] [687.9mb]->[15.7mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [14.5gb]->[15.4gb]/[28.1gb]}
[2015-10-30 11:09:07,207][INFO ][monitor.jvm ] [SERVER1] [gc][young][107697][13167] duration [739ms], collections [1]/[1.3s], total [739ms]/[12.6m], memory [24.3gb]->[25.2gb]/[29.8gb], all_pools {[young] [4.3mb]->[17.1mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [24.1gb]->[25gb]/[28.1gb]}
[2015-10-30 11:09:08,660][INFO ][monitor.jvm ] [SERVER1] [gc][young][107698][13168] duration [744ms], collections [1]/[1.4s], total [744ms]/[12.6m], memory [25.2gb]->[26.1gb]/[29.8gb], all_pools {[young] [17.1mb]->[5.5mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [25gb]->[25.9gb]/[28.1gb]}
[2015-10-30 11:09:10,114][INFO ][monitor.jvm ] [SERVER1] [gc][young][107699][13169] duration [754ms], collections [1]/[1.4s], total [754ms]/[12.6m], memory [26.1gb]->[26.9gb]/[29.8gb], all_pools {[young] [5.5mb]->[7.1mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [25.9gb]->[26.7gb]/[28.1gb]}
[2015-10-30 11:12:46,660][WARN ][monitor.jvm ] [SERVER1] [gc][old][107714][275] duration [50.7s], collections [1]/[51.5s], total [50.7s]/[4.2m], memory [29.7gb]->[29.4gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.3gb]/[1.4gb]}{[survivor] [97mb]->[0b]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2015-10-30 11:13:45,613][WARN ][monitor.jvm ] [SERVER1] [gc][old][107716][276] duration [57.4s], collections [1]/[57.9s], total [57.4s]/[5.2m], memory [29.7gb]->[29.5gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.3gb]/[1.4gb]}{[survivor] [116.3mb]->[0b]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2015-10-30 11:14:38,895][WARN ][monitor.jvm ] [SERVER1] [gc][old][107719][277] duration [49.8s], collections [1]/[50.5s], total [49.8s]/[6m], memory [29.7gb]->[29.6gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.4gb]/[1.4gb]}{[survivor] [130mb]->[1.1mb]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2015-10-30 11:15:32,315][WARN ][monitor.jvm ] [SERVER1] [gc][old][107721][278] duration [50.5s], collections [1]/[51s], total [50.5s]/[6.9m], memory [29.8gb]->[29.6gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.4gb]/[1.4gb]}{[survivor] [183.6mb]->[46.7mb]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2015-10-30 11:16:34,482][WARN ][monitor.jvm ] [SERVER1] [gc][old][107723][279] duration [59.8s], collections [1]/[59.9s], total [59.8s]/[7.9m], memory [29.8gb]->[29.5gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.4gb]/[1.4gb]}{[survivor] [189.2mb]->[0b]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2015-10-30 11:17:30,741][WARN ][monitor.jvm ] [SERVER1] [gc][old][107725][280] duration [55s], collections [1]/[53.8s], total [55s]/[8.8m], memory [29.7gb]->[29.6gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.4gb]/[1.4gb]}{[survivor] [167.2mb]->[4.4mb]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2015-10-30 11:18:31,689][WARN ][monitor.jvm ] [SERVER1] [gc][old][107728][281] duration [58.1s], collections [1]/[58.9s], total [58.1s]/[9.8m], memory [29.7gb]->[29.6gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.4gb]/[1.4gb]}{[survivor] [191.3mb]->[13.7mb]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2015-10-30 11:19:32,905][WARN ][monitor.jvm ] [SERVER1] [gc][old][107731][282] duration [57.2s], collections [1]/[57.9s], total [57.2s]/[10.8m], memory [29.7gb]->[29.6gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.4gb]/[1.4gb]}{[survivor] [191.3mb]->[28.4mb]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
.....
Usage / Configuration / Versions:
1.25B docs
60 total shards (30 primary + 30 replicas)
15 data nodes (Each 60GB RAM, 32 processors, of which 30GB is heap, like recommended)
Storage size in each node is around 150GB, shard size is around 33GB (I know it's a little above recommended, don't know if 10% might be causing this..)
We have ~150-200 searches per second
We’re using Java 7u67 (Oracle), with recommended GC configuration.
Our cluster is fine most of the time, but once every couple of days it goes nuts with the error log stated on top.
Any help would be much appreciated, as I'm not really sure what debug options are there to understand in ES what's causing this.
Thanks.