Hi,
I have a es 1.5.2 cluster with 6 nodes. They are all configured with 16G heap size. The total shard number is 10*2(10 primary and 10 replica). The total number of doc is about 300 millions and the storage is about 1T.
The cluster runs fine for several weeks and the heap usage of the nodes is all kept at around 50%. Today I experienced 2 times cluster crashes. In my monitor console, I can see the heap usage of all nodes increased rapidly at the same time. Then all the 6 nodes went into full gc loop and the whole cluster hung up. This is the gc log when the problem happens:
[2015-09-10 15:39:15,854][INFO ][monitor.jvm ] [Brute II] [gc][young][21071][49601] duration [764ms], collections [1]/[1.4s], total [764ms]/[25.9m], memory [13.7gb]->[13.9gb]/[15.8gb], all_
pools {[young] [83.1mb]->[2.5mb]/[1.1gb]}{[survivor] [34.1mb]->[34.1mb]/[149.7mb]}{[old] [13.6gb]->[13.8gb]/[14.5gb]}
[2015-09-10 15:39:18,892][INFO ][monitor.jvm ] [Brute II] [gc][young][21073][49603] duration [858ms], collections [1]/[1.6s], total [858ms]/[25.9m], memory [14.1gb]->[14.3gb]/[15.8gb], all_
pools {[young] [5.6mb]->[4.4mb]/[1.1gb]}{[survivor] [34.1mb]->[34.1mb]/[149.7mb]}{[old] [14.1gb]->[14.3gb]/[14.5gb]}
[2015-09-10 15:39:41,173][WARN ][monitor.jvm ] [Brute II] [gc][old][21074][399] duration [21.7s], collections [1]/[22.2s], total [21.7s]/[1.8m], memory [14.3gb]->[12gb]/[15.8gb], all_pools
{[young] [4.4mb]->[251.6mb]/[1.1gb]}{[survivor] [34.1mb]->[0b]/[149.7mb]}{[old] [14.3gb]->[11.8gb]/[14.5gb]}
[2015-09-10 15:39:42,452][INFO ][monitor.jvm ] [Brute II] [gc][young][21075][49605] duration [812ms], collections [1]/[1.2s], total [812ms]/[25.9m], memory [12gb]->[13.4gb]/[15.8gb], all_po
ols {[young] [251.6mb]->[571.4mb]/[1.1gb]}{[survivor] [0b]->[149.7mb]/[149.7mb]}{[old] [11.8gb]->[12.7gb]/[14.5gb]}
[2015-09-10 15:39:44,943][WARN ][monitor.jvm ] [Brute II] [gc][young][21076][49606] duration [1.5s], collections [1]/[2.4s], total [1.5s]/[26m], memory [13.4gb]->[14.1gb]/[15.8gb], all_pool
s {[young] [571.4mb]->[61.3mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [12.7gb]->[13.9gb]/[14.5gb]}
[2015-09-10 15:40:24,203][WARN ][monitor.jvm ] [Brute II] [gc][old][21079][400] duration [36.9s], collections [1]/[37.2s], total [36.9s]/[2.4m], memory [15.1gb]->[15.6gb]/[15.8gb], all_pool
s {[young] [1.1gb]->[1.1gb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [13.9gb]->[14.5gb]/[14.5gb]}
There was not any gc log before this sudden heap usage peak. What could be the causes for this kind of sudden heap increasing and how can I investigate the root cause?