ES 1.5.2 cluster crashes


(Rural Hunter) #1

Hi,

I have an ES 1.5.2 cluster with 6 nodes, each configured with a 16G heap. There are 20 shards in total (10 primaries and 10 replicas), holding about 300 million docs and about 1T of storage.
The cluster ran fine for several weeks, with heap usage on every node staying around 50%. Today the cluster crashed twice. In my monitoring console I could see heap usage on all nodes rise rapidly at the same time. Then all 6 nodes went into a full GC loop and the whole cluster hung. This is the GC log from when the problem happened:
[2015-09-10 15:39:15,854][INFO ][monitor.jvm ] [Brute II] [gc][young][21071][49601] duration [764ms], collections [1]/[1.4s], total [764ms]/[25.9m], memory [13.7gb]->[13.9gb]/[15.8gb], all_pools {[young] [83.1mb]->[2.5mb]/[1.1gb]}{[survivor] [34.1mb]->[34.1mb]/[149.7mb]}{[old] [13.6gb]->[13.8gb]/[14.5gb]}
[2015-09-10 15:39:18,892][INFO ][monitor.jvm ] [Brute II] [gc][young][21073][49603] duration [858ms], collections [1]/[1.6s], total [858ms]/[25.9m], memory [14.1gb]->[14.3gb]/[15.8gb], all_pools {[young] [5.6mb]->[4.4mb]/[1.1gb]}{[survivor] [34.1mb]->[34.1mb]/[149.7mb]}{[old] [14.1gb]->[14.3gb]/[14.5gb]}
[2015-09-10 15:39:41,173][WARN ][monitor.jvm ] [Brute II] [gc][old][21074][399] duration [21.7s], collections [1]/[22.2s], total [21.7s]/[1.8m], memory [14.3gb]->[12gb]/[15.8gb], all_pools {[young] [4.4mb]->[251.6mb]/[1.1gb]}{[survivor] [34.1mb]->[0b]/[149.7mb]}{[old] [14.3gb]->[11.8gb]/[14.5gb]}
[2015-09-10 15:39:42,452][INFO ][monitor.jvm ] [Brute II] [gc][young][21075][49605] duration [812ms], collections [1]/[1.2s], total [812ms]/[25.9m], memory [12gb]->[13.4gb]/[15.8gb], all_pools {[young] [251.6mb]->[571.4mb]/[1.1gb]}{[survivor] [0b]->[149.7mb]/[149.7mb]}{[old] [11.8gb]->[12.7gb]/[14.5gb]}
[2015-09-10 15:39:44,943][WARN ][monitor.jvm ] [Brute II] [gc][young][21076][49606] duration [1.5s], collections [1]/[2.4s], total [1.5s]/[26m], memory [13.4gb]->[14.1gb]/[15.8gb], all_pools {[young] [571.4mb]->[61.3mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [12.7gb]->[13.9gb]/[14.5gb]}
[2015-09-10 15:40:24,203][WARN ][monitor.jvm ] [Brute II] [gc][old][21079][400] duration [36.9s], collections [1]/[37.2s], total [36.9s]/[2.4m], memory [15.1gb]->[15.6gb]/[15.8gb], all_pools {[young] [1.1gb]->[1.1gb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [13.9gb]->[14.5gb]/[14.5gb]}

There were no GC log entries before this sudden heap spike. What could cause this kind of sudden heap growth, and how can I investigate the root cause?
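One way to make the trend in monitor.jvm lines like these easier to read is to extract the old-generation figures and express them as a fraction of the pool limit. The following is only a rough sketch: the function and pattern names are made up for illustration, and the regular expressions assume the exact line layout shown in the log above, with each entry on a single line.

```python
import re

# Matches the old-generation pool entry, e.g. {[old] [13.9gb]->[14.5gb]/[14.5gb]}
# (layout assumed from the log lines quoted in this thread).
OLD_POOL = re.compile(
    r"\{\[old\] \[([\d.]+)(b|kb|mb|gb)\]->\[([\d.]+)(b|kb|mb|gb)\]/\[([\d.]+)(b|kb|mb|gb)\]\}"
)
# Matches the collector kind, e.g. [gc][young] or [gc][old].
GC_KIND = re.compile(r"\[gc\]\[(young|old)\]")

UNIT = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def to_bytes(value, unit):
    """Convert a value such as ('13.9', 'gb') to a byte count."""
    return float(value) * UNIT[unit]


def old_gen_usage(line):
    """Return (gc_kind, fraction_before, fraction_after) for one log line.

    Returns None if the line does not look like a monitor.jvm GC entry.
    """
    kind = GC_KIND.search(line)
    pool = OLD_POOL.search(line)
    if not kind or not pool:
        return None
    before = to_bytes(pool.group(1), pool.group(2))
    after = to_bytes(pool.group(3), pool.group(4))
    limit = to_bytes(pool.group(5), pool.group(6))
    return kind.group(1), before / limit, after / limit
```

Applied to the last entry above, this reports an old collection that took the old pool from about 96% to 100% of its limit, meaning the 36.9s full GC reclaimed nothing from the old generation.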


(Mark Walkom) #2

It could be more queries?

Check hot_threads and slow log. Also what monitoring do you have on the system?
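For reference, hot threads are exposed through the `_nodes/hot_threads` REST endpoint. A minimal sketch of polling it from a script follows; the base URL `http://localhost:9200` and the helper names are assumptions for illustration, and the call only succeeds while the node is still responsive.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url="http://localhost:9200", threads=3, interval="500ms"):
    """Build the hot threads URL. base_url is an assumed address, not from the thread."""
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url="http://localhost:9200", timeout=10):
    """Return the plain-text hot threads report for all nodes.

    Raises an error from urllib if the node cannot answer in time,
    which is likely once a node is stuck in long GC pauses.
    """
    with urlopen(hot_threads_url(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Running something like this on a schedule and keeping the last few reports could give a snapshot from shortly before a hang, when an on-demand call is no longer possible.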


(Rural Hunter) #3

There should not have been a sudden query peak when the problem happened. I cannot get hot threads while it is happening because the cluster hangs. There are some slow queries in the log, but I think they are a result of the GC, as those queries are quite ordinary and are normally very quick.
I monitor heap usage, query counts, doc counts, and merge counts, and they were all normal before the problem.


(Rural Hunter) #4

This is the heap usage of those nodes. The final drop is after I restarted all the nodes. You can see the heap usage climbs very quickly before the restart.


(Jorge Gallegos) #5

@Rural_Hunter did you ever figure it out?


(Rural Hunter) #6

We suspect this was caused by some queries with a very large 'from' value, but we cannot confirm it. We added constraints on this value, and so far the problem seems to occur less often.
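To illustrate why a large 'from' can be expensive: each shard involved in a search has to collect the top from + size hits, and the coordinating node then merges that many entries from every shard. The sketch below works through the arithmetic and shows one possible application-side guard. The cap of 10,000 and the function names are assumptions chosen for illustration; the thread does not say what limit was actually applied.

```python
# Hypothetical application-side cap on from + size (value is an assumption).
MAX_RESULT_WINDOW = 10_000


def coordinating_entries(from_, size, num_shards):
    """Entries the coordinating node must merge for one paged search."""
    return num_shards * (from_ + size)


def check_page(from_, size, max_window=MAX_RESULT_WINDOW):
    """Validate paging parameters before they are sent to the cluster.

    Returns the paging part of a search body, or raises ValueError
    when the requested window is negative or deeper than the cap.
    """
    if from_ < 0 or size < 0:
        raise ValueError("from and size must be non-negative")
    if from_ + size > max_window:
        raise ValueError(
            f"from + size = {from_ + size} exceeds the cap of {max_window}"
        )
    return {"from": from_, "size": size}
```

With the 10 primary shards described in the first post, a request for 10 hits at from = 1,000,000 would ask the coordinating node to merge 10,000,100 entries, which is the kind of allocation that could plausibly drive a sudden heap spike.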

