ES 1.5.2 cluster crashes

Rural_Hunter · September 10, 2015, 9:37am

Hi,

I have an ES 1.5.2 cluster with 6 nodes, each configured with a 16G heap. There are 20 shards in total (10 primary and 10 replica). The cluster holds about 300 million docs and about 1T of data.
The cluster ran fine for several weeks, with heap usage on all nodes staying at around 50%. Today the cluster crashed twice. In my monitoring console, I could see the heap usage of all nodes increase rapidly at the same time. Then all 6 nodes went into a full GC loop and the whole cluster hung. This is the GC log from when the problem happened:
[2015-09-10 15:39:15,854][INFO ][monitor.jvm ] [Brute II] [gc][young][21071][49601] duration [764ms], collections [1]/[1.4s], total [764ms]/[25.9m], memory [13.7gb]->[13.9gb]/[15.8gb], all_pools {[young] [83.1mb]->[2.5mb]/[1.1gb]}{[survivor] [34.1mb]->[34.1mb]/[149.7mb]}{[old] [13.6gb]->[13.8gb]/[14.5gb]}
[2015-09-10 15:39:18,892][INFO ][monitor.jvm ] [Brute II] [gc][young][21073][49603] duration [858ms], collections [1]/[1.6s], total [858ms]/[25.9m], memory [14.1gb]->[14.3gb]/[15.8gb], all_pools {[young] [5.6mb]->[4.4mb]/[1.1gb]}{[survivor] [34.1mb]->[34.1mb]/[149.7mb]}{[old] [14.1gb]->[14.3gb]/[14.5gb]}
[2015-09-10 15:39:41,173][WARN ][monitor.jvm ] [Brute II] [gc][old][21074][399] duration [21.7s], collections [1]/[22.2s], total [21.7s]/[1.8m], memory [14.3gb]->[12gb]/[15.8gb], all_pools {[young] [4.4mb]->[251.6mb]/[1.1gb]}{[survivor] [34.1mb]->[0b]/[149.7mb]}{[old] [14.3gb]->[11.8gb]/[14.5gb]}
[2015-09-10 15:39:42,452][INFO ][monitor.jvm ] [Brute II] [gc][young][21075][49605] duration [812ms], collections [1]/[1.2s], total [812ms]/[25.9m], memory [12gb]->[13.4gb]/[15.8gb], all_pools {[young] [251.6mb]->[571.4mb]/[1.1gb]}{[survivor] [0b]->[149.7mb]/[149.7mb]}{[old] [11.8gb]->[12.7gb]/[14.5gb]}
[2015-09-10 15:39:44,943][WARN ][monitor.jvm ] [Brute II] [gc][young][21076][49606] duration [1.5s], collections [1]/[2.4s], total [1.5s]/[26m], memory [13.4gb]->[14.1gb]/[15.8gb], all_pools {[young] [571.4mb]->[61.3mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [12.7gb]->[13.9gb]/[14.5gb]}
[2015-09-10 15:40:24,203][WARN ][monitor.jvm ] [Brute II] [gc][old][21079][400] duration [36.9s], collections [1]/[37.2s], total [36.9s]/[2.4m], memory [15.1gb]->[15.6gb]/[15.8gb], all_pools {[young] [1.1gb]->[1.1gb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [13.9gb]->[14.5gb]/[14.5gb]}

There were no GC log entries before this sudden heap usage peak. What could cause this kind of sudden heap increase, and how can I investigate the root cause?
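
One way to read the log above is to look only at how full the old generation is after each collection. The following is a rough Python sketch, not an official tool; the function names are made up for illustration, and the regex only assumes the `{[old] [before]->[after]/[limit]}` shape seen in these log lines.

```python
import re

# Size units as they appear in the monitor.jvm log lines.
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}
SIZE = r"\[([\d.]+)(b|kb|mb|gb)\]"
# Matches the old-generation pool: {[old] [before]->[after]/[limit]}
OLD_POOL = re.compile(r"\{\[old\] " + SIZE + "->" + SIZE + "/" + SIZE + r"\}")


def to_bytes(value, unit):
    """Convert a logged size such as ('14.3', 'gb') to bytes."""
    return float(value) * UNITS[unit]


def old_gen_after_gc(text):
    """Return the old-gen fill ratio (after / limit) for each GC entry in text."""
    ratios = []
    for m in OLD_POOL.finditer(text):
        after = to_bytes(m.group(3), m.group(4))
        limit = to_bytes(m.group(5), m.group(6))
        ratios.append(after / limit)
    return ratios


# Pool sections of the two [gc][old] entries from the log above.
sample = (
    "{[young] [4.4mb]->[251.6mb]/[1.1gb]}{[survivor] [34.1mb]->[0b]/[149.7mb]}"
    "{[old] [14.3gb]->[11.8gb]/[14.5gb]}\n"
    "{[young] [1.1gb]->[1.1gb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}"
    "{[old] [13.9gb]->[14.5gb]/[14.5gb]}\n"
)
for ratio in old_gen_after_gc(sample):
    print(f"old gen after GC: {ratio:.0%}")
```

For the two old-generation collections in the log, the second one took 36.9s and left the old pool at its 14.5gb limit. That suggests the heap was filled with live objects rather than garbage waiting to be collected.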

warkolm · September 10, 2015, 9:49am

It could be more queries?

Check hot_threads and slow log. Also what monitoring do you have on the system?
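
For reference, hot threads can be pulled over HTTP from the `_nodes/hot_threads` endpoint. The following is a minimal Python sketch, assuming a node reachable at `http://localhost:9200` (an assumed address); the helper names are hypothetical, and the request will only succeed while the node still responds.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, threads=3, interval="500ms"):
    """Build the hot threads URL for all nodes (helper name is hypothetical)."""
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url, timeout=10.0):
    """Fetch the plain-text hot threads report; raises if the node is unreachable."""
    with urlopen(hot_threads_url(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# The base URL is an assumption; point it at any node in the cluster.
print(hot_threads_url("http://localhost:9200"))
```

Calling `fetch_hot_threads` periodically and keeping the last few reports would give something to look at even if the cluster later stops answering.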

Rural_Hunter · September 10, 2015, 10:02am

There should not have been a sudden query peak when the problem happened. I cannot get hot threads when the problem happens because the cluster hangs. There are some slow queries, but I think they are a result of the GC, as those queries are quite ordinary and should normally be very quick.
I monitor heap usage, query counts, doc counts, and merge counts, and they were all normal before the problem.

Rural_Hunter · September 10, 2015, 10:08am

This is the heap usage of those nodes. The final drop is after I restarted all the nodes. You can see the heap usage climbs very quickly before the restart.

thekad · February 26, 2016, 5:13pm

@Rural_Hunter did you ever figure it out?

Rural_Hunter · February 29, 2016, 7:40am

We suspect this was caused by some queries with a very large 'from' value, but we cannot confirm this. We added some constraints on this value, and so far the problem seems to occur less often.
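
To illustrate why a large 'from' can be expensive: for a from/size search, each shard has to collect its top from + size hits, and the coordinating node then merges the results from every shard. The following is a rough Python sketch of that arithmetic plus a simple guard. The function names and the 10,000 limit are hypothetical and chosen only for illustration; they are not the actual constraints we added.

```python
def deep_paging_cost(from_, size, shards):
    """Return (hits collected per shard, hits merged on the coordinating node)."""
    per_shard = from_ + size
    return per_shard, per_shard * shards


MAX_WINDOW = 10_000  # hypothetical limit, for illustration only


def check_window(from_, size, max_window=MAX_WINDOW):
    """Reject requests that page deeper than max_window hits."""
    if from_ < 0 or size < 0:
        raise ValueError("from and size must be non-negative")
    if from_ + size > max_window:
        raise ValueError(f"from + size = {from_ + size} exceeds {max_window}")
    return from_, size


# A search over 10 shards, as in the cluster described above.
per_shard, merged = deep_paging_cost(from_=1_000_000, size=10, shards=10)
print(per_shard, merged)
```

With 10 shards, a single request with from=1,000,000 and size=10 means each shard tracks about a million hits and the coordinating node merges about ten million, so a few such requests at once could plausibly fill the heap on several nodes at the same time.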
