Elasticsearch filling up all its memory and crashing - how to debug

We used to run Solr search on our project. I am part of the operations team, and we noticed Solr crashing all the time with OOM exceptions.
Since I have been using Elasticsearch for centralised logging, monitoring, some custom services etc., I recommended switching to Elastic. The performance gain in the team's project is incredible, but we have run into a memory problem again.

[2019-08-28T12:39:03,615][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383352]    [13845] duration [1s], collections [1]/[1.6s], total [1s]/[11.2m], memory [18.9gb]->[19gb]/[63.8gb], all_pools {[young] [579.1mb]->[7mb]/[1.3gb]}{[survivor] [27.1mb]->[166.3mb]/[166.3mb]}{[old] [18.3gb]->[18.9gb]/[62.3gb]}
[2019-08-28T12:39:03,615][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383352] overhead, spent [1s] collecting in the last [1.6s]
[2019-08-28T12:39:05,390][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383353] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:39:07,193][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383354][13847] duration [1.3s], collections [1]/[1.8s], total [1.3s]/[11.2m], memory [20.4gb]->[21.7gb]/[63.8gb], all_pools {[young] [14.2mb]->[18.3mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [20.2gb]->[21.5gb]/[62.3gb]}
[2019-08-28T12:39:07,193][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383354] overhead, spent [1.3s] collecting in the last [1.8s]
[2019-08-28T12:39:08,915][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383355] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:39:10,677][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383356][13849] duration [1.3s], collections [1]/[1.7s], total [1.3s]/[11.3m], memory [22.9gb]->[24.2gb]/[63.8gb], all_pools {[young] [21mb]->[22.5mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [22.8gb]->[24gb]/[62.3gb]}
[2019-08-28T12:39:10,677][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383356] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:39:12,360][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383357][13850] duration [1.2s], collections [1]/[1.6s], total [1.2s]/[11.3m], memory [24.2gb]->[25.5gb]/[63.8gb], all_pools {[young] [22.5mb]->[24.1mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [24gb]->[25.3gb]/[62.3gb]}
[2019-08-28T12:39:12,360][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383357] overhead, spent [1.2s] collecting in the last [1.6s]
.
.
.
[2019-08-28T12:41:39,190][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383390] overhead, spent [877ms] collecting in the last [1.2s]
[2019-08-28T12:41:41,010][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383391]
[2019-08-28T12:41:44,649][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383393][13886] duration [1.4s], collections [1]/[1.8s], total [1.4s]/[12.3m], memory [54.6gb]->[55.9gb]/[63.8gb], all_pools {[young] [26.3mb]->[26.4mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [54.4gb]->[55.7gb]/[62.3gb]}
[2019-08-28T12:41:44,649][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383393] overhead, spent [1.4s] collecting in the last [1.8s]
[2019-08-28T12:41:46,501][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383394][13887] duration [1.4s], collections [1]/[1.8s], total [1.4s]/[12.3m], memory [55.9gb]->[57.2gb]/[63.8gb], all_pools {[young] [26.4mb]->[26.4mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [55.7gb]->[57gb]/[62.3gb]}
[2019-08-28T12:41:46,501][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383394] overhead, spent [1.4s] collecting in the last [1.8s]
[2019-08-28T12:41:48,277][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383395][13888] duration [1.3s], collections [1]/[1.7s], total [1.3s]/[12.3m], memory [57.2gb]->[58.5gb]/[63.8gb], all_pools {[young] [26.4mb]->[26.3mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [57gb]->[58.3gb]/[62.3gb]}
[2019-08-28T12:41:48,277][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383395] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:41:50,051][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383396][13889] duration [1.3s], collections [1]/[1.7s], total [1.3s]/[12.3m], memory [58.5gb]->[59.8gb]/[63.8gb], all_pools {[young] [26.3mb]->[26.4mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [58.3gb]->[59.6gb]/[62.3gb]}
[2019-08-28T12:41:50,052][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383396] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:41:51,890][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383397][13890] duration [1.4s], collections [1]/[1.8s], total [1.4s]/[12.4m], memory [59.8gb]->[61.1gb]/[63.8gb], all_pools {[young] [26.4mb]->[26.4mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [59.6gb]->[60.9gb]/[62.3gb]}
[2019-08-28T12:45:24,631][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][old][383401][17] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[4.8m], memory [63.7gb]->[63.4gb]/[63.8gb], all_pools {[young] [1.3gb]->[1gb]/[1.3gb]}{[survivor] [92.6mb]->[0b]/[166.3mb]}{[old] [62.3gb]->[62.3gb]/[62.3gb]}
[2019-08-28T12:45:24,631][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383401] overhead, spent [1.6m] collecting in the last [1.6m]
[2019-08-28T12:47:15,870][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][old][383403][18] duration [1.8m], collections [1]/[1.8m], total [1.8m]/[6.6m], memory [63.7gb]->[63.5gb]/[63.8gb], all_pools {[young] [1.3gb]->[1.2gb]/[1.3gb]}{[survivor] [119.1mb]->[0b]/[166.3mb]}{[old] [62.3gb]->[62.3gb]/[62.3gb]}

As you can see, within about 20 minutes the heap filled up (if I interpret the logs correctly): young GC kicked in first, and then old-generation collections that took well over a minute each.
We restarted the cluster before it crashed outright, but for about 10 minutes it remained completely unresponsive, so we had to act fast because the calls kept coming in.
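
To watch this live instead of reconstructing it from the GC warnings, one option is to poll the nodes stats API and log heap usage per node over time. A minimal sketch, assuming Python with the requests library and a node reachable on localhost:9200 without auth (adjust URL/auth for your setup):

```python
import time
import requests  # pip install requests

ES = "http://localhost:9200"  # adjust host/port/auth for your cluster

# Poll JVM heap usage every 30 seconds and print a one-line summary per node,
# so you can see how fast the heap grows and whether GC ever frees it again.
while True:
    stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        print(f'{time.strftime("%H:%M:%S")} {node["name"]}: '
              f'{mem["heap_used_in_bytes"] / 2**30:.1f}gb / '
              f'{mem["heap_max_in_bytes"] / 2**30:.1f}gb '
              f'({mem["heap_used_percent"]}%)')
    time.sleep(30)
```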

So my question is: how can I debug this without access to the application? Where should I be looking? My personal Elasticsearch clusters run with far less memory while indexing far more data and handling terabytes. This cluster simply holds 10 indices of 150mb, so I can't really imagine what is going downhill other than some freaky app code.
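
One place to look, even without access to the application, is the node stats breakdown of the usual on-heap consumers: fielddata, caches, segment memory and the circuit breakers. A rough sketch along the same lines (the field names are the standard node stats fields in the 6.x/7.x APIs; adjust URL/auth for your cluster):

```python
import requests  # pip install requests

ES = "http://localhost:9200"  # adjust host/port/auth for your cluster

# Pull indices + breaker stats per node and print the usual on-heap consumers.
stats = requests.get(f"{ES}/_nodes/stats/indices,breaker").json()

def gb(n):
    return f"{n / 2**30:.2f}gb"

for node in stats["nodes"].values():
    idx = node["indices"]
    print(f'--- {node["name"]} ---')
    print("fielddata:     ", gb(idx["fielddata"]["memory_size_in_bytes"]))
    print("query cache:   ", gb(idx["query_cache"]["memory_size_in_bytes"]))
    print("request cache: ", gb(idx["request_cache"]["memory_size_in_bytes"]))
    print("segments:      ", gb(idx["segments"]["memory_in_bytes"]))
    for name, brk in node["breakers"].items():
        print(f'breaker {name}: estimated {gb(brk["estimated_size_in_bytes"])}, '
              f'tripped {brk["tripped"]} times')
```

If none of those numbers come anywhere near the 60gb+ of old-gen usage in the logs, the growth is more likely driven by individual requests (huge aggregations, deep pagination), which the parent breaker counters and the slow logs would also hint at.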

It seems you have configured more than 30gb of heap for your process. This is not recommended. Is there a reason you picked such a large heap? Have you used the nodes stats/info APIs to check how much memory is actually needed while running?
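
For example, you can compare the configured heap with what the nodes actually use, and check whether compressed object pointers are still in effect. A sketch, again assuming Python + requests against localhost:9200 (adjust for your cluster):

```python
import requests  # pip install requests

ES = "http://localhost:9200"  # adjust host/port/auth for your cluster

info = requests.get(f"{ES}/_nodes/jvm").json()         # configured heap per node
stats = requests.get(f"{ES}/_nodes/stats/jvm").json()  # current heap usage per node

for node_id, node in info["nodes"].items():
    jvm = node["jvm"]
    heap_max_gb = jvm["mem"]["heap_max_in_bytes"] / 2**30
    used_pct = stats["nodes"][node_id]["jvm"]["mem"]["heap_used_percent"]
    # Heaps above roughly 31gb usually lose compressed object pointers, which
    # makes every object reference twice as big; the node reports whether
    # they are enabled (field name as exposed by the nodes info API).
    compressed = jvm.get("using_compressed_ordinary_object_pointers", "unknown")
    print(f'{node["name"]}: heap max {heap_max_gb:.1f}gb, '
          f'currently used {used_pct}%, compressed oops: {compressed}')
```

The heap itself is set with -Xms/-Xmx in jvm.options; the usual guidance is to keep both values equal, stay below the compressed-oops cutoff (roughly 31gb), and use no more than about half the machine's RAM so the rest remains available to the filesystem cache.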
