Elasticsearch filling up all its memory and crashing - how to debug

We used to run Solr search on our project. I am part of the operations team, and we noticed Solr crashing all the time with OOM exceptions.
Since I have been using Elasticsearch for centralised logging, monitoring, some custom services etc., I recommended switching to Elastic. The performance gain in the team's project is incredible, but we have run into a memory problem again.

[2019-08-28T12:39:03,615][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383352]    [13845] duration [1s], collections [1]/[1.6s], total [1s]/[11.2m], memory [18.9gb]->[19gb]/[63.8gb], all_pools {[young] [579.1mb]->[7mb]/[1.3gb]}{[survivor] [27.1mb]->[166.3mb]/[166.3mb]}{[old] [18.3gb]->[18.9gb]/[62.3gb]}
[2019-08-28T12:39:03,615][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383352] overhead, spent [1s] collecting in the last [1.6s]
[2019-08-28T12:39:05,390][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383353] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:39:07,193][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383354][13847] duration [1.3s], collections [1]/[1.8s], total [1.3s]/[11.2m], memory [20.4gb]->[21.7gb]/[63.8gb], all_pools {[young] [14.2mb]->[18.3mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [20.2gb]->[21.5gb]/[62.3gb]}
[2019-08-28T12:39:07,193][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383354] overhead, spent [1.3s] collecting in the last [1.8s]
[2019-08-28T12:39:08,915][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383355] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:39:10,677][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383356][13849] duration [1.3s], collections [1]/[1.7s], total [1.3s]/[11.3m], memory [22.9gb]->[24.2gb]/[63.8gb], all_pools {[young] [21mb]->[22.5mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [22.8gb]->[24gb]/[62.3gb]}
[2019-08-28T12:39:10,677][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383356] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:39:12,360][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383357][13850] duration [1.2s], collections [1]/[1.6s], total [1.2s]/[11.3m], memory [24.2gb]->[25.5gb]/[63.8gb], all_pools {[young] [22.5mb]->[24.1mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [24gb]->[25.3gb]/[62.3gb]}
[2019-08-28T12:39:12,360][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383357] overhead, spent [1.2s] collecting in the last [1.6s]
.
.
.
[2019-08-28T12:41:39,190][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383390] overhead, spent [877ms] collecting in the last [1.2s]
[2019-08-28T12:41:41,010][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383391]
[2019-08-28T12:41:44,649][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383393][13886] duration [1.4s], collections [1]/[1.8s], total [1.4s]/[12.3m], memory [54.6gb]->[55.9gb]/[63.8gb], all_pools {[young] [26.3mb]->[26.4mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [54.4gb]->[55.7gb]/[62.3gb]}
[2019-08-28T12:41:44,649][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383393] overhead, spent [1.4s] collecting in the last [1.8s]
[2019-08-28T12:41:46,501][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383394][13887] duration [1.4s], collections [1]/[1.8s], total [1.4s]/[12.3m], memory [55.9gb]->[57.2gb]/[63.8gb], all_pools {[young] [26.4mb]->[26.4mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [55.7gb]->[57gb]/[62.3gb]}
[2019-08-28T12:41:46,501][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383394] overhead, spent [1.4s] collecting in the last [1.8s]
[2019-08-28T12:41:48,277][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383395][13888] duration [1.3s], collections [1]/[1.7s], total [1.3s]/[12.3m], memory [57.2gb]->[58.5gb]/[63.8gb], all_pools {[young] [26.4mb]->[26.3mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [57gb]->[58.3gb]/[62.3gb]}
[2019-08-28T12:41:48,277][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383395] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:41:50,051][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383396][13889] duration [1.3s], collections [1]/[1.7s], total [1.3s]/[12.3m], memory [58.5gb]->[59.8gb]/[63.8gb], all_pools {[young] [26.3mb]->[26.4mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [58.3gb]->[59.6gb]/[62.3gb]}
[2019-08-28T12:41:50,052][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383396] overhead, spent [1.3s] collecting in the last [1.7s]
[2019-08-28T12:41:51,890][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][young][383397][13890] duration [1.4s], collections [1]/[1.8s], total [1.4s]/[12.4m], memory [59.8gb]->[61.1gb]/[63.8gb], all_pools {[young] [26.4mb]->[26.4mb]/[1.3gb]}{[survivor] [166.3mb]->[166.3mb]/[166.3mb]}{[old] [59.6gb]->[60.9gb]/[62.3gb]}
[2019-08-28T12:45:24,631][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][old][383401][17] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[4.8m], memory [63.7gb]->[63.4gb]/[63.8gb], all_pools {[young] [1.3gb]->[1gb]/[1.3gb]}{[survivor] [92.6mb]->[0b]/[166.3mb]}{[old] [62.3gb]->[62.3gb]/[62.3gb]}
[2019-08-28T12:45:24,631][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][383401] overhead, spent [1.6m] collecting in the last [1.6m]
[2019-08-28T12:47:15,870][WARN ][o.e.m.j.JvmGcMonitorService] [Cart-1] [gc][old][383403][18] duration [1.8m], collections [1]/[1.8m], total [1.8m]/[6.6m], memory [63.7gb]->[63.5gb]/[63.8gb], all_pools {[young] [1.3gb]->[1.2gb]/[1.3gb]}{[survivor] [119.1mb]->[0b]/[166.3mb]}{[old] [62.3gb]->[62.3gb]/[62.3gb]}

As you can see, within about 20 minutes the heap filled up (if I interpret the logs correctly): young GC kicked in first, and then old-generation collections that took well over a minute each.
We restarted the cluster before it crashed outright, but for about 10 minutes it remained completely unresponsive, so we had to act fast because the calls kept coming in.
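
To watch this live instead of reconstructing it from the GC warnings, one option is to poll the nodes stats API and log heap usage per node over time. A minimal sketch, assuming Python with the requests library and a node reachable on localhost:9200 without auth (adjust URL/auth for your setup):

```python
import time
import requests  # pip install requests

ES = "http://localhost:9200"  # adjust host/port/auth for your cluster

# Poll JVM heap usage every 30 seconds and print a one-line summary per node,
# so you can see how fast the heap grows and whether GC ever frees it again.
while True:
    stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        print(f'{time.strftime("%H:%M:%S")} {node["name"]}: '
              f'{mem["heap_used_in_bytes"] / 2**30:.1f}gb / '
              f'{mem["heap_max_in_bytes"] / 2**30:.1f}gb '
              f'({mem["heap_used_percent"]}%)')
    time.sleep(30)
```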

So my question is: how can I debug this without access to the application? Where should I be looking? My personal Elasticsearch clusters run with far less memory while indexing far more data and handling terabytes. This cluster simply holds 10 indices of 150mb, so I can't really imagine what is going downhill other than some freaky app code.
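
One place to look, even without access to the application, is the node stats breakdown of the usual on-heap consumers: fielddata, caches, segment memory and the circuit breakers. A rough sketch along the same lines (the field names are the standard node stats fields in the 6.x/7.x APIs; adjust URL/auth for your cluster):

```python
import requests  # pip install requests

ES = "http://localhost:9200"  # adjust host/port/auth for your cluster

# Pull indices + breaker stats per node and print the usual on-heap consumers.
stats = requests.get(f"{ES}/_nodes/stats/indices,breaker").json()

def gb(n):
    return f"{n / 2**30:.2f}gb"

for node in stats["nodes"].values():
    idx = node["indices"]
    print(f'--- {node["name"]} ---')
    print("fielddata:     ", gb(idx["fielddata"]["memory_size_in_bytes"]))
    print("query cache:   ", gb(idx["query_cache"]["memory_size_in_bytes"]))
    print("request cache: ", gb(idx["request_cache"]["memory_size_in_bytes"]))
    print("segments:      ", gb(idx["segments"]["memory_in_bytes"]))
    for name, brk in node["breakers"].items():
        print(f'breaker {name}: estimated {gb(brk["estimated_size_in_bytes"])}, '
              f'tripped {brk["tripped"]} times')
```

If none of those numbers come anywhere near the 60gb+ of old-gen usage in the logs, the growth is more likely driven by individual requests (huge aggregations, deep pagination), which the parent breaker counters and the slow logs would also hint at.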

It seems you have configured more than 30gb of heap for your process. This is not recommended. Is there a reason you picked such a large heap? Have you used the nodes stats/info APIs to check how much memory is actually needed while running?
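
For example, you can compare the configured heap with what the nodes actually use, and check whether compressed object pointers are still in effect. A sketch, again assuming Python + requests against localhost:9200 (adjust for your cluster):

```python
import requests  # pip install requests

ES = "http://localhost:9200"  # adjust host/port/auth for your cluster

info = requests.get(f"{ES}/_nodes/jvm").json()         # configured heap per node
stats = requests.get(f"{ES}/_nodes/stats/jvm").json()  # current heap usage per node

for node_id, node in info["nodes"].items():
    jvm = node["jvm"]
    heap_max_gb = jvm["mem"]["heap_max_in_bytes"] / 2**30
    used_pct = stats["nodes"][node_id]["jvm"]["mem"]["heap_used_percent"]
    # Heaps above roughly 31gb usually lose compressed object pointers, which
    # makes every object reference twice as big; the node reports whether
    # they are enabled (field name as exposed by the nodes info API).
    compressed = jvm.get("using_compressed_ordinary_object_pointers", "unknown")
    print(f'{node["name"]}: heap max {heap_max_gb:.1f}gb, '
          f'currently used {used_pct}%, compressed oops: {compressed}')
```

The heap itself is set with -Xms/-Xmx in jvm.options; the usual guidance is to keep both values equal, stay below the compressed-oops cutoff (roughly 31gb), and use no more than about half the machine's RAM so the rest remains available to the filesystem cache.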
