OOM | load issues | quick question about mixed patch level in cluster

Hey,

I just wanted to ask whether running a few nodes at a newer patch level within a 5.2.0 cluster (in our case, three 5.2.2 nodes) is a really, REALLY bad idea.

We are seeing OOM crashes on the 5.2.2 nodes, preceded by crazy high load (>50 on a machine with 10 cores + 10 HT cores). Could that be related to running mixed patch levels?

Seeing the 5.2.2 nodes misbehaving like that raises some red flags about upgrading all nodes...

es logfile | es config

OS:
Linux es-big-19 4.4.0-69-generic #90~14.04.1-Ubuntu SMP Thu Mar 16 19:30:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Having mixed versions across a cluster is indeed a bad idea. For primary shards assigned to the nodes with the higher version, it may not be possible to create replica shards on the lower-version nodes because of differences in Lucene versions. That can lead to shard imbalances and potential data loss.
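To check quickly whether a cluster is running mixed versions, one option is to group nodes by the version reported by `GET _cat/nodes?h=name,version`. The following is a minimal sketch, not an official tool: the sample text is made up for illustration (only `es-big-19` appears in this thread), and it assumes node names contain no whitespace.

```python
from collections import defaultdict


def group_nodes_by_version(cat_nodes_text):
    """Group node names by version.

    Expects the plain-text output of `GET _cat/nodes?h=name,version`,
    i.e. one "name version" pair per line.
    """
    by_version = defaultdict(list)
    for line in cat_nodes_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            # Skip blank or unexpected lines rather than guessing.
            continue
        name, version = parts
        by_version[version].append(name)
    return dict(by_version)


def is_mixed(by_version):
    """True if more than one distinct version is present."""
    return len(by_version) > 1


# Hypothetical sample output; es-big-01 and es-big-02 are invented names.
sample = """\
es-big-01 5.2.0
es-big-02 5.2.0
es-big-19 5.2.2
"""

groups = group_nodes_by_version(sample)
mixed = is_mixed(groups)
```

If `mixed` is true, the nodes listed under the higher version are the ones whose primaries may not be able to replicate to the rest of the cluster.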

But aside from the replica shard assignment, it's "okay"? :)

[edit]
Still wondering what exactly happens when the 5.2.2 nodes just randomly spiral into their death :-/

There may be other issues that I am not aware of, so I would recommend upgrading all nodes to the same patch level.

Okay, we moved all nodes to 5.2.2. We are still seeing OOM deaths, probably related to the higher query cache limit (we raised it from 2% to 6% because 5.2.2 fixed some memory leaks).

The query cache is still leaking memory somewhere. For now we have moved the query cache limit back to 2% and are hoping for an uneventful weekend.
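One way to see whether the query cache actually stays near its configured limit is to compare its reported size against the maximum heap per node. The sketch below assumes the input is the parsed JSON from `GET _nodes/stats/indices,jvm`; the node ID and byte counts in the sample are invented for illustration, and the limit itself is presumably the `indices.queries.cache.size` setting, which the thread does not name explicitly.

```python
def query_cache_heap_share(node_stats):
    """Return {node_name: query cache bytes / max heap bytes}.

    `node_stats` is assumed to be the decoded response of
    `GET _nodes/stats/indices,jvm`.
    """
    shares = {}
    for node in node_stats.get("nodes", {}).values():
        cache_bytes = node["indices"]["query_cache"]["memory_size_in_bytes"]
        heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
        shares[node["name"]] = cache_bytes / heap_max
    return shares


GIB = 1024 ** 3

# Hypothetical response fragment: 1 GiB of query cache on a 20 GiB heap.
sample_stats = {
    "nodes": {
        "hypothetical-node-id": {
            "name": "es-big-14",
            "indices": {"query_cache": {"memory_size_in_bytes": 1 * GIB}},
            "jvm": {"mem": {"heap_max_in_bytes": 20 * GIB}},
        }
    }
}

shares = query_cache_heap_share(sample_stats)
```

A share well above the configured percentage would point at the cache itself; a share at or below the limit while the heap still fills up would suggest the memory is going somewhere else.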

[2017-03-25T17:54:17,743][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8626] overhead, spent [7.9s] collecting in the last [8.2s]
[2017-03-25T17:54:26,548][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8627][755] duration [7.9s], collections [1]/[9.1s], total [7.9s]/[27.9m], memory [20.2gb]->[20.2gb]/[20.3gb], all_po$
[2017-03-25T17:54:26,548][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8627] overhead, spent [7.9s] collecting in the last [9.1s]
[2017-03-25T17:54:35,305][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8628][756] duration [8.1s], collections [1]/[8.7s], total [8.1s]/[28m], memory [20.2gb]->[20.3gb]/[20.3gb], all_pool$
[2017-03-25T17:54:35,305][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8628] overhead, spent [8.1s] collecting in the last [8.7s]
[2017-03-25T17:54:44,336][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8629][757] duration [8.4s], collections [1]/[9s], total [8.4s]/[28.2m], memory [20.3gb]->[20.3gb]/[20.3gb], all_pool$
[2017-03-25T17:54:44,357][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8629] overhead, spent [8.4s] collecting in the last [9s]
[2017-03-25T17:54:52,639][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8630][758] duration [8s], collections [1]/[8.2s], total [8s]/[28.3m], memory [20.3gb]->[20.3gb]/[20.3gb], all_pools $
[2017-03-25T17:54:52,639][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8630] overhead, spent [8s] collecting in the last [8.2s]
[2017-03-25T17:55:01,384][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8631][759] duration [8.5s], collections [1]/[8.7s], total [8.5s]/[28.4m], memory [20.3gb]->[20.3gb]/[20.3gb], all_po$
[2017-03-25T17:55:01,384][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8631] overhead, spent [8.5s] collecting in the last [8.7s]
[2017-03-25T17:55:09,525][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8632][760] duration [7.6s], collections [1]/[7.8s], total [7.6s]/[28.6m], memory [20.3gb]->[20.3gb]/[20.3gb], all_po$
[2017-03-25T17:55:09,525][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8632] overhead, spent [7.6s] collecting in the last [7.8s]
[2017-03-25T17:55:18,271][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8633][761] duration [8.8s], collections [1]/[9s], total [8.8s]/[28.7m], memory [20.3gb]->[20.3gb]/[20.3gb], all_pool$
[2017-03-25T17:55:18,271][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8633] overhead, spent [8.8s] collecting in the last [9s]
[2017-03-25T17:55:26,762][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8634][762] duration [8.2s], collections [1]/[8.1s], total [8.2s]/[28.8m], memory [20.3gb]->[20.3gb]/[20.3gb], all_po$
[2017-03-25T17:55:26,762][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8634] overhead, spent [8.2s] collecting in the last [8.1s]
[2017-03-25T17:55:35,641][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8635][763] duration [8.6s], collections [1]/[9.2s], total [8.6s]/[29m], memory [20.3gb]->[20.3gb]/[20.3gb], all_pool$
[2017-03-25T17:55:35,641][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8635] overhead, spent [8.6s] collecting in the last [9.2s]
[2017-03-25T17:55:43,958][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8636][764] duration [7.7s], collections [1]/[8s], total [7.7s]/[29.1m], memory [20.3gb]->[20.3gb]/[20.3gb], all_pool$
[2017-03-25T17:55:43,958][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8636] overhead, spent [7.7s] collecting in the last [8s]
[2017-03-25T17:55:52,936][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8637][765] duration [9.1s], collections [1]/[9.2s], total [9.1s]/[29.3m], memory [20.3gb]->[20.3gb]/[20.3gb], all_po$
[2017-03-25T17:55:52,936][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8637] overhead, spent [9.1s] collecting in the last [9.2s]
[2017-03-25T17:56:01,058][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8638][766] duration [8s], collections [1]/[8.1s], total [8s]/[29.4m], memory [20.3gb]->[20.3gb]/[20.3gb], all_pools $
[2017-03-25T17:56:01,058][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8638] overhead, spent [8s] collecting in the last [8.1s]
[2017-03-25T17:56:18,569][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8639][767] duration [9.2s], collections [1]/[9.2s], total [9.2s]/[29.6m], memory [20.3gb]->[20.3gb]/[20.3gb], all_po$
[2017-03-25T17:56:27,408][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8639] overhead, spent [9.2s] collecting in the last [9.2s]
[2017-03-25T17:58:17,349][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][old][8640][777] duration [1.4m], collections [10]/[1.4m], total [1.4m]/[31m], memory [20.3gb]->[20.3gb]/[20.3gb], all_poo$
[2017-03-25T17:59:31,930][WARN ][o.e.m.j.JvmGcMonitorService] [es-big-14] [gc][8640] overhead, spent [1.4m] collecting in the last [1.4m]
[2017-03-25T18:06:58,819][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [es-big-14] fatal error in thread [elasticsearch[es-big-14][warmer][T#5]], exiting
java.lang.OutOfMemoryError: Java heap space
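The log above shows the typical pattern before the OOM: each old-generation collection takes almost the whole reporting window and frees essentially nothing. A rough sketch for pulling those numbers out of the `[gc][old]` lines is below. The regular expression is written against the line format shown in this thread only (sizes in `gb`), and the function names are made up for illustration.

```python
import re

# Matches the [gc][old] INFO lines as they appear in the log above.
GC_LINE = re.compile(
    r"\[gc\]\[old\]\[\d+\]\[\d+\] duration \[(?P<duration>[^\]]+)\], "
    r"collections \[\d+\]/\[(?P<window>[^\]]+)\], "
    r"total \[[^\]]+\]/\[[^\]]+\], "
    r"memory \[(?P<before>[\d.]+)gb\]->\[(?P<after>[\d.]+)gb\]/\[(?P<max>[\d.]+)gb\]"
)

UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def to_seconds(value):
    """Convert a time value such as '7.9s' or '1.4m' to seconds."""
    match = re.fullmatch(r"([\d.]+)(ms|s|m|h)", value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return float(match.group(1)) * UNIT_SECONDS[match.group(2)]


def summarize_gc_line(line):
    """Return GC metrics for one [gc][old] line, or None if it does not match."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    before = float(match.group("before"))
    after = float(match.group("after"))
    heap_max = float(match.group("max"))
    return {
        # Share of the reporting window spent in garbage collection.
        "gc_fraction": to_seconds(match.group("duration")) / to_seconds(match.group("window")),
        # Heap reclaimed by the collection, in GB.
        "freed_gb": before - after,
        # Heap still in use after the collection, relative to the maximum.
        "heap_used_fraction": after / heap_max,
    }


# One line copied from the log above (truncated at "$" as in the original paste).
line = (
    "[2017-03-25T17:54:26,548][INFO ][o.e.m.j.JvmGcMonitorService] [es-big-14] "
    "[gc][old][8627][755] duration [7.9s], collections [1]/[9.1s], "
    "total [7.9s]/[27.9m], memory [20.2gb]->[20.2gb]/[20.3gb], all_po$"
)

summary = summarize_gc_line(line)
```

For that line, the node spent most of the 9.1 s window collecting and the heap stayed at 20.2 GB out of 20.3 GB, so the collector had nothing it could reclaim.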
