CPU 100% after upgrading from 5.6.5 to 6.4.0

Hi there!
After upgrading from 5.6.5 to 6.4.0 we are seeing the Elasticsearch Java process randomly spike to 100% CPU. It happens when executing certain queries from Kibana, but it's difficult to diagnose from the Elasticsearch log alone. It always happens when we load a table visualization with a one-year time range that has a lot of filters (and, I suppose, aggregations).
Elasticsearch stops responding, the elasticsearch service stops working too, and I have to manually kill the process and restart it with service elasticsearch start.

This is the output of the top command under normal usage.

PID  USER      PR   NI   VIRT RES    SHR    S   %CPU %MEM   TIME+   COMMAND
8029 elastic+  20   0 7726444 4.538g 157816 S   4.0  59.4   3:45.94 java
4142 kibana    20   0 2411680 1.233g  13652 S   0.7  16.1   7:44.75 node

When it goes up to 100%, I see a lot of rows like these in the cluster log file:

[2018-09-13T16:07:40,206][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][old][1859][6] duration [13s], collections [1]/[13.3s], total [13s]/[41.7s], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [74.6mb]->[127.4mb]/[133.1mb]}{[survivor] [0b]->[0b]/[16.6mb]}{[old] [3.8gb]->[3.8gb]/[3.8gb]}
[2018-09-13T16:07:40,206][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][1859] overhead, spent [13s] collecting in the last [13.3s]

and the top command shows:

PID  USER      PR   NI   VIRT RES    SHR    S   %CPU %MEM   TIME+   COMMAND
8029 elastic+  20   0 7984924 4.562g 162520 S 100.0 59.7   8:13.52 java

The host acts as both master and the single data node.
It's a CentOS 7 machine with 8GB of RAM and a dual-core CPU, running JDK 1.8.0_144.
Everything is installed via rpm packages.

In the elasticsearch JVM options file I set:

-Xms4g
-Xmx4g

Can you help me investigate further?
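If it helps, these are the commands I can run to capture heap and GC stats while the spike is happening (assuming the default localhost:9200 HTTP endpoint; adjust the host if yours is bound elsewhere):

# Per-node JVM heap pools and GC collector counters
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
# Quick overview of heap and RAM usage per node
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'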

Thank you

Hi @steno,

This is obviously a garbage collection problem:

  • Make sure that swapping is disabled on your machine
  • Turn on Elasticsearch monitoring and use the Kibana monitoring tools to investigate the issue further (a rough sketch of both steps follows below)
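Very roughly, and assuming a systemd-based RPM install like yours (paths and service names may differ), the two steps look something like this:

# 1) Disable swap now and keep it off after reboot
sudo swapoff -a
# ...and comment out the swap entry in /etc/fstab

# Alternatively/additionally, let Elasticsearch lock its heap in RAM:
# add to /etc/elasticsearch/elasticsearch.yml
#   bootstrap.memory_lock: true
# and allow the service to lock memory via a systemd override:
sudo systemctl edit elasticsearch
#   [Service]
#   LimitMEMLOCK=infinity
sudo systemctl daemon-reload
sudo systemctl restart elasticsearch

# 2) Turn on monitoring collection (included with the basic license on 6.3+)
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "persistent": { "xpack.monitoring.collection.enabled": true } }'

Once collection is enabled, the Monitoring tab in Kibana will show heap usage and GC frequency over time, which should make it easier to see what triggers the old-gen collections.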

Thank you @fkelbert. Disabling swap by modifying /etc/fstab improved the time spent on GC, but it did not solve the problem completely. Now Elasticsearch stays up and running, but Kibana looks frozen and only shows updated data after more than 40 seconds (and sometimes it ends in a request timeout after 30 seconds).
From the cluster logs I can see that it is spending less time on GC tasks:

[2018-09-25T13:58:34,940][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][476] overhead, spent [500ms] collecting in the last [1s]
[2018-09-25T13:58:36,058][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][477] overhead, spent [660ms] collecting in the last [1.1s]
[2018-09-25T13:58:37,109][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][478] overhead, spent [614ms] collecting in the last [1s]
[2018-09-25T13:58:38,111][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][479] overhead, spent [551ms] collecting in the last [1s]
[2018-09-25T13:58:39,113][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][480] overhead, spent [582ms] collecting in the last [1s]
[2018-09-25T13:58:40,113][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][481] overhead, spent [586ms] collecting in the last [1s]
[2018-09-25T13:58:41,113][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][482] overhead, spent [577ms] collecting in the last [1s]
[2018-09-25T13:58:42,114][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][483] overhead, spent [581ms] collecting in the last [1s]
[2018-09-25T13:58:43,119][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][484] overhead, spent [594ms] collecting in the last [1s]
[2018-09-25T13:58:44,121][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][485] overhead, spent [609ms] collecting in the last [1s]
[2018-09-25T13:58:45,178][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][486] overhead, spent [641ms] collecting in the last [1s]
[2018-09-25T13:58:46,236][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][487] overhead, spent [634ms] collecting in the last [1s]
[2018-09-25T13:58:47,321][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][488] overhead, spent [658ms] collecting in the last [1s]
[2018-09-25T13:58:48,390][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][489] overhead, spent [653ms] collecting in the last [1s]
[2018-09-25T13:58:49,476][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][490] overhead, spent [657ms] collecting in the last [1s]
[2018-09-25T13:59:02,537][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][old][491][4] duration [12.5s], collections [1]/[13s], total [12.5s]/[12.8s], memory [5.7gb]->[3.9gb]/[5.9gb], all_pools {[young] [13.3mb]->[12.2mb]/[133.1mb]}{[survivor] [16.6mb]->[0b]/[16.6mb]}{[old] [5.7gb]->[3.9gb]/[5.8gb]}
[2018-09-25T13:59:02,537][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][491] overhead, spent [12.7s] collecting in the last [13s]
[2018-09-25T13:59:03,538][INFO ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][492] overhead, spent [491ms] collecting in the last [1s]
[2018-09-25T13:59:04,564][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][493] overhead, spent [716ms] collecting in the last [1s]
[2018-09-25T13:59:05,619][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][494] overhead, spent [633ms] collecting in the last [1s]
[2018-09-25T13:59:06,687][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][495] overhead, spent [622ms] collecting in the last [1s]
[2018-09-25T13:59:07,748][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][496] overhead, spent [640ms] collecting in the last [1s]
[2018-09-25T13:59:08,826][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][497] overhead, spent [635ms] collecting in the last [1s]
[2018-09-25T13:59:09,836][INFO ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][498] overhead, spent [353ms] collecting in the last [1s]

but it is not enough. I don't know why this happens on 6.x when it never happened on 5.6.
Is there something else I can do? Maybe on the Kibana side?

Thank you!

Yes, these are serious GC times. You may simply be running low on memory. Depending on your data size and the operations you run on it, 8GB of RAM might be too small.

If you don't have any bigger or additional machines available, one thing you can try is to start a free 14-day Elastic Cloud trial, reindex your data into the new cluster, and observe the performance there.
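As a rough sketch of what reindex-from-remote into a Cloud cluster could look like (the endpoint, credentials, and index names below are placeholders; your on-prem node must be reachable from the Cloud cluster, and its address has to be added to reindex.remote.whitelist on the destination cluster first):

# Run against the Cloud cluster's endpoint
curl -XPOST 'https://YOUR-CLOUD-ENDPOINT:9243/_reindex' -u elastic -H 'Content-Type: application/json' -d '
{
  "source": {
    "remote": { "host": "http://your-onprem-host:9200" },
    "index": "your-index"
  },
  "dest": { "index": "your-index" }
}'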

It's disappointing that downgrading back to 5.6 solved the issue. Newer versions should be less memory hungry, but it looks like that is not the case here, unfortunately. Even increasing memory to 12GB, with a 6GB heap for Elasticsearch, did not help.
I'll try to investigate further on a non-production machine.
Thank you for your support anyway, @fkelbert.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.