CPU 100% after upgrading from 5.6.5 to 6.4.0

Hi there!
After upgrading from 5.6.5 to 6.4.0 we are seeing the Elasticsearch Java process randomly spike to 100% CPU. It happens when executing certain queries from Kibana, but it's difficult to diagnose from the Elasticsearch log alone. It always happens when we load a table visualization with a one-year time range that has a lot of filters (and, I suppose, aggregations).
Elasticsearch stops responding, the elasticsearch service stops working too, and I have to manually kill the process and restart it with service elasticsearch start.

This is the output of the top command under normal usage.

PID  USER      PR   NI   VIRT RES    SHR    S   %CPU %MEM   TIME+   COMMAND
8029 elastic+  20   0 7726444 4.538g 157816 S   4.0  59.4   3:45.94 java
4142 kibana    20   0 2411680 1.233g  13652 S   0.7  16.1   7:44.75 node

When it goes up to 100%, I see a lot of rows like these in the cluster log file:

[2018-09-13T16:07:40,206][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][old][1859][6] duration [13s], collections [1]/[13.3s], total [13s]/[41.7s], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [74.6mb]->[127.4mb]/[133.1mb]}{[survivor] [0b]->[0b]/[16.6mb]}{[old] [3.8gb]->[3.8gb]/[3.8gb]}
[2018-09-13T16:07:40,206][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][1859] overhead, spent [13s] collecting in the last [13.3s]

and the top command shows:

PID  USER      PR   NI   VIRT RES    SHR    S   %CPU %MEM   TIME+   COMMAND
8029 elastic+  20   0 7984924 4.562g 162520 S 100.0 59.7   8:13.52 java

The host acts as both master and the single data node.
It's a CentOS 7 machine with 8GB of RAM and a dual-core CPU, running JDK 1.8.0_144.
Everything is installed via rpm packages.

In the elasticsearch JVM options file I set:

-Xms4g
-Xmx4g

Can you help me investigate further?
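If it helps, these are the commands I can run to capture heap and GC stats while the spike is happening (assuming the default localhost:9200 HTTP endpoint; adjust the host if yours is bound elsewhere):

# Per-node JVM heap pools and GC collector counters
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
# Quick overview of heap and RAM usage per node
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'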

Thank you

Hi @steno,

This is obviously a garbage collection problem:

  • Make sure that swapping is disabled on your machine
  • Turn on Elasticsearch monitoring and use the Kibana monitoring tools to investigate the issue further (a rough sketch of both steps follows below)
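Very roughly, and assuming a systemd-based RPM install like yours (paths and service names may differ), the two steps look something like this:

# 1) Disable swap now and keep it off after reboot
sudo swapoff -a
# ...and comment out the swap entry in /etc/fstab

# Alternatively/additionally, let Elasticsearch lock its heap in RAM:
# add to /etc/elasticsearch/elasticsearch.yml
#   bootstrap.memory_lock: true
# and allow the service to lock memory via a systemd override:
sudo systemctl edit elasticsearch
#   [Service]
#   LimitMEMLOCK=infinity
sudo systemctl daemon-reload
sudo systemctl restart elasticsearch

# 2) Turn on monitoring collection (included with the basic license on 6.3+)
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "persistent": { "xpack.monitoring.collection.enabled": true } }'

Once collection is enabled, the Monitoring tab in Kibana will show heap usage and GC frequency over time, which should make it easier to see what triggers the old-gen collections.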

Thank you @fkelbert. Disabling swap by modifying /etc/fstab improved the time spent on GC, but it did not solve the problem completely. Now Elasticsearch stays up and running, but Kibana looks frozen and only shows updated data after more than 40 seconds (and sometimes it ends in a request timeout after 30 seconds).
From the cluster logs I can see that it is spending less time on GC tasks:

[2018-09-25T13:58:34,940][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][476] overhead, spent [500ms] collecting in the last [1s]
[2018-09-25T13:58:36,058][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][477] overhead, spent [660ms] collecting in the last [1.1s]
[2018-09-25T13:58:37,109][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][478] overhead, spent [614ms] collecting in the last [1s]
[2018-09-25T13:58:38,111][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][479] overhead, spent [551ms] collecting in the last [1s]
[2018-09-25T13:58:39,113][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][480] overhead, spent [582ms] collecting in the last [1s]
[2018-09-25T13:58:40,113][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][481] overhead, spent [586ms] collecting in the last [1s]
[2018-09-25T13:58:41,113][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][482] overhead, spent [577ms] collecting in the last [1s]
[2018-09-25T13:58:42,114][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][483] overhead, spent [581ms] collecting in the last [1s]
[2018-09-25T13:58:43,119][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][484] overhead, spent [594ms] collecting in the last [1s]
[2018-09-25T13:58:44,121][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][485] overhead, spent [609ms] collecting in the last [1s]
[2018-09-25T13:58:45,178][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][486] overhead, spent [641ms] collecting in the last [1s]
[2018-09-25T13:58:46,236][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][487] overhead, spent [634ms] collecting in the last [1s]
[2018-09-25T13:58:47,321][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][488] overhead, spent [658ms] collecting in the last [1s]
[2018-09-25T13:58:48,390][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][489] overhead, spent [653ms] collecting in the last [1s]
[2018-09-25T13:58:49,476][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][490] overhead, spent [657ms] collecting in the last [1s]
[2018-09-25T13:59:02,537][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][old][491][4] duration [12.5s], collections [1]/[13s], total [12.5s]/[12.8s], memory [5.7gb]->[3.9gb]/[5.9gb], all_pools {[young] [13.3mb]->[12.2mb]/[133.1mb]}{[survivor] [16.6mb]->[0b]/[16.6mb]}{[old] [5.7gb]->[3.9gb]/[5.8gb]}
[2018-09-25T13:59:02,537][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][491] overhead, spent [12.7s] collecting in the last [13s]
[2018-09-25T13:59:03,538][INFO ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][492] overhead, spent [491ms] collecting in the last [1s]
[2018-09-25T13:59:04,564][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][493] overhead, spent [716ms] collecting in the last [1s]
[2018-09-25T13:59:05,619][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][494] overhead, spent [633ms] collecting in the last [1s]
[2018-09-25T13:59:06,687][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][495] overhead, spent [622ms] collecting in the last [1s]
[2018-09-25T13:59:07,748][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][496] overhead, spent [640ms] collecting in the last [1s]
[2018-09-25T13:59:08,826][WARN ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][497] overhead, spent [635ms] collecting in the last [1s]
[2018-09-25T13:59:09,836][INFO ][o.e.m.j.JvmGcMonitorService] [o-KoSvH] [gc][498] overhead, spent [353ms] collecting in the last [1s]

but it is not enough. I don't know why this happens on 6.x when it never happened on 5.6.
Is there something else I can do? Maybe on the Kibana side?

Thank you!

Yes, these are serious GC times. You may simply be running low on memory. Depending on your data size and the operations you run on it, 8GB of RAM might be too small.

If you don't have any bigger or additional machines available, one thing you can try is to start a free 14-day Elastic Cloud trial, reindex your data into the new cluster, and observe the performance there.
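As a rough sketch of what reindex-from-remote into a Cloud cluster could look like (the endpoint, credentials, and index names below are placeholders; your on-prem node must be reachable from the Cloud cluster, and its address has to be added to reindex.remote.whitelist on the destination cluster first):

# Run against the Cloud cluster's endpoint
curl -XPOST 'https://YOUR-CLOUD-ENDPOINT:9243/_reindex' -u elastic -H 'Content-Type: application/json' -d '
{
  "source": {
    "remote": { "host": "http://your-onprem-host:9200" },
    "index": "your-index"
  },
  "dest": { "index": "your-index" }
}'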

It's disappointing that downgrading back to 5.6 solved the issue. Newer versions should be less memory hungry, but it looks like that is not the case here, unfortunately. Even increasing memory to 12GB, with a 6GB heap for Elasticsearch, did not help.
I'll try to investigate further on a non-production machine.
Thank you for your support anyway, @fkelbert.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.