Sudden 100% CPU spike on a data node with Kibana becoming unresponsive

Hello,

We have an Elasticsearch cluster with 3 master nodes and 4 data nodes. Earlier today, one of the data nodes, 04, suddenly hit 100% CPU usage and stopped responding to any requests, including Kibana metrics requests. Kibana also locked up and stopped responding during the outage, and the whole cluster returned 500s on all search requests. Restarting data node 04 fixed the issue. Afterwards, we went through the logs on 04 and on Kibana, but could not find any indication of why data node 04 suddenly maxed out and stopped responding.

Here are the logs from right before we restarted data node 04:

[2017-11-13T06:26:18,418][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24382372] overhead, spent [250ms] collecting in the last [1s]
[2017-11-13T06:26:19,421][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24382373] overhead, spent [364ms] collecting in the last [1s]
[2017-11-13T19:08:12,926][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428074] overhead, spent [267ms] collecting in the last [1s]
[2017-11-13T19:09:31,309][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428152] overhead, spent [272ms] collecting in the last [1s]
[2017-11-13T19:09:48,472][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428169] overhead, spent [312ms] collecting in the last [1s]
[2017-11-13T19:10:21,652][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428202] overhead, spent [255ms] collecting in the last [1s]
[2017-11-13T19:14:22,400][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428442] overhead, spent [258ms] collecting in the last [1s]
[2017-11-13T19:15:51,910][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428531] overhead, spent [317ms] collecting in the last [1s]
[2017-11-13T19:19:21,690][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428740] overhead, spent [255ms] collecting in the last [1s]
[2017-11-13T19:19:34,708][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428753] overhead, spent [261ms] collecting in the last [1s]
[2017-11-13T19:19:35,668][INFO ][o.e.n.Node               ] [search_data_04] stopping ...
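
In hindsight, we restarted the node before capturing any diagnostics. Next time this happens we plan to grab the hot threads and JVM / thread pool stats on the affected node first, something like the following (a sketch, assuming Elasticsearch answers on localhost:9200 and the node name is search_data_04):

curl -XGET 'localhost:9200/_nodes/search_data_04/hot_threads'
curl -XGET 'localhost:9200/_nodes/search_data_04/stats/jvm,thread_pool?pretty'

If 04 itself stops answering HTTP again, we would try running these against one of the other nodes instead.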

Two questions:

1. Is there anything we can try to prevent this from happening in the future?
2. Is there any way to keep Kibana from locking up / becoming unresponsive when one of the data nodes has maxed out its CPU? (We sketched one idea below.)
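
For question 2, one idea we are considering (sketched below; the hostname kibana-coordinator is a placeholder, and we have not verified that this fixes the lockup) is to point Kibana at a dedicated coordinating-only node instead of a data node, and to make sure Kibana's request timeout is set so it eventually gives up on a stuck backend request:

# kibana.yml (Kibana 5.x; hostname is a placeholder)
elasticsearch.url: "http://kibana-coordinator:9200"
# as we understand it, how long Kibana waits for Elasticsearch before giving up (ms)
elasticsearch.requestTimeout: 30000

# elasticsearch.yml on the coordinating-only node
node.master: false
node.data: false
node.ingest: false

Does that sound like a reasonable direction, or is there a better way to isolate Kibana from a single overloaded data node?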

{
  "version" : {
    "number" : "5.1.2",
    "build_hash" : "c8c4c16",
    "build_date" : "2017-01-11T20:18:39.146Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  }
}

Update: Right before the outage, we saw a spike in the number of searches per second. Could the sudden increase in search rate have caused data node 04 to max out and stop responding like this?
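
To check this theory, we are going to watch the search thread pool on each data node during busy periods and see whether the queue fills up and requests start getting rejected, e.g. (again assuming localhost:9200):

curl -XGET 'localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'

If the rejected count climbs on 04 during a spike, that would at least tell us the search load was saturating its search thread pool.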
