Sudden 100% CPU spike on a data node with Kibana becoming unresponsive

Hello,

We have an Elasticsearch cluster with 3 master nodes and 4 data nodes. Earlier today, one of the data nodes, 04, suddenly hit 100% CPU usage and stopped responding to any requests, including Kibana metrics requests. Kibana also locked up and stopped responding during the outage, and the whole cluster returned 500s on all search requests. Restarting data node 04 fixed the issue. Afterwards, we went through the logs on 04 and on Kibana, but could not find any indication of why data node 04 suddenly maxed out and stopped responding.

Here are the logs from right before we restarted data node 04:

[2017-11-13T06:26:18,418][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24382372] overhead, spent [250ms] collecting in the last [1s]
[2017-11-13T06:26:19,421][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24382373] overhead, spent [364ms] collecting in the last [1s]
[2017-11-13T19:08:12,926][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428074] overhead, spent [267ms] collecting in the last [1s]
[2017-11-13T19:09:31,309][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428152] overhead, spent [272ms] collecting in the last [1s]
[2017-11-13T19:09:48,472][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428169] overhead, spent [312ms] collecting in the last [1s]
[2017-11-13T19:10:21,652][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428202] overhead, spent [255ms] collecting in the last [1s]
[2017-11-13T19:14:22,400][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428442] overhead, spent [258ms] collecting in the last [1s]
[2017-11-13T19:15:51,910][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428531] overhead, spent [317ms] collecting in the last [1s]
[2017-11-13T19:19:21,690][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428740] overhead, spent [255ms] collecting in the last [1s]
[2017-11-13T19:19:34,708][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428753] overhead, spent [261ms] collecting in the last [1s]
[2017-11-13T19:19:35,668][INFO ][o.e.n.Node               ] [search_data_04] stopping ...
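
In hindsight, we restarted the node before capturing any diagnostics. Next time this happens we plan to grab the hot threads and JVM / thread pool stats on the affected node first, something like the following (a sketch, assuming Elasticsearch answers on localhost:9200 and the node name is search_data_04):

curl -XGET 'localhost:9200/_nodes/search_data_04/hot_threads'
curl -XGET 'localhost:9200/_nodes/search_data_04/stats/jvm,thread_pool?pretty'

If 04 itself stops answering HTTP again, we would try running these against one of the other nodes instead.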

Two questions:

1. Is there anything we can try to prevent this from happening in the future?
2. Is there any way to keep Kibana from locking up / becoming unresponsive when one of the data nodes has maxed out its CPU? (We sketched one idea below.)
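
For question 2, one idea we are considering (sketched below; the hostname kibana-coordinator is a placeholder, and we have not verified that this fixes the lockup) is to point Kibana at a dedicated coordinating-only node instead of a data node, and to make sure Kibana's request timeout is set so it eventually gives up on a stuck backend request:

# kibana.yml (Kibana 5.x; hostname is a placeholder)
elasticsearch.url: "http://kibana-coordinator:9200"
# as we understand it, how long Kibana waits for Elasticsearch before giving up (ms)
elasticsearch.requestTimeout: 30000

# elasticsearch.yml on the coordinating-only node
node.master: false
node.data: false
node.ingest: false

Does that sound like a reasonable direction, or is there a better way to isolate Kibana from a single overloaded data node?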

{
  "version" : {
    "number" : "5.1.2",
    "build_hash" : "c8c4c16",
    "build_date" : "2017-01-11T20:18:39.146Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  }
}

Update: Right before the outage, we saw a spike in the number of searches per second. Could the sudden increase in search rate have caused data node 04 to max out and stop responding like this?
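
To check this theory, we are going to watch the search thread pool on each data node during busy periods and see whether the queue fills up and requests start getting rejected, e.g. (again assuming localhost:9200):

curl -XGET 'localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'

If the rejected count climbs on 04 during a spike, that would at least tell us the search load was saturating its search thread pool.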
