Hello,
We have an Elasticsearch cluster with 3 master nodes and 4 data nodes. Earlier today, one of the data nodes, 04, suddenly hit 100% CPU usage and stopped responding to any requests, including Kibana's metrics requests. Kibana itself also locked up and stopped responding during the outage, and the whole cluster returned 500s on all search requests. Restarting data node 04 fixed the issue. Afterwards we investigated the logs on 04 and Kibana, but could not find any indication of why the node suddenly maxed out and stopped responding.
Here are the logs from data node 04 right before we restarted it:
[2017-11-13T06:26:18,418][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24382372] overhead, spent [250ms] collecting in the last [1s]
[2017-11-13T06:26:19,421][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24382373] overhead, spent [364ms] collecting in the last [1s]
[2017-11-13T19:08:12,926][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428074] overhead, spent [267ms] collecting in the last [1s]
[2017-11-13T19:09:31,309][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428152] overhead, spent [272ms] collecting in the last [1s]
[2017-11-13T19:09:48,472][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428169] overhead, spent [312ms] collecting in the last [1s]
[2017-11-13T19:10:21,652][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428202] overhead, spent [255ms] collecting in the last [1s]
[2017-11-13T19:14:22,400][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428442] overhead, spent [258ms] collecting in the last [1s]
[2017-11-13T19:15:51,910][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428531] overhead, spent [317ms] collecting in the last [1s]
[2017-11-13T19:19:21,690][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428740] overhead, spent [255ms] collecting in the last [1s]
[2017-11-13T19:19:34,708][INFO ][o.e.m.j.JvmGcMonitorService] [search_data_04] [gc][24428753] overhead, spent [261ms] collecting in the last [1s]
[2017-11-13T19:19:35,668][INFO ][o.e.n.Node ] [search_data_04] stopping ...
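In case it helps with the investigation: next time this happens, we plan to capture hot threads and some node stats from 04 before restarting it, roughly along the lines of the commands below (search_data_01 is just a placeholder for whichever node is still responding):

# grab hot threads for the stuck node via a healthy node, since 04 itself was not answering
curl -s 'http://search_data_01:9200/_nodes/search_data_04/hot_threads?threads=5' > hot_threads_04.txt
# snapshot of JVM and thread pool stats for the same node, for comparison
curl -s 'http://search_data_01:9200/_nodes/search_data_04/stats/jvm,thread_pool' > node_stats_04.txt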
Two questions:
Is there anything we can try to prevent this from happening in the future?
Is there any way we can keep Kibana from locking up / not responding when one of the data nodes has maxed out its CPU?
Elasticsearch version info:
{
  "version" : {
    "number" : "5.1.2",
    "build_hash" : "c8c4c16",
    "build_date" : "2017-01-11T20:18:39.146Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  }
}