3-node cluster dying intermittently with CPU spike

nwood888 · January 5, 2017, 12:05am

I have a 3-node 5.1.1 cluster that I believe is sized correctly. In the past 2 weeks, it's locked up twice, ultimately reaching the queue capacity of 1000 and requiring a full cluster restart to bring it back online. Elastic is running inside docker containers on 3 ec2 instances in different availability zones.

I believe these are the only relevant Elastic logs from when this happened:

[2017-01-04T13:40:44,434][INFO ][o.e.m.j.JvmGcMonitorService] [es1] [gc][1118938] overhead, spent [294ms] collecting in the last [1.1s]
[2017-01-04T13:42:29,514][INFO ][o.e.c.m.MetaDataMappingService] [es1] [.monitoring-kibana-2-2017.01.04/0kTB19GeRg686SSUtC8l5g] update_mapping [kibana_stats]
[2017-01-04T13:43:33,268][WARN ][o.e.m.j.JvmGcMonitorService] [es1] [gc][1119105] overhead, spent [1.9s] collecting in the last [2.7s]
[2017-01-04T13:43:46,771][INFO ][o.e.n.Node ] [es1] initializing ...
[2017-01-04T13:43:46,841][INFO ][o.e.e.NodeEnvironment ] [es1] using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/xvdf1)]], net usable_space [134.9gb], net total_space [147.5gb], spins? [possibly], types [ext4]
[2017-01-04T13:43:46,842][INFO ][o.e.e.NodeEnvironment ] [es1] heap size [6.9gb], compressed ordinary object pointers [true]
[2017-01-04T13:43:46,911][INFO ][o.e.n.Node ] [es1] node name [es1], node ID [3nX__vT2Q4aA-XjLh9m7Lw]
[2017-01-04T13:43:46,913][INFO ][o.e.n.Node ] [es1] version[5.1.1], pid[10], build[5395e21/2016-12-06T12:36:15.409Z], OS[Linux/4.4.35-33.55.amzn1.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_111/25.111-b14]
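
To put numbers on the two JvmGcMonitorService lines above, here is a minimal Python sketch that parses the "overhead, spent [X] collecting in the last [Y]" message and reports the share of wall-clock time spent in GC. The function names are illustrative, and the regex only handles the `ms` and `s` units seen in these logs.

```python
import re

# Matches "[gc][N] overhead, spent [X] collecting in the last [Y]".
# Only the "ms" and "s" units that appear in the logs above are handled.
GC_LINE = re.compile(
    r"\[gc\]\[\d+\] overhead, spent \[([\d.]+)(ms|s)\] "
    r"collecting in the last \[([\d.]+)(ms|s)\]"
)


def to_seconds(value, unit):
    """Convert a logged duration to seconds."""
    return float(value) / 1000.0 if unit == "ms" else float(value)


def gc_overhead(line):
    """Return the fraction of the window spent in GC, or None if no match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    spent = to_seconds(m.group(1), m.group(2))
    window = to_seconds(m.group(3), m.group(4))
    return spent / window


sample = [
    "[2017-01-04T13:40:44,434][INFO ][o.e.m.j.JvmGcMonitorService] [es1] "
    "[gc][1118938] overhead, spent [294ms] collecting in the last [1.1s]",
    "[2017-01-04T13:43:33,268][WARN ][o.e.m.j.JvmGcMonitorService] [es1] "
    "[gc][1119105] overhead, spent [1.9s] collecting in the last [2.7s]",
]

for line in sample:
    ratio = gc_overhead(line)
    if ratio is not None:
        # line[1:24] is the timestamp inside the first pair of brackets.
        print(f"{line[1:24]}  GC overhead {ratio:.0%}")
```

For these two lines the ratio goes from roughly 27% to roughly 70% in about three minutes, which would be consistent with heap pressure building just before the node restarted.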

My Zabbix graphs show a CPU spike starting right around 13:40 on all 3 nodes. It looks like that might correspond with the garbage collection in the logs, but I'm not sure. It then appears that the docker container restarted on its own. I haven't found anything in the system logs to indicate why that might have happened.

Looking at the marvel graphs for the first (master) node at 13:40, heap was ~5GB out of 7GB. CPU was bumping around between 0 and 25% for the previous several hours. Index and search latency were low. Segment count was ~600.

Some thoughts that I've had but haven't finished exploring:

Maybe this is caused by a rogue query? If so, any tips on tracking this down?
The docker MTU is 1500 and the host primary network interface MTU is 9001. I've seen older references to this causing connectivity problems between nodes, but it seems like those issues might have been worked out in recent versions of Elastic and Docker.
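
On the rogue-query idea, one possible starting point is to snapshot a few diagnostic endpoints while the CPU spike is in progress. The sketch below is only a starting point: it assumes the node is reachable at `http://localhost:9200` without authentication, and that the hot threads, task management, and cat thread pool APIs behave on 5.1.1 as they do in the 5.x documentation. The helper names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: adjust to wherever a node's HTTP port is exposed.
BASE_URL = "http://localhost:9200"


def diagnostic_urls(base_url=BASE_URL):
    """Build URLs for endpoints worth capturing during a CPU spike."""

    def url(path, **params):
        # Keep '*' and ',' readable instead of percent-encoding them.
        return f"{base_url}{path}?" + urlencode(params, safe="*,")

    return {
        # Stack samples of the busiest threads on every node.
        "hot_threads": url("/_nodes/hot_threads", threads=5),
        # Currently running search tasks, with descriptions.
        "search_tasks": url("/_tasks", actions="*search*", detailed="true"),
        # Per-pool active/queued/rejected counts (the queue that hit 1000).
        "thread_pool": url(
            "/_cat/thread_pool",
            v="true",
            h="node_name,name,active,queue,rejected",
        ),
    }


def fetch(url, timeout=10):
    """Fetch one endpoint and return the response body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


for name, target in diagnostic_urls().items():
    print(name, target)
    # During an incident, save the output with a timestamp, e.g.:
    # print(fetch(target))
```

Capturing these every few seconds from a cron job or a loop would give something to compare against the Zabbix graphs. The search slow log (the `index.search.slowlog.threshold.*` index settings) is another way to catch expensive queries after the fact.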

I'm wondering how to troubleshoot this further as I'm not seeing anything extra in the logs. Any direction would be much appreciated.

Nick

warkolm · January 5, 2017, 12:10am

How many indices, shards and how much data?
What JVM?

nwood888 · January 5, 2017, 12:36am

44 indices
289 shards
23.3M documents
16GB data

These are c4.2xlarge instance types
Using the official "elasticsearch:5" docker image
JVM is: Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_111/25.111-b14
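
For a sense of scale, the figures above work out as follows. This is plain arithmetic on the numbers in this post. It assumes shards are spread evenly across the 3 nodes and that the 289 shards and 16GB are counted the same way (the post does not say whether replicas are included), so treat the results as rough averages.

```python
def cluster_averages(shards, data_gb, documents, nodes):
    """Return rough per-node and per-shard averages for a cluster."""
    return {
        "shards_per_node": shards / nodes,
        "avg_shard_mb": data_gb * 1024 / shards,
        "docs_per_shard": documents / shards,
    }


# Figures quoted in this post.
stats = cluster_averages(shards=289, data_gb=16, documents=23_300_000, nodes=3)

for key, value in stats.items():
    print(f"{key}: {value:.1f}")
```

That is roughly 96 shards per node at an average of about 57MB each.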

tronicum · January 12, 2017, 9:44am

I get this when I stream too much data into a docker cluster (1.8 GB of access logs on a VMware VM with 24 GB RAM and 4 CPUs).

