Elasticsearch process using ~100% of CPU

I've been having a CPU usage issue for some time now on a single-node cluster, first on 2.1.0 and now on 2.1.1. The service takes up almost 100% CPU time despite there being no activity on the cluster (i.e., the Kibana and Logstash services are stopped):

top - 10:45:56 up 3 days, 15:13,  2 users,  load average: 2.18, 1.73, 1.52
Tasks: 138 total,   2 running, 136 sleeping,   0 stopped,   0 zombie
%Cpu(s): 56.4 us,  1.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.5 si, 41.1 st
KiB Mem:   3854780 total,  3459524 used,   395256 free,   342280 buffers
KiB Swap:        0 total,        0 used,        0 free.  1041724 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 1309 elastic+  20   0 5514232 1.474g  63304 S 96.8 40.1   2868:11 java
 1211 dd-agent  20   0  210528  23032   4272 S  3.0  0.6  62:58.72 python

If I look at the hot threads output, only a small percentage of that CPU usage is accounted for:

::: [asura-analytics00-test][fRwWxjs1To2qzIttd7EoLA][asura-analytics00-test][inet[/172.31.12.104:9300]]{master=true}
   Hot threads at 2016-04-11T10:48:43.151Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   19.6% (97.9ms out of 500ms) cpu usage by thread 'elasticsearch[asura-analytics00-test][listener][T#1]'
     7/10 snapshots sharing following 16 elements
       org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:308)
       org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
       org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:308)
       org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
       org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:254)
       org.elasticsearch.common.xcontent.support.AbstractXContentParser.map(AbstractXContentParser.java:208)
       org.elasticsearch.cluster.ClusterState.toXContent(ClusterState.java:431)
       org.elasticsearch.rest.action.admin.cluster.state.RestClusterStateAction$1.buildResponse(RestClusterStateAction.java:85)
       org.elasticsearch.rest.action.admin.cluster.state.RestClusterStateAction$1.buildResponse(RestClusterStateAction.java:80)
       org.elasticsearch.rest.action.support.RestBuilderListener.buildResponse(RestBuilderListener.java:37)
       org.elasticsearch.rest.action.support.RestResponseListener.processResponse(RestResponseListener.java:43)
       org.elasticsearch.rest.action.support.RestActionListener.onResponse(RestActionListener.java:49)
       org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:113)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:745)
     2/10 snapshots sharing following 13 elements
       org.elasticsearch.common.xcontent.XContentBuilder.writeValue(XContentBuilder.java:1197)
       org.elasticsearch.common.xcontent.XContentBuilder.writeMap(XContentBuilder.java:1163)
       org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:1072)
       org.elasticsearch.cluster.ClusterState.toXContent(ClusterState.java:437)
       org.elasticsearch.rest.action.admin.cluster.state.RestClusterStateAction$1.buildResponse(RestClusterStateAction.java:85)
       org.elasticsearch.rest.action.admin.cluster.state.RestClusterStateAction$1.buildResponse(RestClusterStateAction.java:80)
       org.elasticsearch.rest.action.support.RestBuilderListener.buildResponse(RestBuilderListener.java:37)
       org.elasticsearch.rest.action.support.RestResponseListener.processResponse(RestResponseListener.java:43)
       org.elasticsearch.rest.action.support.RestActionListener.onResponse(RestActionListener.java:49)
       org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:113)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:745)
     unique snapshot (snipped)
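
For reference, the hot threads snapshot above came from the standard nodes hot threads API; the parameters below mirror the interval and thread count shown in the header (they also happen to be the defaults):

curl -XGET 'localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'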

The log is currently full of entries like the one below, although I have already solved a few earlier issues that produced the same ~100% CPU usage with different log output.

[2016-04-11 10:52:49,058][INFO ][monitor.jvm              ] [-test] [gc][old][300596][29556] duration [6.1s], collections [1]/[6.8s], total [6.1s]/[4.5h], memory [1.1gb]->[1.1gb]/[1.2gb], all_pools {[young] [24.4mb]->[22.8mb]/[66.5mb]}{[survivor] [0b]->[0b]/[8.3mb]}{[old] [1.1gb]->[1.1gb]/[1.1gb]}
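
Reading that GC line: the old generation is pinned at its 1.1gb ceiling within a 1.2gb total heap, and the node has accumulated 4.5 hours of old-gen collection time, so the JVM is likely burning CPU in back-to-back full GCs rather than doing useful work. As a sketch (the config path assumes the Debian/Ubuntu package layout), heap pressure can be confirmed via the node stats API, and if heap_used_percent sits near 100 the heap should be raised:

curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_used_percent

# e.g. set ES_HEAP_SIZE=2g in /etc/default/elasticsearch (roughly half of this ~4gb box), then:
sudo service elasticsearch restart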

What Java version are you on?

Hi Mark,

Our server is reporting the following:

java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
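
One note on that runtime: my understanding is that Elasticsearch 2.x supports Java 7u55 and later, but Java 8 was the recommended JVM at the time, so upgrading is a cheap experiment. A sketch for Ubuntu 14.04, which has no stock OpenJDK 8 package, using the community openjdk-r PPA:

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jre-headless
sudo update-alternatives --config java    # pick the Java 8 entry
java -version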

Recently I have also seen very high CPU usage from Elasticsearch, to the point of affecting Kibana's performance and making it unusable. It is constantly above 100%. The system isn't live yet and only has a small number of nodes reporting in. I'm finalising the project and will demo it to the team very soon, and I hope I can fix this in time, so all help is appreciated.

top - 10:29:59 up 16:26, 2 users, load average: 1.62, 2.32, 3.00
Tasks: 148 total, 1 running, 147 sleeping, 0 stopped, 0 zombie
Cpu(s): 26.6%us, 0.5%sy, 2.0%ni, 70.9%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 7727384k total, 6207292k used, 1520092k free, 128664k buffers
Swap: 1048572k total, 0k used, 1048572k free, 3204784k cached

The whole stack runs on the same server, a VM, with only 5 machines sending their logs via Topbeat, Filebeat, Packetbeat and Winlogbeat. I have met the server requirements for memory and CPU, but I also read on an issue thread that a larger network switch MTU than the usual 1500 is recommended, with 9000 being better. However, such a change would affect a large number of machines, so it would be subject to the change board, etc.
Whilst I'm considering that change as a requirement for a production system receiving numerous logs from many nodes, I currently have only a small number and am surprised to see that "threshold" reached so soon! Hopefully this is not the cause of Elasticsearch's high demand for CPU.
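
If you want to rule the MTU theory in or out before going to the change board, it can be checked and tested per interface on the VM itself (eth0 below is an assumption about the interface name; 9000 only helps if every hop supports jumbo frames, and the setting reverts on reboot):

ip link show eth0                      # current MTU is printed on the first line
sudo ip link set dev eth0 mtu 9000     # temporary test; revert with mtu 1500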

I have tried the latest 2.3.2 build, which installed fine on my CentOS server, and CPU is still high!
I ran curl -XGET 'localhost:9200' to verify the version and it reported 2.2.1!

"cluster_name" : "elasticsearch",
"version" : {
"number" : "2.2.1",
"build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1",
"build_timestamp" : "2016-03-09T09:38:54Z",
"build_snapshot" : false,
"lucene_version" : "5.4.1"

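A mismatch like that usually means the previously installed instance is still the one answering on port 9200, or an older install comes first on the PATH. A quick sketch for CentOS (the service name assumes the RPM package; use ss where netstat isn't installed):

sudo netstat -tlnp | grep 9200     # which PID/binary owns the port
sudo service elasticsearch restart
curl -XGET 'localhost:9200'        # re-check the reported version
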
Can anyone advise anything for me to try? Many thanks.

Hi! You should install Marvel or another monitoring solution to get a realistic view of what's happening with the cluster just before and during a CPU spike. Another option is to capture hot threads while CPU is high (maybe you'll get a clue from there). A sketch of both follows.
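
Assuming the 2.x plugin layout (paths depend on how Elasticsearch and Kibana were installed), installing Marvel and grabbing a hot threads snapshot would look roughly like:

bin/plugin install license          # Marvel 2.x requires the license plugin first
bin/plugin install marvel-agent     # run on every Elasticsearch node, then restart it
bin/kibana plugin --install elasticsearch/marvel/latest    # from the Kibana directory

curl -XGET 'localhost:9200/_nodes/hot_threads?threads=10&interval=1s'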

Thanks Rusty, will have a look.

These issues may sound similar, but it's better if you create your own thread for this.