ES high CPU usage when idle

I have a 220GB read-only index (1node, 1 shard and 0 replicas) on ES2.4 running on EC2 (2CPU, 4GB RAM, 50% free SSD, 3GB heap).

Although no queries or indexing are running, I can see almost constant CPU usage around ~25% for the past two hours, during which the cluster/node was effectively idle. htop reports system loads of 0.51, 045 and 0.43 respectively, but iotop reports zero disk usage. It is ES' process that is using CPU, according to htop.

What is keeping my CPU busy? I can't find anything enlightening on hot_threads or _tasks.

GET /hot_threads response:

::: {esnode1}{KNojH4i7QNuY57ILD68cTw}{}{}
   Hot threads at 2017-06-01T15:58:56.219Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
    0.0% (113.1micros out of 500ms) cpu usage by thread 'elasticsearch[esnode1][transport_client_timer][T#1]{Hashed wheel timer #1}'
     10/10 snapshots sharing following 5 elements
       java.lang.Thread.sleep(Native Method)

Output from GET /_tasks:

    "nodes": {
        "KNojH4i7QNuY57ILD68cTw": {
            "name": "esnode1",
            "transport_address": "",
            "host": "",
            "ip": "",
            "tasks": {
                "KNojH4i7QNuY57ILD68cTw:4540": {
                    "node": "KNojH4i7QNuY57ILD68cTw",
                    "id": 4540,
                    "type": "transport",
                    "action": "cluster:monitor/tasks/lists",
                    "start_time_in_millis": 1496332684463,
                    "running_time_in_nanos": 91070
                "KNojH4i7QNuY57ILD68cTw:4541": {
                    "node": "KNojH4i7QNuY57ILD68cTw",
                    "id": 4541,
                    "type": "direct",
                    "action": "cluster:monitor/tasks/lists[n]",
                    "start_time_in_millis": 1496332684463,
                    "running_time_in_nanos": 32876,
                    "parent_task_id": "KNojH4i7QNuY57ILD68cTw:4540"

I can't check details about the tasks, as the number after the ":" on the task ID is incremented at every second so every GET _tasks returns different, incremented IDs (i.e. KNojH4i7QNuY57ILD68cTw:4504, KNojH4i7QNuY57ILD68cTw:4508, KNojH4i7QNuY57ILD68cTw:4514, etc)


I just encountered something similar, cpu usage would be on average 25% (going from 10% to 80% and back down every few seconds). Looking at the tasks lists didn't show anything most of the time, but once in a while it showed a bunch of bulk indexing jobs.

After some fiddleing it turns out it was x-packs cluster monitoring. After turning it off in elasticsearch and kibana CPU usage it down to 5% on average.

Can you share a bit about how you diagnosed it?

I'm at a loss as to how to look into this, although if I use a larger heap for this index the problem is seriously mitigated, to the point where CPU usage will drop quickly after a batch search instead of lingering around 25% for an hour or so...but I don't see any logs indicating lack of memory or similar, so I'm still baffled by what's actually causing this...

Lots of guesstimating took place, since I'm very new to ES :).

Basically I checked the running tasks. On cursory inspection they just showed the task query task itself, but I wasn't convinced that nothing else was happening when the percentage of cpu used was so high. Since the cpu usage of a task that top shows is just an average over the past polling period, my task queries were probably hitting the idle times of the ES process. When I then fired off the task query as quickly as possible bulk indexing jobs appeared every once in a while.

Since my ES instance is idle 99% of time time (an installation in my dev environment) the only other thing that was periodically accessing ES was x-pack with monitoring information. After turning off the monitoring both in elasticsearch.yml and kibana.yml, the CPU usage dropped.

From the symptoms you're describing, it might also be a number of other things:

  • Data shuffled between shards on the same node (might try to raise the throtteling limit and see if it gets better)
  • Your working set doesn't fit into the java heap, and ES needs to continuously swap out documents on heap and redo all the caching for the newly swapped in entries
  • Maybe some Java GC edge case is being hit, where it tries to free heap, but nothing is freeable so it goes into a loop? (I don't know enough about Java and its GCs to be able to say anything here for certain, just a guess. Although I'd be surprised if this were actually the case, since the Java GCs are quite mature and the ES devs would have tuned the java GC for performance).

Did you check what the disk IO looks like during the periods of high CPU usage? Check either with iotop, iostat or vmstat. iotop -o will actually show you the processes performing IO and iostat -ktx 3 will show how the IO subsystem is performing. vmstat will just show the blocks read and written in the bin and bout columns, but it gives a good first impression of how busy the system is with disk IO.

1 Like

Both CPU and I/O were reported by htop and iotop as basically inexistent. I'm not using any plugins.

Adding more memory to the node eliminated this behaviour.

It's a pity that, given that ES is so memory sensitive, you don't get a log message saying something as basic and straightforward as "you're running out of memory, best do something before the errors come raining"...or something, anything...

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.