Elasticsearch High CPU Usage - GC Not Working

I am running a 5-node ES 1.7.5 cluster and have been hitting the same issue for the past few weeks: after running smoothly for about a week, the nodes start using 100% CPU and become unresponsive. I normally have to restart the nodes manually to get them working again.

It appears as if the GC is not working properly; here is a line from the logs:

[2016-06-23 17:44:23,621][WARN ][monitor.jvm] 
[elasticsearch1-elasticsearch-5-vm] [gc][old][218473][270] 
duration [1.2m], collections [1]/[1.2m], total [1.2m]/[17.7m], 
memory [25.2gb]->[25.2gb]/[25.9gb], 
all_pools {[young] [196.5mb]->[141.6mb]/[532.5mb]}{[survivor] 
[66.5mb]->[0b]/[66.5mb]}{[old] [25gb]->[25gb]/[25.3gb]}

I looked at my fielddata shortly before and it was only around 1.5GB, so I don't think fielddata is what's filling the heap. What else would be using so much of it?
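
For reference, fielddata (and the other heap consumers) can be checked with something like this, using the 1.x cat/stats endpoints (adjust the host as needed):

# Per-field fielddata usage on each node
curl 'localhost:9200/_cat/fielddata?v'
# Full indices-level breakdown (fielddata, filter cache, segments, completion) per node
curl 'localhost:9200/_nodes/stats/indices?pretty&human'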

Is the solution to this simply adding more memory to the system? It seems as if there is a memory leak, as this problem takes a few days to appear.

Anyone have an idea how to debug this?


Can you dump the heap and see what's consuming it?

I keep getting the following errors whenever I use jmap:

$ sudo jmap -dump:format=b,file=heap.hprof 25945
25945: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding

$ sudo jmap -F -dump:format=b,file=heap.hprof 25945
Attaching to process ID 25945, please wait...
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sun.tools.jmap.JMap.runTool(JMap.java:201)
        at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: java.lang.RuntimeException: zero stride: cannot read types.
        at sun.jvm.hotspot.HotSpotTypeDataBase.readVMTypes(HotSpotTypeDataBase.java:173)
        at sun.jvm.hotspot.HotSpotTypeDataBase.<init>(HotSpotTypeDataBase.java:89)
        at sun.jvm.hotspot.HotSpotAgent.setupVM(HotSpotAgent.java:395)
        at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:305)
        at sun.jvm.hotspot.HotSpotAgent.attach(HotSpotAgent.java:140)
        at sun.jvm.hotspot.tools.Tool.start(Tool.java:185)
        at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
        at sun.jvm.hotspot.tools.HeapDumper.main(HeapDumper.java:83)
        ... 6 more

Okay, I got through these errors and now have a massive heap dump. What do I do now?

For others who come across this, here's what I did to get the heap dump:

# Get jmap
sudo apt-get install openjdk-8-jdk
# Get the PID of the elasticsearch instance:
ps aux | grep elasticsearch | grep -v grep
# Switch into root
sudo su
# Switch into the elasticsearch user
su elasticsearch
# Dump the heap (you better have enough disk space for it)
jmap -dump:format=b,file=heap.hprof <PID_FROM_STEP_1>

You need to execute jmap as the user running the Elasticsearch process.

Open it in VisualVM and inspect what is taking up all the space in the heap.

Okay, I loaded the heap dump into VisualVM and got the following output. It looks like the vast majority of the heap is taken up by org.apache.lucene.search.ScoreDoc. From its name, I assume it keeps a score for every document in memory, no? Is this normal? Any way to reduce this memory load?

It holds a single search hit. Can you click through to several instances and show nearest GC root and see what's holding on to them?

It's a 17GB dump and my computer is not very happy processing it. VisualVM is taking literally 30 minutes to respond to each click.

I suspect that the heavy heap use may be a result of overly aggressive caching in ES 1.x as described in the documentation. Could that explain this behavior? If so, I may try setting _cache to false in my queries to see if that makes a difference.
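
Something like this is what I have in mind, a sketch only, using the 1.x filtered query with the terms filter's _cache flag (the field name is a placeholder):

# Sketch: disable caching on a terms filter in a 1.x filtered query (the "status" field is made up)
curl -XGET 'localhost:9200/da-prod7/_search?pretty' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "terms": {
          "status": ["active"],
          "_cache": false
        }
      }
    }
  }
}'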

You can increase the heap assigned to VisualVM. Start it with the flag -J-Xmx8g, for example, to increase the heap to 8 GB (go as high as you can up to your physical RAM).
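
For example, roughly (assuming the JDK-bundled launcher; the standalone visualvm binary takes the same -J flags):

# Start VisualVM with an 8 GB heap, then open heap.hprof from the File menu
jvisualvm -J-Xmx8g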

I'm not convinced.

Okay, clicked on a bunch and they all say "No GC Root" (see below). Not sure what that means.

Here's a sample of the score doc object:

And here is another:

I can't tell if this is abnormal or not. Finally, this is the top 20 largest objects in the summary panel:

Can you determine anything from this? What else can I try?

Maybe it's the completion suggester that is causing the problems? Below is /_stats/completion?pretty&human (abridged). The completion index is 25gb. If the completion suggester FST needs to be in each node's memory, that would be problematic. But I'm running five nodes, so if it's split across them, 5gb does not seem so bad.

{
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "indices" : {
    "da-prod7" : {
      "primaries" : {
        "completion" : {
          "size" : "12.4gb",
          "size_in_bytes" : 13374010920
        }
      },
      "total" : {
        "completion" : {
          "size" : "24.9gb",
          "size_in_bytes" : 26746738552
        }
      }
    }
  }
}
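
To see how that total is actually spread across shards and nodes, I think the same stats can be broken down further (1.x endpoints, from memory):

# Per-shard completion sizes for the index
curl 'localhost:9200/da-prod7/_stats/completion?level=shards&pretty&human'
# Completion memory as reported by each node
curl 'localhost:9200/_nodes/stats/indices/completion?pretty&human'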

These hits are not held in the cache, and anyway they are not rooted (retained size is zero) so they are not what is preventing the GC from freeing space in your heap.

Your last VisualVM screenshot shows significant memory being retained by the completion suggester. That appears to be the culprit here.

Yes, it seems that way. Any suggestions on dealing with this other than increasing memory and/or reducing what I am putting in the suggester?

Those are your only options here; completion suggesters are heap-hungry.
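
If you do trim what goes into it, the completion field's 1.x mapping options are the knobs to look at. A sketch only (type and field names are placeholders, and changing these requires reindexing):

# Sketch: a leaner completion field, no payloads and a capped input length (placeholder names)
curl -XPUT 'localhost:9200/da-prod7/_mapping/my_type' -d '{
  "my_type": {
    "properties": {
      "suggest": {
        "type": "completion",
        "payloads": false,
        "max_input_length": 50
      }
    }
  }
}'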

Hmm, this still doesn't make sense. The primary completion data is 12.4gb. I have five nodes and five shards with one replica, so ten shards total and two shards per node, which means each node should hold roughly 5gb of completion data (12.4gb of primaries, doubled for replicas, divided across 5 nodes). Each node has 54gb of memory, and I'm devoting 30gb of it to the ES heap. Just 5gb of completion data shouldn't put ES into this kind of state.

Also, shouldn't I be getting OutOfMemoryError exceptions rather than just burning CPU cycles on GC?
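
(For what it's worth, the standard HotSpot flags below would at least write a heap dump automatically if the JVM ever did throw a real OutOfMemoryError; the dump path is just an example:)

# Candidate additions to ES_JAVA_OPTS / bin/elasticsearch.in.sh (path is an example)
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch/heapdumps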

When I had a similar problem it was because of TTL running, so just asking: do you use TTL on your indexes?

What is TTL? The terminal?

This crash happened again. This time I had Marvel / Kibana set up, so I have detailed logging. Some of the charts are below, and I converted the full output to a PDF showing all events.

It's difficult to really understand what is happening. The charts show that right before the crash, memory use increased a bit and the JVM spent most of its time trying to garbage collect, which drove up the load until the machine became largely unresponsive.

The only culprit seems to be field data, which increased right before the crash:

But this increase seems relatively modest, a few hundred MB at most, and it's unclear why that would take down the entire node. Also, the fielddata only really increased at around 22:34, whereas the instability started a few minutes earlier, so the increased fielddata may be an effect of the problem rather than its cause.

This seems like an important issue, because the instability of a single machine is affecting the stability of the entire cluster: the node did not crash outright but became mostly unresponsive, so the master kicked it out of the cluster and added it back, over and over. The resulting heavy shard relocation made the entire cluster unresponsive, which undermines Elasticsearch's high-availability guarantees.
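
(One possible mitigation, separate from the heap problem itself, would be to make the master's fault detection more tolerant of a slow node so it stops flapping in and out. These discovery.zen.fd settings exist in 1.x, but the values below are guesses:)

# Possible elasticsearch.yml settings to tolerate a slow node longer before ejecting it (values are guesses)
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5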

And for the sake of completeness, here is the log output right at the crash time of 22:33:

[2016-07-01 02:33:09,415][INFO ][monitor.jvm              ] [elasticsearch1-elasticsearch-3-vm] [gc][young][272253][28134] duration [881ms], collections [1]/[1.7s], total [881ms]/[2.5h], memory [23.6gb]->[23.9gb]/[29.9gb], all_pools {[young] [77.6mb]->[8.9mb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [23.4gb]->[23.8gb]/[29.3gb]}
[2016-07-01 02:33:10,584][WARN ][monitor.jvm              ] [elasticsearch1-elasticsearch-3-vm] [gc][young][272254][28135] duration [1s], collections [1]/[1.1s], total [1s]/[2.5h], memory [23.9gb]->[24.4gb]/[29.9gb], all_pools {[young] [8.9mb]->[6.8mb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [23.8gb]->[24.4gb]/[29.3gb]}
[2016-07-01 02:33:27,199][WARN ][monitor.jvm              ] [elasticsearch1-elasticsearch-3-vm] [gc][young][272270][28138] duration [1s], collections [1]/[1.2s], total [1s]/[2.5h], memory [25gb]->[25.5gb]/[29.9gb], all_pools {[young] [21.4mb]->[50.2mb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [24.9gb]->[25.4gb]/[29.3gb]}
[2016-07-01 02:33:33,712][INFO ][monitor.jvm              ] [elasticsearch1-elasticsearch-3-vm] [gc][young][272276][28139] duration [822ms], collections [1]/[1.5s], total [822ms]/[2.5h], memory [25.9gb]->[26gb]/[29.9gb], all_pools {[young] [453.5mb]->[161.4mb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [25.4gb]->[25.8gb]/[29.3gb]}
[2016-07-01 02:33:34,958][WARN ][monitor.jvm              ] [elasticsearch1-elasticsearch-3-vm] [gc][young][272277][28140] duration [1s], collections [1]/[1.2s], total [1s]/[2.5h], memory [26gb]->[26.4gb]/[29.9gb], all_pools {[young] [161.4mb]->[18.9mb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [25.8gb]->[26.3gb]/[29.3gb]}
[2016-07-01 02:33:36,210][INFO ][monitor.jvm              ] [elasticsearch1-elasticsearch-3-vm] [gc][young][272278][28141] duration [940ms], collections [1]/[1.2s], total [940ms]/[2.5h], memory [26.4gb]->[26.8gb]/[29.9gb], all_pools {[young] [18.9mb]->[48.9mb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [26.3gb]->[26.7gb]/[29.3gb]}
[2016-07-01 02:33:44,758][INFO ][monitor.jvm              ] [elasticsearch1-elasticsearch-3-vm] [gc][young][272286][28143] duration [888ms], collections [1]/[1.5s], total [888ms]/[2.5h], memory [27gb]->[27.4gb]/[29.9gb], all_pools {[young] [63.1mb]->[3.1mb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [26.8gb]->[27.3gb]/[29.3gb]}
[2016-07-01 02:33:46,338][INFO ][monitor.jvm              ] [elasticsearch1-elasticsearch-3-vm] [gc][young][272287][28144] duration [790ms], collections [1]/[1.5s], total [790ms]/[2.5h], memory [27.4gb]->[27.9gb]/[29.9gb], all_pools {[young] [3.1mb]->[98.9mb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [27.3gb]->[27.8gb]/[29.3gb]}
[2016-07-01 02:33:47,538][WARN ][monitor.jvm              ] [elasticsearch1-elasticsearch-3-vm] [gc][young][272288][28145] duration [1s], collections [1]/[1.2s], total [1s]/[2.5h], memory [27.9gb]->[28.3gb]/[29.9gb], all_pools {[young] [98.9mb]->[3.7mb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [27.8gb]->[28.3gb]/[29.3gb]}

Any other insights or recommendations into this issue?

Can anyone help here? This continues to happen on a weekly basis to my cluster. It seems really bad that a single node can take down a cluster.
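
(In the meantime, a cheap way to keep an eye on heap pressure across the nodes and catch the next occurrence before the flapping starts, via the 1.x cat API:)

# Per-node heap and load at a glance; heap.percent pinned above ~90 usually precedes the GC death spiral
curl 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,load'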