Elasticsearch unable to free up heap, but taking a heap dump freed up 8GB

We had an incident in our ES cluster of 98 data nodes and 3 master nodes, running ES 2.4.2. The cause of the incident was multiple aggregations on an analyzed field, run against very big indices. Two nodes hit OutOfMemory errors and many nodes had their heap usage above 95% for hours.

We took a heap dump of one of the Elasticsearch data nodes using jmap. There were continuous full GCs on this node for almost 3 hours before the heap dump was taken. Also, the fielddata circuit breaker tripped many times on this node before any GC activity was logged in the Elasticsearch log. For almost 3 hours, the GC activity showed that almost nothing could be freed from the heap, as shown in the excerpt below from the log file (the IP is replaced by 10.x.x.x).

[2017-03-14 12:19:53,625][INFO ][monitor.jvm ] [10.x.x.x] [gc][old][1186394][1152] duration [9s], collections [1]/[9.2s], total [9s]/[2.1h], memory [15.4gb]->[15.3gb]/[15.8gb], all_pools {[young] [488.9mb]->[424.5mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2017-03-14 12:20:03,884][INFO ][monitor.jvm ] [10.x.x.x] [gc][old][1186395][1153] duration [9.9s], collections [1]/[10.4s], total [9.9s]/[2.1h], memory [15.3gb]->[15.4gb]/[15.8gb], all_pools {[young] [424.5mb]->[519.2mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
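
For reference, the heap dump was taken with jmap along these lines (the pid and output path here are illustrative, not the exact values we used); when the live option is passed, jmap forces a full GC so that only live objects end up in the dump:

# Illustrative command; <pid> and the output path are placeholders.
# The "live" option makes jmap trigger a full GC before writing the dump.
jmap -dump:live,format=b,file=/tmp/es-heap.hprof <pid>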

Right after the heap dump was taken with jmap, almost 8GB of heap was freed, as shown in the excerpt below from the log file (at 2017-03-14 12:53:01,937).

[2017-03-14 12:52:53,004][INFO ][monitor.jvm ] [10.x.x.x] [gc][old][1186459][1207] duration [9.3s], collections [1]/[9.7s], total [9.3s]/[2.6h], memory [15.4gb]->[15.3gb]/[15.8gb], all_pools {[young] [493mb]->[423mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2017-03-14 12:53:01,937][INFO ][monitor.jvm ] [10.x.x.x] [gc][old][1186460][1208] duration [8.2s], collections [1]/[8.9s], total [8.2s]/[2.6h], memory [15.3gb]->[7gb]/[15.8gb], all_pools {[young] [423mb]->[4.4mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[7gb]/[14.9gb]}

My question is: why was Elasticsearch unable to free up the heap even though a huge chunk of memory was occupied by garbage (as demonstrated by jmap)? Regarding JVM configuration, we follow Elastic's recommendations. Each data node has the configuration below:

-Xms16g -Xmx16g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -XX:-HeapDumpOnOutOfMemoryError
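
(For reference, on ES 2.x the heap portion of this is typically set through the ES_HEAP_SIZE environment variable rather than raw -Xms/-Xmx flags; a minimal sketch with an illustrative path:)

# Minimal sketch: ES_HEAP_SIZE=16g expands to -Xms16g -Xmx16g at startup.
# The file path is illustrative (Debian-style package install).
echo 'ES_HEAP_SIZE=16g' >> /etc/default/elasticsearch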

Many of the ES data nodes showed heap usage of > 95% for more than 3 hours, making the cluster unavailable and slow. We were able to fix the issue by clearing the caches, which brought heap usage back down to < 75%. Also, we disabled aggregation on the field that caused the issue.
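
For completeness, clearing the fielddata cache can be done with the clear cache API, something along these lines (host and port are illustrative):

# Clears only the fielddata cache across all indices;
# localhost:9200 is an illustrative endpoint.
curl -XPOST 'http://localhost:9200/_cache/clear?fielddata=true'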

Mesfin,

How are you determining that "a huge chunk of memory is occupied by garbage"?
It sounds very much like a huge chunk of memory was occupied by the fielddata cache, as you seem to have figured out and corrected yourself.

Also, we disabled aggregation on the field that caused the issue.

Has this issue returned since?

Thanks Glen for the response.

Just after the heap dump was taken, 8GB of heap was freed (as you can see from the log excerpt above; see also the attached image). I thought this happened because jmap forced a full GC. That's why I thought a huge chunk of memory (i.e. 8GB) was occupied by garbage. Hence my question: why couldn't Elasticsearch free up the memory even though full GC ran continuously for 2+ hours?

The issue hasn't returned since.

Mesfin,

Full GC didn't run continuously for 2+ hours; that figure is the aggregate GC time over the uptime of the JVM. (You may be saying that much of that aggregate was spent in many such GC events over the previous few hours, which I certainly believe: when memory isn't being released, it will keep trying.)

Right after the heap dump was taken, almost 8GB of heap was free (by jmap), as shown in the below excerpt from the log file (at 2017-03-14 12:53:01,937).

I don't have an explanation for that, and would be surprised if the sole act of taking a heap dump with jmap would flush the cache.

We were able to fix the issue by clearing the caches that brought back the heap usages to < 75%.

This is what I would expect. If you are finding that necessary, though, you should address the source of what's filling your fielddata cache (which you did). Keep in mind that, by default, it's unbounded.
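
A minimal sketch of putting a bound on it, assuming a Debian/RPM-style config path (indices.fielddata.cache.size is a static setting, so the node has to be restarted afterwards):

# The 40% value and the config path are illustrative.
echo 'indices.fielddata.cache.size: 40%' >> /etc/elasticsearch/elasticsearch.yml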

You may also already be aware that if you change the problematic field to not_analyzed, it will use doc_values by default, and no longer be loaded into fielddata cache.
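A rough sketch of what such a 2.x mapping looks like (index, type, and field names are made up; for an existing index this requires reindexing, since the mapping of a field can't be changed in place):

# Illustrative 2.x mapping: a not_analyzed string field uses doc_values
# by default, so it is not loaded into the fielddata cache.
curl -XPUT 'http://localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'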

As mentioned in the fielddata documentation, you can monitor per-node fielddata usage with the node stats API. You can also monitor the amount of fielddata being used per node, per field, with the cat fielddata API.
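
For example, something like this (host, port, and the fields parameter are illustrative):

# Per-node fielddata memory via the node stats API
curl 'http://localhost:9200/_nodes/stats/indices/fielddata?pretty'
# Per-node, per-field breakdown via the cat fielddata API
curl 'http://localhost:9200/_cat/fielddata?v&fields=*'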
