We had an incident in our Elasticsearch cluster of 98 data nodes and 3 master nodes, running ES 2.4.2. The cause of the incident was multiple aggregations on an analyzed field run against very large indices. Two nodes hit an OutOfMemoryError and many nodes kept their heap usage above 95% for hours.
We took a heap dump of one of the Elasticsearch data nodes using jmap (the invocation is shown after the log excerpt below). There had been continuous full GCs on this node for almost 3 hours before the dump was taken. Also, the fielddata circuit breaker tripped many times on this node before any GC activity was logged in the Elasticsearch log. For almost 3 hours the GC activity shows that almost nothing could be freed from the heap, as in the excerpt below from the log file (the IP is replaced by 10.x.x.x).
[2017-03-14 12:19:53,625][INFO ][monitor.jvm ] [10.x.x.x] [gc][old][1186394][1152] duration [9s], collections [1]/[9.2s], total [9s]/[2.1h], memory [15.4gb]->[15.3gb]/[15.8gb], all_pools {[young] [488.9mb]->[424.5mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2017-03-14 12:20:03,884][INFO ][monitor.jvm ] [10.x.x.x] [gc][old][1186395][1153] duration [9.9s], collections [1]/[10.4s], total [9.9s]/[2.1h], memory [15.3gb]->[15.4gb]/[15.8gb], all_pools {[young] [424.5mb]->[519.2mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
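For completeness, the dump was taken with a command along these lines (the PID and output path below are placeholders, not the exact values we used):

jmap -dump:live,format=b,file=/tmp/es-data-node.hprof <elasticsearch-pid>

Note that the live option makes the JVM run a full GC before writing the dump, so only reachable objects end up in the file.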
Right after the heap dump was taken, almost 8 GB of heap was freed (by the jmap run), as shown in the excerpt below from the log file (at 2017-03-14 12:53:01,937).
[2017-03-14 12:52:53,004][INFO ][monitor.jvm ] [10.x.x.x] [gc][old][1186459][1207] duration [9.3s], collections [1]/[9.7s], total [9.3s]/[2.6h], memory [15.4gb]->[15.3gb]/[15.8gb], all_pools {[young] [493mb]->[423mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2017-03-14 12:53:01,937][INFO ][monitor.jvm ] [10.x.x.x] [gc][old][1186460][1208] duration [8.2s], collections [1]/[8.9s], total [8.2s]/[2.6h], memory [15.3gb]->[7gb]/[15.8gb], all_pools {[young] [423mb]->[4.4mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[7gb]/[14.9gb]}
My question is: why was Elasticsearch unable to free up the heap on its own, even though a huge chunk of it was occupied by garbage (as demonstrated by the jmap run)? Regarding JVM configuration, we follow Elastic's recommendations. Each data node runs with the following settings:
-Xms16g -Xmx16g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -XX:-HeapDumpOnOutOfMemoryError
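For context, on 2.x these settings do not come from a jvm.options file (that only exists from 5.x onwards); the heap size is exported as an environment variable and picked up by bin/elasticsearch.in.sh, which also adds the ParNew/CMS flags shown above. A minimal sketch of how this looks on our side (the file path is the Debian-package default and may differ per install):

# /etc/default/elasticsearch (or the equivalent environment file)
ES_HEAP_SIZE=16g        # expands to -Xms16g -Xmx16g
# ES_JAVA_OPTS could be used to append extra flags; the CMS settings above
# are the stock defaults shipped in bin/elasticsearch.in.sh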
Many of the ES data nodes showed heap usage above 95% for more than 3 hours, making the cluster slow and at times unavailable. We were able to mitigate the issue by clearing the caches, which brought heap usage back below 75%. We also disabled aggregations on the field that caused the issue.
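For anyone hitting the same problem, the two mitigation steps were roughly the following (the index, type, and field names are placeholders, and the mapping change is just one way on 2.x to keep fielddata for an analyzed string field from being loaded again):

# clear the fielddata caches cluster-wide; heap dropped below 75% after this
curl -XPOST 'http://localhost:9200/_cache/clear?fielddata=true'

# disable fielddata for the offending analyzed field so aggregations on it
# fail fast instead of loading fielddata onto the heap
# (my_index / my_type / message are hypothetical names)
curl -XPUT 'http://localhost:9200/my_index/_mapping/my_type' -d '
{
  "properties": {
    "message": {
      "type": "string",
      "fielddata": { "format": "disabled" }
    }
  }
}'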