Identical data nodes with widely different memory behaviour


(John Strömblom) #1

Hello,

For some reason the heap fills up very fast on two of our data nodes, while the other two behave normally.

I have a cluster with 9 nodes:
2 coordinator,
3 master,
4 data

ES Version: 6.4.0

Running on Ubuntu 16.04.4.

Java version:

openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

Our four data nodes have the same configuration, and the hardware is identical.
The hardware looks like this:
28 GB RAM,
6 CPUs,
SSD

ES is configured to use a 14 GB heap:

-Xms14g
-Xmx14g

And this is the GC configuration:

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

This graph shows heap usage:

So nodes 01 and 02 use up the heap slowly, and then GC kicks in at 75%.
Nodes 03 and 04, on the other hand, use up the heap fast, and then GC kicks in at 75%.
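
For reference, a minimal sketch of where that 75% mark sits for the settings quoted above. The function name is hypothetical. Note the simplifying assumption: the JVM applies CMSInitiatingOccupancyFraction to the old generation rather than the whole heap, so the point at which a collection shows up on a whole-heap graph will differ somewhat from this figure.

```python
def cms_trigger_gib(heap_gib: float, occupancy_fraction_pct: int) -> float:
    """Approximate occupancy (GiB) at which CMS starts a cycle.

    Assumption: treats the whole heap as the reference size, which
    overstates the real trigger because the fraction applies to the
    old generation only.
    """
    return heap_gib * occupancy_fraction_pct / 100


# -Xmx14g with -XX:CMSInitiatingOccupancyFraction=75
print(cms_trigger_gib(14, 75))  # 10.5
```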

03 and 04 also log a ton of these:

[2018-11-28T10:43:10,540][INFO ][o.e.m.j.JvmGcMonitorService] [escl01data04] [gc][141] overhead, spent [292ms] collecting in the last [1s]
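
To put a number on lines like the one above, here is a rough sketch that parses a GC overhead message and reports the share of the window spent collecting. The regular expression is written against this one sample line only and handles just the `ms` and `s` units; other formats would need adjusting. The function names are hypothetical.

```python
import re

# Pattern modelled on the single sample line from the post (an assumption,
# not an official description of the log format).
GC_OVERHEAD = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\[(?P<level>\w+)\s*\]\[(?P<logger>[^\]]+)\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*\[gc\]\[(?P<seq>\d+)\] overhead, "
    r"spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s)\] collecting in the last "
    r"\[(?P<window>[\d.]+)(?P<window_unit>ms|s)\]"
)


def to_ms(value: str, unit: str) -> float:
    """Convert a value in 'ms' or 's' to milliseconds."""
    return float(value) * (1000.0 if unit == "s" else 1.0)


def parse_gc_overhead(line: str):
    """Return node name and GC overhead ratio, or None if the line does not match."""
    m = GC_OVERHEAD.search(line)
    if m is None:
        return None
    spent_ms = to_ms(m["spent"], m["spent_unit"])
    window_ms = to_ms(m["window"], m["window_unit"])
    return {
        "node": m["node"],
        "spent_ms": spent_ms,
        "window_ms": window_ms,
        "overhead": spent_ms / window_ms,
    }


sample = (
    "[2018-11-28T10:43:10,540][INFO ][o.e.m.j.JvmGcMonitorService] "
    "[escl01data04] [gc][141] overhead, spent [292ms] collecting in the last [1s]"
)
result = parse_gc_overhead(sample)
print(f"{result['node']}: {result['overhead']:.1%} of the window spent in GC")
```

For the sample line this works out to 292 ms out of 1000 ms, so roughly 29% of that second went to garbage collection.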

All of our indices have 2 primary shards and 2 replicas, and these are evenly distributed over the four nodes.

Does anyone have any idea what might cause this?

I have compared jvm.options, elasticsearch.yml, Java version, Elasticsearch version and service configuration. All are identical.

I've also checked the number of connections to the nodes, and they look the same for each machine. Same with the thread count, which hovers around 500 threads.

And I've already tried restarting all nodes (not just the data nodes).


(Simon Willnauer) #2

Are you using _update requests for indexing?


(John Strömblom) #3

No. We just found the culprit. We had a bad aggregation running. The interesting part is that only one shard experienced the problem, so whichever node it was living on got the memory leak.


(Simon Willnauer) #4

Thanks for bringing closure. Still, is this agg a built-in one?


(John Strömblom) #5

This was the agg:

"Category": {
  "nested": {
    "path": "categories"
  },
  "aggs": {
    "Category": {
      "terms": {
        "field": "categories.categoryId",
        "size": 2147483647
      },
      "aggs": {
        "Name": {
          "terms": {
            "field": "categories.name.raw",
            "size": 2147483647
          }
        }
      }
    }
  }
}

The inner "Name" agg caused the problem.

The query returned 6500 docs, and the category ID agg returned 3600 buckets.
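
Both "size" values in the agg above are 2147483647, which is 2^31 - 1, the largest 32-bit signed integer. A minimal sketch of building the same aggregation with explicit, bounded sizes is shown here. The helper name and the chosen sizes (5000 and 10) are assumptions for illustration only, picked to sit above the 3600 buckets mentioned; whether bounding the sizes resolves the memory behaviour on this cluster is not confirmed by the thread.

```python
import json

INT_MAX = 2**31 - 1  # 2147483647, the value used for "size" in the original agg


def category_agg(category_size: int, name_size: int) -> dict:
    """Build the nested Category/Name terms agg with bounded bucket counts."""
    for size in (category_size, name_size):
        if not 0 < size < INT_MAX:
            raise ValueError("size must be positive and below 2147483647")
    return {
        "Category": {
            "nested": {"path": "categories"},
            "aggs": {
                "Category": {
                    "terms": {
                        "field": "categories.categoryId",
                        "size": category_size,
                    },
                    "aggs": {
                        "Name": {
                            "terms": {
                                "field": "categories.name.raw",
                                "size": name_size,
                            }
                        }
                    },
                }
            },
        }
    }


# Hypothetical sizes: 5000 category buckets, 10 names per category.
print(json.dumps(category_agg(5000, 10), indent=2))
```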