Identical data nodes with widely different memory behaviour

stromblom · November 28, 2018, 10:13am

Hello,

For some reason the heap fills very fast on two of our data nodes, while the other two behave normally.

I have a cluster with 9 nodes:
2 coordinator,
3 master,
4 data

ES Version: 6.4.0

Running on ubuntu 16.04.4.

Java version:

openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

Our four data nodes have the same configuration, and the hardware is identical.
The hardware looks like this:
28 GB RAM,
6 CPUs,
SSD

ES is configured to use a 14 GB heap:

-Xms14g
-Xmx14g

And this is the GC configuration:

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
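
As a rough check of what these flags imply, the arithmetic can be sketched as below. The helper name `parse_size_gib` is hypothetical, and the result is only an upper bound: `CMSInitiatingOccupancyFraction` applies to the old generation rather than the whole heap, so the actual trigger point depends on how the heap is split between generations.

```python
def parse_size_gib(flag: str) -> float:
    """Parse a -Xms/-Xmx flag such as '-Xmx14g' into GiB (k, m, g suffixes only)."""
    value = flag[4:].lower()          # strip the '-Xms' / '-Xmx' prefix
    units = {"k": 1 / (1024 * 1024), "m": 1 / 1024, "g": 1.0}
    return float(value[:-1]) * units[value[-1]]


heap_gib = parse_size_gib("-Xmx14g")
occupancy_fraction = 75 / 100         # -XX:CMSInitiatingOccupancyFraction=75

# Upper bound: treats the whole heap as if it were old generation.
threshold_gib = heap_gib * occupancy_fraction
print(f"CMS starts at no more than {threshold_gib:.1f} GiB of {heap_gib:.0f} GiB")
```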

This graph shows heap usage:

So nodes 01 and 02 use up the heap slowly, and then GC kicks in at 75%.
Nodes 03 and 04, on the other hand, use up the heap fast, and then GC kicks in at 75%.

03 and 04 also log a ton of these:

[2018-11-28T10:43:10,540][INFO ][o.e.m.j.JvmGcMonitorService] [escl01data04] [gc][141] overhead, spent [292ms] collecting in the last [1s]
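
To put a number on these messages, the line above can be parsed and the share of time spent in GC computed. This is a minimal sketch: the regular expression is written against this one log line only and assumes the durations are given in `ms` or `s`; the function names are made up for illustration.

```python
import re

LINE = (
    "[2018-11-28T10:43:10,540][INFO ][o.e.m.j.JvmGcMonitorService] "
    "[escl01data04] [gc][141] overhead, spent [292ms] collecting in the last [1s]"
)

# Assumption: durations appear as a number followed by 'ms' or 's'.
PATTERN = re.compile(
    r"\[(?P<node>[^\]]+)\] \[gc\]\[\d+\] overhead, "
    r"spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s)\] "
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s)\]"
)


def to_ms(value: str, unit: str) -> float:
    """Convert a duration to milliseconds."""
    return float(value) * (1000.0 if unit == "s" else 1.0)


def gc_overhead(line: str):
    """Return (node, fraction of the window spent collecting), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    spent = to_ms(m["spent"], m["spent_unit"])
    window = to_ms(m["window"], m["window_unit"])
    return m["node"], spent / window


node, fraction = gc_overhead(LINE)
print(f"{node}: {fraction:.1%} of the window spent in GC")
```

For the line above this works out to 292 ms out of 1000 ms, so the node spent roughly 29% of that second collecting.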

All of our indices have 2 primary shards and 2 replicas, and these are evenly distributed over the four nodes.

Anyone have any idea what might cause this?

I have compared jvm.options, elasticsearch.yml, Java version, Elasticsearch version and service configuration. All are identical.

I've also checked the number of connections to the nodes, and it looks the same for each machine. Same with thread count, hovering around 500 threads.

And I've already tried restarting all nodes (not just the data nodes).

s1monw · November 30, 2018, 10:39am

Are you using _update requests for indexing?

stromblom · November 30, 2018, 10:48am

No. We just found the culprit. We had a bad aggregation running. The interesting part is that it was one shard that experienced the problem. So whatever node it was living on got the memory leak.

s1monw · November 30, 2018, 10:50am

Thanks for bringing closure. Still, is this agg a built-in one?

stromblom · November 30, 2018, 10:50am

This was the agg:

"Category": {
  "nested": {
    "path": "categories"
  },
  "aggs": {
    "Category": {
      "terms": {
        "field": "categories.categoryId",
        "size": 2147483647
      },
      "aggs": {
        "Name": {
          "terms": {
            "field": "categories.name.raw",
            "size": 2147483647
          }
        }
      }
    }
  }
}

The name agg caused the problem.

The query returned 6500 docs, and the category id agg returned 3600 buckets.
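
Both `size` values in the request are 2147483647, which is the maximum 32-bit signed integer (2^31 - 1). The thread does not say how the aggregation was changed afterwards. As a sketch only, the same request body can be built with the two sizes as explicit parameters so that bounded values can be tried; the function name `category_agg` and the values 5000 and 10 are illustrative assumptions, not taken from the thread.

```python
import json

INT_MAX = 2**31 - 1  # 2147483647, the size used in the original request


def category_agg(id_size: int, name_size: int) -> dict:
    """Build the nested Category/Name terms aggregation with explicit sizes."""
    return {
        "Category": {
            "nested": {"path": "categories"},
            "aggs": {
                "Category": {
                    "terms": {"field": "categories.categoryId", "size": id_size},
                    "aggs": {
                        "Name": {
                            "terms": {
                                "field": "categories.name.raw",
                                "size": name_size,
                            }
                        }
                    },
                }
            },
        }
    }


# The request as posted in the thread.
original = category_agg(INT_MAX, INT_MAX)

# Hypothetical bounded variant: 5000 is above the 3600 category buckets
# reported; 10 names per category is an arbitrary illustrative cap.
bounded = category_agg(id_size=5000, name_size=10)
print(json.dumps(bounded, indent=2))
```

Whether bounding the sizes alone would have removed the heap pressure on the affected shard is not confirmed anywhere in this thread.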

