Probable memory leak: Heap utilization stuck at ~max heap for idle cluster

Cluster setup: Version 7.1.1
3 master nodes
3 data nodes (16 cores, 64GB RAM, Xms=Xmx=12GB each)
2 coordinating nodes
650 indices with one shard each.
Total data = ~2TB

Query:
All my data nodes are stuck at ~10GB heap usage, even though no searching or indexing is being done.
It's unclear to me why the data nodes occupy ~10GB of heap in an idle state.

  1. The heap dump says class "B" occupies 7.77GB of space (see the attached heap dump screenshot).

  2. Even the accounting circuit breaker gives a higher estimate, but that is a separate question. Node stats:
    https://del.dog/yaqamoteyu.json

  3. The thread dump says nothing either; none of the threads seem to be doing any kind of work.

Shreyash,

First question: how have you configured replication for this cluster? That has a large effect on how much data is being stored per node.
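If you're not sure off the top of your head, one quick way to check is a _cat request along these lines, which lists the primary and replica counts (and store size) per index:

GET _cat/indices?v&h=index,pri,rep,store.size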

Perhaps you have already read the blog post "How many shards should I have in my Elasticsearch cluster?", but if not, it is probably worth a look.

Each shard has data that need to be kept in memory and use heap space. This includes data structures holding information at the shard level, but also at the segment level in order to define where data reside on disk. The size of these data structures is not fixed and will vary depending on the use-case. […] The more heap space a node has, the more data and shards it can handle.

Indices and shards are therefore not free from a cluster perspective, as there is some level of resource overhead for each index and shard.

In other words, you should expect that shards will use heap space even when they are not actively being queried or written to. Note that your heap dump shows many Elasticsearch "CacheSegment" objects. These may simply be what is required by your indices' mappings.
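If you want a rough picture of how that per-index overhead is distributed, something like the following should show the segment count and segment memory for each index, largest first:

GET _cat/indices?v&h=index,segments.count,segments.memory&s=segments.memory:desc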

The reason I asked about replication is that it looks like your shard sizes are within the guidelines recommended by the "How many shards?" blog post, but if you have enabled replication, you may have more shards per node than recommended. It's very hard to say in the abstract, since so much depends on the data and mappings.

You may be interested in Index Lifecycle Management (ILM), a feature that was released after that blog post was written. If you have "old" indices that do not need to be queried frequently, you can let the cluster "freeze" them, which reduces the amount of heap space they use.
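For reference, in 7.x freezing is a single API call per index (the index name below is just a placeholder), and there is a matching _unfreeze call to reverse it:

POST my-old-index/_freeze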

Does this mean that your cluster's heap usage is expected behavior? In truth, I do not know. It would help to know a little bit more about your index replication policy and your use case. Do all or most of the 650 indices have the same mappings, like you would see if you were indexing logs and breaking up your data by time?

I hope some of this is helpful to you.

-William


First of all, thanks for the detailed reply; I appreciate it.

Yes, I have read the blog post.

Yes, the replication is configured to a factor of 2.

My bad; to be specific, there are 216 indices with 2 replicas each, so each data node has 216 shards, 648 in all. That is in line with the per-node guidance in the blog post.
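For reference, the per-node shard count and data size can be confirmed with something like:

GET _cat/allocation?v&h=node,shards,disk.indices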

Yes, all my indices have the same mapping. They are not time-based; once indexed, the data is not modified, only searched.

I am exploring the ILM/freeze option.

Question:
The 216 shards allocated per node are well within the range proposed in the blog post, so it is still not clear why each data node is occupying ~10GB of heap.
The CacheSegment usage you can see is just 25MB per node.
Is there a way to know what class B is?

I have the output of /_segments; I summed up all the "memory_in_bytes" values from all nodes, which came to 23GB. Is this field the same as the memory used up in heap by the segments?

Have a look at this webinar, which talks about optimizing for storage. This documentation is also a useful resource.

Shreyash,

Sorry for the delay in responding here.

I suspect that the class B[] in the heap dump is a byte array. J[] is an array of longs, and S[] is an array of shorts. (Those codes are described in the javadocs for java.lang.Class#getName.)

I see this in the docs:

Segments need to store some data into memory in order to be searchable efficiently. This number returns the number of bytes that are used for that purpose.

…so I believe that the answer to your second question is yes, the memory_in_bytes value shows how much heap space the segment is using.
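As a rough cross-check, something like the following _cat request should show each node's total segment memory next to its current heap usage:

GET _cat/nodes?v&h=name,heap.current,heap.percent,segments.memory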

-William

Hi William,

I have the same problem.
The total heap consumed is 10 GB, of which 8 GB is consumed by segments (terms bytes).
I am not sure where the remaining 2 GB is going. As you mentioned, the heap dump shows the long and short array classes consuming about 800 MB and 200 MB respectively.

Can you please help me understand what other information (the long and short arrays) Elasticsearch is holding in heap?

Thanks for the confirmation, @William_Brafford.
Is there a way to see what memory_in_bytes actually holds?

From @Christian_Dahlqvist's suggestions, I have given some thought to the two points below:

  1. norms=false - I don't have this on my fields in the mappings. I am going to add it right away, but are these norms actually part of the metadata that is stored in memory?

  2. index_options="docs" - After running some sample tests by creating docs, I noticed that term frequencies and positional information are generated in the indices; they are just not used while calculating scores. So how will this reduce the metadata stored in memory? In any case, I am going to apply this optimization too (sample mapping below).
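For illustration, a mapping along these lines (the index and field names are just placeholders) is what I have in mind to disable norms and drop frequencies/positions on a text field:

PUT my-optimized-index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "norms": false,
        "index_options": "docs"
      }
    }
  }
}

As far as I understand, index_options cannot be changed on an existing field, so applying this to existing indices would mean reindexing.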

@William_Brafford: Did you get a chance to look into this?

@pokaleshrey,

I'm sorry for the delay in getting back to you.

Is there a way to see what memory_in_bytes actually holds?

There are two options that I know of. The index _stats API gives a high-level summary of what the index is holding in memory. From my test cluster:

GET kibana_sample_data_flights/_stats

{
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      [...]
      "segments" : {
        "count" : 10,
        "memory_in_bytes" : 81110,
        "terms_memory_in_bytes" : 53288,
        "stored_fields_memory_in_bytes" : 4040,
        "term_vectors_memory_in_bytes" : 0,
        "norms_memory_in_bytes" : 0,
        "points_memory_in_bytes" : 886,
        "doc_values_memory_in_bytes" : 22896,
        "index_writer_memory_in_bytes" : 0,
        "version_map_memory_in_bytes" : 0,
        "fixed_bit_set_memory_in_bytes" : 0,
        "max_unsafe_auto_id_timestamp" : 1563803176248,
        "file_sizes" : { }
      },
      [...]
    },
   [...]
  },
  [...]
}

If you want to see Lucene internals on a shard-by-shard basis, use the verbose=true option on an index's _segments endpoint:

GET kibana_sample_data_flights/_segments?verbose=true

{
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "indices" : {
    "kibana_sample_data_flights" : {
      "shards" : {
        "0" : [
          {
            "routing" : {
              "state" : "STARTED",
              "primary" : false,
              "node" : "7MIorM8CQ8SU0OB4ySU3FQ"
            },
            "num_committed_segments" : 10,
            "num_search_segments" : 10,
            "segments" : {
              "_0" : {
                "generation" : 0,
                "num_docs" : 500,
                "deleted_docs" : 0,
                "size_in_bytes" : 290988,
                "memory_in_bytes" : 7256,
                "committed" : true,
                "search" : true,
                "version" : "8.0.0",
                "compound" : true,
                "ram_tree" : [
                  {
                    "description" : "postings [PerFieldPostings(segment=_0 formats=1)]",
                    "size_in_bytes" : 4924,
                    "children" : [...]
                  },
                  {
                    "description" : "docvalues [PerFieldDocValues(formats=1)]",
                    "size_in_bytes" : 1964,
                    "children" : [...]
                  },
                  {
                    "description" : "stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]",
                    "size_in_bytes" : 344,
                    "children" : [...]
                  },
                  {
                    "description" : "points [org.apache.lucene.codecs.lucene60.Lucene60PointsReader@4124869]",
                    "size_in_bytes" : 24,
                    "children" : [...]
                  }
                ],
                "attributes" : {
                  "Lucene50StoredFieldsFormat.mode" : "BEST_SPEED"
                }
              },
              [...]
            }
          }
        ]
      }
    }
  }
}

Note that the RAM tree may contain a lot of nested data, so be ready for a lot of output from this command; the output for my simple test index was about 6,000 lines of pretty-printed JSON. If you look at my example, though, the top-level entries in the ram_tree add up to the memory_in_bytes value. Unfortunately, this extra verbose output comes from a Lucene API, and I am not an expert at interpreting it.
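If the full verbose output is too much, response filtering can help; something like the following should keep just the memory figures and the RAM trees:

GET kibana_sample_data_flights/_segments?verbose=true&filter_path=**.memory_in_bytes,**.ram_tree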

I hope this is helpful.

-William

Thanks @William_Brafford, that should be of great help, even if the output is big. We will set aside some time to interpret it. It has become very important for us to keep heap utilization to a minimum; if we can't, our next step will be to scale out the infrastructure and reduce node sizes.
