Elasticsearch Garbage Collection issue

Elasticsearch version: 1.7.3 - 2.3.3

JVM version: "1.8.0_72"

OS version: Ubuntu 14.04

Description of the problem including expected versus actual behavior:

Running ES on a 5-node cluster where each node has 15GB of RAM. The max heap size is set to 7GB on each server. We are using the searchguard-ssl, stats, and head plugins.
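For reference, the heap is set via the standard environment variable; with the Ubuntu deb package that normally lives in /etc/default/elasticsearch (the exact path is an assumption about this install):

ES_HEAP_SIZE=7g    # sets both -Xms and -Xmx for ES 1.x/2.x; kept below ~50% of the 15GB of RAM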

This is a screenshot of the heap usage on each node over a 10-day period. You can see the normal saw-tooth pattern, but the bottom of each cycle keeps rising: the garbage collector seems to free less and less memory each time it runs. I was having a similar issue on ES 1.7.3 and hoped that bumping to 2.3.3 would alleviate it, but no such luck.
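For anyone following along, the same heap numbers the screenshot shows can be spot-checked straight from the node stats API (the grep over the pretty output is just a quick-and-dirty filter):

GET /_nodes/stats/jvm

curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_used_percent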

Once heap usage on the nodes reaches ~6GB we do a rolling restart of the cluster, which resets the whole cycle.

I recently posted this issue on GitHub (https://github.com/elastic/elasticsearch/issues/19544) and was directed to the page on limiting memory usage (https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html). The documentation leads me to believe that fielddata is the likely culprit behind our endlessly growing heap; however, the totals for fielddata usage don't add up.
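For context, the knobs that page discusses look roughly like this in elasticsearch.yml; the values below are illustrative, not a recommendation for this cluster:

indices.fielddata.cache.size: 30%      # caps the fielddata cache (unbounded by default)
indices.breaker.fielddata.limit: 60%   # fielddata circuit breaker (60% is the 2.x default)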

When I run GET /_stats/fielddata?fields=* on my cluster I get:

"total":{
         "fielddata":{
            "memory_size_in_bytes":4739127224,
            "evictions":0,
            "fields":{
               "@tag":{
                  "memory_size_in_bytes":42626416
               },
               "user_id":{
                  "memory_size_in_bytes":14236968
               },
               "time":{
                  "memory_size_in_bytes":8049808
               },
               "buildNum":{
                  "memory_size_in_bytes":816
               },
               "created":{
                  "memory_size_in_bytes":1490114912
               },
               "@type":{
                  "memory_size_in_bytes":21128028
               },
               "@timestamp":{
                  "memory_size_in_bytes":3162970276
               }
            }
         }
      }

Based on these results, the total amount of memory used for fielddata is ~4.7GB. From the per-node results I assume this total is spread across all nodes in the cluster. But 4.7GB out of a total heap of 35GB (7GB per node across 5 nodes) doesn't seem like it could be the sole cause of such steep growth in our heap.
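One way to confirm how that total is distributed is to request the same statistic per node instead of summed per index; both of these are standard APIs:

GET /_nodes/stats/indices/fielddata?fields=*
GET /_cat/fielddata?v&fields=*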

Is there something else I may be overlooking with regard to fielddata's effect on heap usage?


Are you using doc values?

Hey,

I have a number of old indices that are definitely not using doc_values. My plan is to create new indices and re-index the old ones into them. I understand that by default new indices will use doc_values for all fields that are not analyzed. I will post back here with the results.

Is there a way that I can verify if doc_values are being used for a given field?

Fields that are not analyzed get doc_values by default.

Explicitly ask for them with "doc_values": true in the mapping. I'm not sure of another way offhand.
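A sketch of both halves of that, assuming made-up index and template names: checking the stored mapping for one field (defaults generally aren't echoed back by the mapping API, which is presumably why setting it explicitly helps), and a template that maps new string fields as not_analyzed with doc_values on:

GET /logstash-2016.07.25/_mapping/field/user_id

PUT /_template/logstash_doc_values
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings_not_analyzed": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        }
      ]
    }
  }
}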

Hey,

So I was definitely not using doc_values for the majority of my indices and documents. I didn't have templates set up for my logstash indices, so most of the fields defaulted to analyzed strings. Over the course of the last week I re-indexed all the indices in my staging environment and set up templates that default any new string field to not_analyzed.

The re-index process itself was pretty easy - love the new reindex API.
As a result of the re-indexing I have saved about 25% of disk space on my logstash indices (which go back ~2 months). However, I am still seeing the garbage collector reclaim smaller portions of the heap with each subsequent collection.
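For reference, a minimal reindex call looks roughly like this (index names are placeholders); the new template has to be in place before the destination index is created so the not_analyzed/doc_values mappings take effect:

POST /_reindex
{
  "source": { "index": "logstash-2016.06.01" },
  "dest":   { "index": "logstash-2016.06.01-reindexed" }
}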

I've only got about 4 days' worth of data, but I'm concerned because I was hoping that changing 98% of the data to not_analyzed (and therefore to doc_values by default) would have improved these symptoms.

Anything else that I may be missing? Happy to post any other cluster performance data if that would be helpful.

cheers

At this point I'm not sure what it could be. The best thing might be to take a heap dump of the Java process and crack it open in something like MAT.
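A minimal way to capture such a dump with stock JDK tooling, run as the same user as the ES process (the pid and output path are placeholders; note that :live forces a full GC first):

jps -l                                                    # find the Elasticsearch pid
jmap -dump:live,format=b,file=/tmp/es-heap.hprof <pid>    # write a heap dump of live objects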

I've taken two heap dumps (about 7-8 hours apart) and imported them into VisualVM. Screenshots of the histograms are below (in the order they were taken):

All I can tell so far is that byte[] is the largest and fastest-growing class in terms of both size and instance count.

I will continue to take heap dumps over the next few days and keep comparing them against the older dumps to try to locate where the largest leak is happening.
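As a cheaper complement to full dumps, a class histogram from the running JVM can be diffed between snapshots (pid is a placeholder; this shows instance counts and shallow sizes, not retained sizes):

jmap -histo:live <pid> | head -n 30    # top classes by instance count and shallow bytes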

Please let me know if there are any classes that I should pay particular attention to as likely suspects.

cheers

We've taken a few more heap dumps over the last 24 hours. One thing is beginning to stand out.

The TaskManager map keeps growing; it now holds 1.5 million instances of ReplicationTask.

We've looked into a few of the task objects and can see they have phase = finished.

It seems like the finished tasks are not being removed, or the tasks are somehow getting stuck in a weird state and therefore never reaped.

We will keep taking heap dumps, but so far this seems like the likely culprit.

cheers

Just attempted to use the task API to pull down some information about running tasks, and that OOM'ed my ES process (I assume because there are so many tasks in that map). Again, this leads me to believe something is wonky with the task manager.
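For reference, the 2.3 task management API calls look roughly like this; with ~1.5 million retained tasks even the plain listing can be expensive, and detailed=true more so:

GET /_tasks                  # tasks currently registered on each node
GET /_tasks?detailed=true    # includes per-task descriptions (heavier)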

Yeah, that seems bad. Can you post an example of an older task? I'll open a bug and link this issue.

Sorry I didn't reply to you yesterday - I honestly hadn't been paying attention. Sorry!

I opened

Thanks for opening the issue @nik9000! I will keep updating over on GitHub moving forward.

For the benefit of future readers: the issue was tracked down to a memory leak in a third-party plugin, search-guard-ssl. For more details, see the GitHub issue linked by @nik9000 above.
