Elasticsearch 5.x goes to YELLOW state more frequently due to high GC activity

We recently moved from ES 2.x to ES 5.4.1 and are seeing increased GC activity on the data nodes. Young GC happens frequently, and each cycle takes just under a second. Because of all the GC, data nodes become unresponsive to pings from the master node and drop out of the cluster, only to rejoin within a few minutes. In the meantime the cluster goes to YELLOW state, and it takes time to recover the replica shards. One mitigation would be to increase the value of the setting "index.unassigned.node_left.delayed_timeout", but we want to make changes that reduce the GC pressure itself. This could be due to a lot of factors, and we're looking at them one by one. For the purpose of this topic, we want to make sure our GC options are set correctly. For a data node with 28 GB RAM, we've set the following options:

The last setting sets the size of the new (young) generation to 4 GB. We've used this setting successfully on ES 2.x clusters and have carried it over to our ES 5.x clusters as well. With ES 2.x, we were using Java 1.7. With ES 5.x, we're using Java 1.8. My question is: do we need to set this option at all, or can we rely on the default settings? Also, will enabling the GC logging settings that are commented out by default in the jvm.options file help here? Let me know if more data is required.
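For reference, the delayed-allocation mitigation mentioned above could be applied through the index settings update API. The following is only a minimal sketch: the host `http://localhost:9200`, the `5m` value, and the helper name `build_delayed_timeout_request` are assumptions for illustration, not values from our cluster. It only builds the request and does not send it.

```python
import json
import urllib.request


def build_delayed_timeout_request(base_url, timeout="5m", index="_all"):
    """Build (but do not send) a PUT request that sets the delayed-allocation timeout.

    base_url, timeout and index are illustrative placeholders.
    """
    body = {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_delayed_timeout_request("http://localhost:9200")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To apply it against a real cluster: urllib.request.urlopen(req)
```

This only buys time before replicas are reallocated after a node drops out; it does nothing about the GC pressure that causes the node to drop.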

You should unset the -Xmn option and leave it at its default value.

Regarding memory pressure, how many shards per node does your cluster have? What is the average number of fields per mapping? Also, can you post one of the log messages that shows a young GC?
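One way to get the shards-per-node figure is to count the rows returned by `GET _cat/shards?format=json`. The sketch below assumes each row is a dict with a `node` key, as that endpoint returns; the function name `shards_per_node` and the sample rows are made up for illustration.

```python
from collections import Counter


def shards_per_node(cat_shards_rows):
    """Count shards (primaries and replicas) per node.

    cat_shards_rows: parsed JSON from `GET _cat/shards?format=json`.
    Unassigned shards have no node and are skipped. Relocating shards,
    whose node field may combine source and target, are not handled specially.
    """
    counts = Counter()
    for row in cat_shards_rows:
        node = row.get("node")
        if node:
            counts[node] += 1
    return dict(counts)


# Hypothetical sample rows, not taken from a real cluster.
sample = [
    {"index": "idx-a", "shard": "0", "prirep": "p", "state": "STARTED", "node": "data-1"},
    {"index": "idx-a", "shard": "0", "prirep": "r", "state": "STARTED", "node": "data-2"},
    {"index": "idx-b", "shard": "0", "prirep": "p", "state": "STARTED", "node": "data-1"},
    {"index": "idx-b", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "node": None},
]
print(shards_per_node(sample))
```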

Sorry for the late reply, @thiago. Are you suggesting removing the -Xmn option only because it is not a default setting, or do you have something else in mind? I'm asking because we ran into similar GC issues on ES 2.x clusters a long time back, and at that point we decided to set -Xmn to roughly 1/4th of the memory allocated to the heap. That brought stability to our clusters. On moving to ES 5.x, we did not change this setting. Hence, I'm a little skeptical about simply unsetting this option unless something has changed in Java 1.8 that would improve GC with this option unset. That said, the documents we index in ES 5.x are roughly double the size of those in ES 2.x. That could also be a factor in the increased GC.
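The "roughly 1/4th of heap" rule above is simple arithmetic. As a rough sketch (the helper names and the 0.25 default are illustrative assumptions, and the actual heap size is not stated in this post):

```python
def implied_heap_gb(young_gen_gb, young_fraction=0.25):
    """Heap size implied by a young generation sized at a fixed fraction of it."""
    return young_gen_gb / young_fraction


def xmn_flag(heap_gb, young_fraction=0.25):
    """Format a -Xmn flag for a young generation at the given fraction of the heap,
    rounded down to whole gigabytes."""
    young_gb = int(heap_gb * young_fraction)
    return f"-Xmn{young_gb}g"


print(implied_heap_gb(4))
print(xmn_flag(16))
```

Under that rule, a 4 GB young generation corresponds to a heap of about 16 GB.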

We have around 50-70 shards per data node (including primaries and replicas). We have two mappings: one has around 10-15 fields, while the other has around 300-500 fields. Each index has only one mapping. The index whose mapping has hundreds of fields has very small shards (not exceeding 5-10 GB). The index whose mapping has tens of fields has huge shards (ranging from 10 to 500 GB). We're working on redistributing the data so that these shard sizes are reduced.

Below are log snippets from around the time a data node left the cluster and soon rejoined, leaving the cluster in YELLOW state:


If you're interested, the elasticsearch.yml and jvm.options files of the data node are at https://gist.github.com/bittusarkar/5c11174fe37db40687541b6975f3bd31

@thiago Do you have enough data to answer my queries?

Yes, I recommended removing the -Xmn setting because we don't fully test with that setting. But @jasontedor may have a deeper explanation.
