High heap and CPU utilization due to long and inefficient garbage collection (GC)

Hi all,
We constantly get circuit breaker errors and GC overhead logs in clusters under high indexing load. Looking at the Kibana dashboards and analyzing the logs, it is clear that GC is one of our main bottlenecks. We tried to tune the default GC config, but without success.
So we'd like to get hints from others who have already had similar issues.

We are using Elasticsearch 7.10.2 running on openjdk version "15.0.1".
By default, Elasticsearch uses G1GC for this JVM version. The only JVM args we set manually are the heap sizes: -Xms31232m -Xmx31232m.
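
For reference, an override like that can be placed in a small file under config/jvm.options.d/, which Elasticsearch 7.x picks up automatically (the file name below is just an example):

# config/jvm.options.d/heap.options  (example file name)
# Keeping Xms equal to Xmx and staying below ~32gb preserves compressed ordinary object pointers.
-Xms31232m
-Xmx31232m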

Data nodes details:

  • Number of data nodes: 3
  • Heap: 32gb (-Xms31232m -Xmx31232m)
  • Overall memory: 64gb
  • CPUs: 16
  • 86 indices
  • 594 shards
  • ~1.5B documents
  • 1.6 TB of data
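
For reference, figures like these can be cross-checked with the _cat APIs (host and port below assume the defaults):

curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,ram.max,cpu'    # heap, RAM and CPU per node
curl -s 'localhost:9200/_cat/indices?v&s=store.size:desc'            # indices, doc counts and store size
curl -s 'localhost:9200/_cat/shards?v' | wc -l                       # rough shard count (includes header line)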

Default GC config (including the flags set by ES via jvm.options):

./jdk/bin/java -Xms31232m -Xmx31232m -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
     uint ConcGCThreads                            = 3                                         {product} {ergonomic}
   size_t G1HeapRegionSize                         = 16777216                                  {product} {ergonomic}
    uintx MaxGCPauseMillis                         = 200                                       {product} {default}
   size_t MaxNewSize                               = 19646119936                               {product} {ergonomic}
    uintx NewRatio                                 = 2                                         {product} {default}
   size_t NewSize                                  = 1363144                                   {product} {default}
   size_t OldSize                                  = 5452592                                   {product} {default}
     uint ParallelGCThreads                        = 13                                        {product} {default}

Using the default JVM config, our cluster looks like this under heavy load:

The first attempt to tune GC was setting -XX:MaxGCPauseMillis=400 -XX:NewRatio=2:

./jdk/bin/java -Xms31232m -Xmx31232m -XX:MaxGCPauseMillis=400 -XX:NewRatio=2 -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
     uint ConcGCThreads                            = 3                                         {product} {ergonomic}
   size_t G1HeapRegionSize                         = 16777216                                  {product} {ergonomic}
    uintx MaxGCPauseMillis                         = 400                                       {product} {command line}
   size_t MaxNewSize                               = 10905190400                               {product} {ergonomic}
    uintx NewRatio                                 = 2                                         {product} {command line}
   size_t NewSize                                  = 1363144                                   {product} {default}
   size_t OldSize                                  = 5452592                                   {product} {default}
     uint ParallelGCThreads                        = 13                                        {product} {default}

It made our heap sit permanently at its peak and old-generation collections become much more frequent, probably because setting NewRatio=2 reduced the young generation size (which was a surprise to us; we expected it to increase). With NewRatio=2 on the command line, G1 caps the young generation at heap / (NewRatio + 1), i.e. roughly a third of the heap, instead of letting it grow ergonomically up to the default 60% (G1MaxNewSizePercent), which matches the drop in MaxNewSize above. As a consequence, lots of circuit breaker errors and GC overhead logs.
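
A quick sanity check of that arithmetic (the target byte values are the MaxNewSize figures from the PrintFlagsFinal output above; the small differences come from G1 rounding to whole 16mb regions):

# heap = 31232m = 32749125632 bytes
echo $(( 31232 * 1024 * 1024 * 60 / 100 ))   # 19649475379 ~ MaxNewSize 19646119936 (default 60% ergonomic cap)
echo $(( 31232 * 1024 * 1024 / 3 ))          # 10916375210 ~ MaxNewSize 10905190400 (NewRatio=2 cap)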

Then we decided to let the JVM size the young generation ergonomically, so we manually set only -XX:MaxGCPauseMillis=400.

./jdk/bin/java -Xms31232m -Xmx31232m -XX:MaxGCPauseMillis=400 -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
     uint ConcGCThreads                            = 3                                         {product} {ergonomic}
   size_t G1HeapRegionSize                         = 16777216                                  {product} {ergonomic}
    uintx MaxGCPauseMillis                         = 400                                       {product} {command line}
   size_t MaxNewSize                               = 19646119936                               {product} {ergonomic}
    uintx NewRatio                                 = 2                                         {product} {default}
   size_t NewSize                                  = 1363144                                   {product} {default}
   size_t OldSize                                  = 5452592                                   {product} {default}
     uint ParallelGCThreads                        = 13                                        {product} {default}

The idea was to reduce the frequency of GCs and make them more efficient. It solved the old-generation collection frequency issue, but young GC activity became very heavy again. As a consequence, again, lots of circuit breaker errors and GC overhead logs.
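
For reference, the GC activity described above can also be confirmed in the JDK unified GC logs; Elasticsearch 7.x enables them by default with flags along these lines (the exact path and rotation values may differ per install):

-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m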

Does anyone suspect anything else? Are there any other GC settings we should take into consideration?

Thanks in advance.

Elasticsearch 7.10 is EOL and no longer supported. Please upgrade ASAP.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

What is the output from the _cluster/stats?pretty&human API?
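
For reference, it can be fetched with a plain curl call (host and port below assume the defaults):

curl -s 'http://localhost:9200/_cluster/stats?pretty&human'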

Also, as the bot points out you are running a very old version and you need to upgrade.

Hi, here it is:

{
    "_nodes": {
        "total": 6,
        "successful": 6,
        "failed": 0
    },
    "cluster_name": "es",
    "cluster_uuid": "qbo1AooITfis4t-hPhtNVA",
    "timestamp": 1675675482059,
    "status": "green",
    "indices": {
        "count": 85,
        "shards": {
            "total": 562,
            "primaries": 281,
            "replication": 1.0,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 64,
                    "avg": 6.6117647058823525
                },
                "primaries": {
                    "min": 1,
                    "max": 32,
                    "avg": 3.3058823529411763
                },
                "replication": {
                    "min": 1.0,
                    "max": 1.0,
                    "avg": 1.0
                }
            }
        },
        "docs": {
            "count": 1458984322,
            "deleted": 218746906
        },
        "store": {
            "size_in_bytes": 1751768912033,
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 1370917956,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 0,
            "total_count": 0,
            "hit_count": 0,
            "miss_count": 0,
            "cache_size": 0,
            "cache_count": 0,
            "evictions": 0
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 8794,
            "memory_in_bytes": 1612354584,
            "terms_memory_in_bytes": 1154059024,
            "stored_fields_memory_in_bytes": 4909856,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 77515136,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 375870568,
            "index_writer_memory_in_bytes": 4206496266,
            "version_map_memory_in_bytes": 1289224,
            "fixed_bit_set_memory_in_bytes": 550059416,
            "max_unsafe_auto_id_timestamp": 1675383574505,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": [
                {
                    "name": "boolean",
                    "count": 303,
                    "index_count": 27
                },
                {
                    "name": "date",
                    "count": 144,
                    "index_count": 84
                },
                {
                    "name": "float",
                    "count": 136,
                    "index_count": 15
                },
                {
                    "name": "geo_point",
                    "count": 36,
                    "index_count": 36
                },
                {
                    "name": "geo_shape",
                    "count": 21,
                    "index_count": 21
                },
                {
                    "name": "join",
                    "count": 62,
                    "index_count": 62
                },
                {
                    "name": "keyword",
                    "count": 19416,
                    "index_count": 85
                },
                {
                    "name": "long",
                    "count": 14066,
                    "index_count": 85
                },
                {
                    "name": "nested",
                    "count": 470,
                    "index_count": 62
                },
                {
                    "name": "object",
                    "count": 5652,
                    "index_count": 85
                },
                {
                    "name": "text",
                    "count": 5892,
                    "index_count": 85
                }
            ]
        },
        "analysis": {
            "char_filter_types": [
                {
                    "name": "mapping",
                    "count": 80,
                    "index_count": 80
                }
            ],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [
                {
                    "name": "custom",
                    "count": 80,
                    "index_count": 80
                }
            ],
            "built_in_char_filters": [
                {
                    "name": "icu_normalizer",
                    "count": 80,
                    "index_count": 80
                }
            ],
            "built_in_tokenizers": [
                {
                    "name": "icu_tokenizer",
                    "count": 80,
                    "index_count": 80
                }
            ],
            "built_in_filters": [
                {
                    "name": "icu_folding",
                    "count": 80,
                    "index_count": 80
                }
            ],
            "built_in_analyzers": []
        }
    },
    "nodes": {
        "count": {
            "total": 6,
            "coordinating_only": 0,
            "data": 3,
            "data_cold": 3,
            "data_content": 3,
            "data_hot": 3,
            "data_warm": 3,
            "ingest": 6,
            "master": 3,
            "ml": 0,
            "remote_cluster_client": 6,
            "transform": 3,
            "voting_only": 0
        },
        "versions": [
            "7.10.2"
        ],
        "os": {
            "available_processors": 54,
            "allocated_processors": 54,
            "names": [
                {
                    "name": "Linux",
                    "count": 6
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "CentOS Linux 8",
                    "count": 6
                }
            ],
            "mem": {
                "total_in_bytes": 231928233984,
                "free_in_bytes": 11414028288,
                "used_in_bytes": 220514205696,
                "free_percent": 5,
                "used_percent": 95
            }
        },
        "process": {
            "cpu": {
                "percent": 11
            },
            "open_file_descriptors": {
                "min": 426,
                "max": 2865,
                "avg": 1636
            }
        },
        "jvm": {
            "max_uptime_in_millis": 240679452,
            "versions": [
                {
                    "version": "15.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "15.0.1+9",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 6
                }
            ],
            "mem": {
                "heap_used_in_bytes": 58381384736,
                "heap_max_in_bytes": 111132278784
            },
            "threads": 443
        },
        "fs": {
            "total_in_bytes": 5629445271552,
            "free_in_bytes": 3819540131840,
            "available_in_bytes": 3819439468544
        },
        "plugins": [
            {
                "name": "repository-azure",
                "version": "7.10.2",
                "elasticsearch_version": "7.10.2",
                "java_version": "1.8",
                "description": "The Azure Repository plugin adds support for Azure storage repositories.",
                "classname": "org.elasticsearch.repositories.azure.AzureRepositoryPlugin",
                "extended_plugins": [],
                "has_native_controller": false
            },
            {
                "name": "analysis-icu",
                "version": "7.10.2",
                "elasticsearch_version": "7.10.2",
                "java_version": "1.8",
                "description": "The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.",
                "classname": "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
                "extended_plugins": [],
                "has_native_controller": false
            },
            {
                "name": "repository-s3",
                "version": "7.10.2",
                "elasticsearch_version": "7.10.2",
                "java_version": "1.8",
                "description": "The S3 repository plugin adds S3 repositories",
                "classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
                "extended_plugins": [],
                "has_native_controller": false
            },
            {
                "name": "repository-gcs",
                "version": "7.10.2",
                "elasticsearch_version": "7.10.2",
                "java_version": "1.8",
                "description": "The GCS repository plugin adds Google Cloud Storage support for repositories.",
                "classname": "org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin",
                "extended_plugins": [],
                "has_native_controller": false
            }
        ],
        "network_types": {
            "transport_types": {
                "netty4": 6
            },
            "http_types": {
                "netty4": 6
            }
        },
        "discovery_types": {
            "zen": 6
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "docker",
                "count": 6
            }
        ],
        "ingest": {
            "number_of_pipelines": 0,
            "processor_stats": {}
        }
    }
}

A few things:


Actually, we define the heap as 31232m, which is not precisely 32gb.
I can see startup logs like heap size [30.5gb], compressed ordinary object pointers [true], which I suppose indicates we are under the threshold (see the quick check below).
We are planning to upgrade the ES version, but unfortunately that will not happen soon, as it will require a big effort.
That's why we are trying to find settings to tune to improve cluster stability as soon as possible.
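
For completeness, the compressed-oops status can also be verified directly against the bundled JDK with the same PrintFlagsFinal approach used above:

./jdk/bin/java -Xms31232m -Xmx31232m -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops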

Regarding the GC config, do you think the default one is a good fit for us, considering our ES and JVM versions as well as our cluster size? Are there any other JVM settings you would recommend trying to better configure the GC?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.