Hi all,
We constantly get circuit breaker errors and GC overhead logs in clusters under high indexing load. Looking at the Kibana dashboards and analyzing the logs, it is clear that GC is one of our main bottlenecks. We tried to tune the default GC config, but without success.
So we'd like to hear hints from anyone who has run into similar issues.
We are using Elasticsearch 7.10.2 running on openjdk version "15.0.1".
By default, Elasticsearch uses G1GC for this JVM version. The only JVM arg we manually set is the heap size, which is -Xms31232m -Xmx31232m.
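In case the mechanics matter, the heap override itself is just a small file dropped into jvm.options.d (the path and file name below are only how we happen to organise it, nothing Elasticsearch requires):
# /etc/elasticsearch/jvm.options.d/heap.options
-Xms31232m
-Xmx31232m
Everything else (UseG1GC, G1ReservePercent, InitiatingHeapOccupancyPercent) comes from the stock jvm.options shipped with 7.10.2.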
Data node details:
- Number of data nodes: 3
- Heap: ~30.5gb (-Xms31232m -Xmx31232m)
- Overall memory: 64gb
- CPUs: 16
- 86 indices
- 594 shards
- 1.5M documents
- 1.6 TB of data
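(If it helps to reproduce what we are seeing, the per-node heap and parent circuit breaker figures can also be pulled with plain node stats calls; host/port below are placeholders:)
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent'
curl -s 'localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers.parent'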
Default GC config (including the flags ES sets via jvm.options):
./jdk/bin/java -Xms31232m -Xmx31232m -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
uint ConcGCThreads = 3 {product} {ergonomic}
size_t G1HeapRegionSize = 16777216 {product} {ergonomic}
uintx MaxGCPauseMillis = 200 {product} {default}
size_t MaxNewSize = 19646119936 {product} {ergonomic}
uintx NewRatio = 2 {product} {default}
size_t NewSize = 1363144 {product} {default}
size_t OldSize = 5452592 {product} {default}
uint ParallelGCThreads = 13 {product} {default}
Using the default JVM config, our cluster looks like this under heavy load:
[monitoring dashboard screenshot]
Our first attempt at tuning was to set -XX:MaxGCPauseMillis=400 -XX:NewRatio=2:
./jdk/bin/java -Xms31232m -Xmx31232m -XX:MaxGCPauseMillis=400 -XX:NewRatio=2 -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
uint ConcGCThreads = 3 {product} {ergonomic}
size_t G1HeapRegionSize = 16777216 {product} {ergonomic}
uintx MaxGCPauseMillis = 400 {product} {command line}
size_t MaxNewSize = 10905190400 {product} {ergonomic}
uintx NewRatio = 2 {product} {command line}
size_t NewSize = 1363144 {product} {default}
size_t OldSize = 5452592 {product} {default}
uint ParallelGCThreads = 13 {product} {default}
It made the heap sit constantly at its peak and old GCs run much more frequently, probably because setting NewRatio=2 shrank the young generation (which was a surprise to us; we expected it to grow). As far as we understand, with G1 an explicit NewRatio fixes the young generation at heap / (NewRatio + 1), which is smaller than the 60% ceiling the ergonomic sizing was using before, and it also overrides the adaptive young-generation sizing that the pause-time goal relies on. As a consequence, lots of circuit breaker errors and GC overhead logs.
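The numbers line up with that explanation (back-of-the-envelope arithmetic, ignoring G1 region rounding): by default G1 lets the young generation grow up to G1MaxNewSizePercent (60%) of the heap, while NewRatio=2 pins it at a third of the heap:
# heap = 31232m = 32749125632 bytes
echo $(( 32749125632 * 60 / 100 ))   # ~19.6 GB -> the default MaxNewSize above
echo $(( 32749125632 / 3 ))          # ~10.9 GB -> MaxNewSize once NewRatio=2 is set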
Then we decided to let the JVM size the young generation ergonomically, so we manually set only -XX:MaxGCPauseMillis=400:
./jdk/bin/java -Xms31232m -Xmx31232m -XX:MaxGCPauseMillis=400 -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
uint ConcGCThreads = 3 {product} {ergonomic}
size_t G1HeapRegionSize = 16777216 {product} {ergonomic}
uintx MaxGCPauseMillis = 400 {product} {command line}
size_t MaxNewSize = 19646119936 {product} {ergonomic}
uintx NewRatio = 2 {product} {default}
size_t NewSize = 1363144 {product} {default}
size_t OldSize = 5452592 {product} {default}
uint ParallelGCThreads = 13 {product} {default}
The idea was to reduce GC frequency and make each collection more efficient. It did solve the old GC frequency issue, but young GC activity became very high again, and as a consequence we were back to lots of circuit breaker errors and GC overhead logs.
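(The young/old GC frequencies we describe can be confirmed from the GC counters in node stats, e.g. with host/port again as placeholders:)
curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.gc.collectors'
which reports collection_count and collection_time_in_millis per collector.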
Does anyone suspect anything else? Are there any other GC settings we should be taking into consideration?
Thanks in advance.