Elasticsearch G1GC tuning

We are trying to evaluate an Elasticsearch upgrade from 5.6 to 7.8 and are seeing some unexpected memory pressure on our test 7.8 cluster.

Production Cluster:
Elasticsearch 5.6 - 12 Nodes - 3 Hot 9 Warm - 64GB RAM, 31GB JVM - Java 8 using CMS

Test Cluster:
Elasticsearch 7.8 - 12 Nodes - 3 Hot 9 Warm - 64GB RAM, 31GB JVM - Java 14 using G1GC

We have been running a test for a week or so to evaluate how our system handles Elasticsearch 7.8. We also monitor and log the heap usage for each node periodically.

This is a typical heap pattern on our production cluster. This graph reports the max heap.

Now this is what our test cluster looks like:

I did notice that the overall average heap usage has actually dropped. The concerning part is the spikes, because we have been seeing the parent circuit breaker trip with a CircuitBreakingException during these times.
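To line the spikes up with breaker trips, the parent breaker's estimate can be read from the node stats API. Below is a minimal sketch, assuming the body of `GET _nodes/stats/breaker` has already been fetched and decoded into a dict. The function name and the sample values are made up for illustration and are not from our cluster.

```python
def parent_breaker_usage(nodes_stats):
    """Return {node_name: (used_fraction, tripped_count)} for the parent breaker.

    `nodes_stats` is the decoded JSON body of GET _nodes/stats/breaker.
    """
    usage = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        parent = node.get("breakers", {}).get("parent")
        if not parent:
            continue  # node did not report a parent breaker
        limit = parent["limit_size_in_bytes"]
        estimated = parent["estimated_size_in_bytes"]
        fraction = estimated / limit if limit else 0.0
        usage[node.get("name", node_id)] = (fraction, parent.get("tripped", 0))
    return usage


# Made-up sample shaped like the API response (values are illustrative only).
sample = {
    "nodes": {
        "abc123": {
            "name": "warm-1",
            "breakers": {
                "parent": {
                    "limit_size_in_bytes": 1000,
                    "estimated_size_in_bytes": 950,
                    "tripped": 3,
                }
            },
        }
    }
}

for name, (fraction, tripped) in parent_breaker_usage(sample).items():
    print(f"{name}: parent breaker at {fraction:.0%} of limit, tripped {tripped} times")
```

Polling this alongside the heap graph should show whether a node is sitting close to the breaker limit when a spike occurs.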

To give a bit more context, from midnight to around 8am our systems do various things with the data in Elasticsearch. A BI system extracts data using scrolls and a few other programs run some aggregations. We had to tweak the aggregations because of the 10,000 bucket limit so these actually run longer with 7.8.

Our production 5.6 cluster runs the same pattern of extraction and aggregation jobs, but we see no spikes at all during this time.

There are no real errors that occur with our test cluster apart from increased memory spikes.

I have been checking the gc logs for any clues. Using GCeasy to generate some reports:
Test Warm Node:



Test Hot Node:


(An overactive set of aggregations causes that spike in heap after gc on the hot node)

Wondering if any GC tuning can be done to reduce the heap spikes.

GCeasy recommended increasing G1HeapRegionSize and enabling UseStringDeduplication. I have made both changes to evaluate them but haven't seen any improvement yet.

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
14-:-XX:G1HeapRegionSize=16M
14-:-XX:+UseStringDeduplication

## JVM temporary directory
-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err-pid%p.log

## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
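For anyone reading the file above, the numeric prefixes select options by JDK major version: `8:` means JDK 8 only, `14-:` means JDK 14 and later, `8-13:` means JDK 8 through 13, and a line starting with `-` applies to every version. Here is a rough sketch of that selection logic, to double-check which flags a given JDK actually receives. The function name is hypothetical, and this is not the parser Elasticsearch itself uses.

```python
import re

# Matches "8:", "14-:" or "8-13:" at the start of a line, followed by a JVM flag.
_PREFIX = re.compile(r"^(\d+)(-(\d+)?)?:(-.*)$")


def options_for_jdk(lines, jdk_major):
    """Return the flags from jvm.options-style lines that apply to jdk_major."""
    selected = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("-"):
            selected.append(line)  # no prefix: applies to every JDK version
            continue
        match = _PREFIX.match(line)
        if not match:
            continue  # unrecognised line; this sketch simply skips it
        low = int(match.group(1))
        if match.group(2) is None:
            high = low  # "8:" -> exactly that version
        elif match.group(3) is None:
            high = None  # "14-:" -> that version and later
        else:
            high = int(match.group(3))  # "8-13:" -> inclusive range
        if jdk_major >= low and (high is None or jdk_major <= high):
            selected.append(match.group(4))
    return selected


# A few lines copied from the file above.
sample = [
    "8-13:-XX:+UseConcMarkSweepGC",
    "# 10-13:-XX:-UseConcMarkSweepGC",
    "14-:-XX:+UseG1GC",
    "14-:-XX:G1HeapRegionSize=16M",
    "-XX:+HeapDumpOnOutOfMemoryError",
    "8:-XX:+PrintGCDetails",
]
print(options_for_jdk(sample, 14))
print(options_for_jdk(sample, 8))
```

With the bundled JDK 14, only the `14-:` lines and the unprefixed lines should be picked up, so the CMS flags are ignored on the test cluster.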

Is there any other G1GC tuning we could do?

Thanks

Have you made sure your heap size is set to a value below the compressed pointers threshold?

What is the full output of the cluster stats API?

The heap is currently set to 31GB on each node in the test cluster, the same as on the production cluster.

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms31g
-Xmx31g

Cluster stats for the test cluster:

{
  "_nodes" : {
    "total" : 12,
    "successful" : 12,
    "failed" : 0
  },
  "cluster_name" : "dev-tmx-hds",
  "cluster_uuid" : "-pOWSGXhTWi9NZUBiB9s5g",
  "timestamp" : 1604291976655,
  "status" : "green",
  "indices" : {
    "count" : 242,
    "shards" : {
      "total" : 992,
      "primaries" : 498,
      "replication" : 0.9919678714859438,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 6,
          "avg" : 4.099173553719008
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 2.0578512396694215
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.9917355371900827
        }
      }
    },
    "docs" : {
      "count" : 31100539785,
      "deleted" : 3958738
    },
    "store" : {
      "size" : "7.8tb",
      "size_in_bytes" : 8599139888353
    },
    "fielddata" : {
      "memory_size" : "1kb",
      "memory_size_in_bytes" : 1112,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "873mb",
      "memory_size_in_bytes" : 915427615,
      "total_count" : 23234704,
      "hit_count" : 1668621,
      "miss_count" : 21566083,
      "cache_size" : 14888,
      "cache_count" : 32877,
      "evictions" : 17989
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 11324,
      "memory" : "119.5mb",
      "memory_in_bytes" : 125409112,
      "terms_memory" : "32.2mb",
      "terms_memory_in_bytes" : 33863936,
      "stored_fields_memory" : "66.5mb",
      "stored_fields_memory_in_bytes" : 69818480,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "60.1kb",
      "norms_memory_in_bytes" : 61568,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "20.6mb",
      "doc_values_memory_in_bytes" : 21665128,
      "index_writer_memory" : "231.6mb",
      "index_writer_memory_in_bytes" : 242907040,
      "version_map_memory" : "14.7mb",
      "version_map_memory_in_bytes" : 15436454,
      "fixed_bit_set" : "794mb",
      "fixed_bit_set_memory_in_bytes" : 832663232,
      "max_unsafe_auto_id_timestamp" : 1604040473082,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 498,
          "index_count" : 171
        },
        {
          "name" : "byte",
          "count" : 12,
          "index_count" : 3
        },
        {
          "name" : "date",
          "count" : 724,
          "index_count" : 241
        },
        {
          "name" : "double",
          "count" : 414,
          "index_count" : 107
        },
        {
          "name" : "float",
          "count" : 767,
          "index_count" : 36
        },
        {
          "name" : "integer",
          "count" : 679,
          "index_count" : 72
        },
        {
          "name" : "keyword",
          "count" : 2160,
          "index_count" : 242
        },
        {
          "name" : "long",
          "count" : 16,
          "index_count" : 3
        },
        {
          "name" : "nested",
          "count" : 9,
          "index_count" : 9
        },
        {
          "name" : "object",
          "count" : 452,
          "index_count" : 210
        },
        {
          "name" : "short",
          "count" : 1197,
          "index_count" : 174
        },
        {
          "name" : "text",
          "count" : 42,
          "index_count" : 6
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 12,
      "coordinating_only" : 0,
      "data" : 12,
      "ingest" : 12,
      "master" : 3,
      "remote_cluster_client" : 12
    },
    "versions" : [
      "7.8.1"
    ],
    "os" : {
      "available_processors" : 96,
      "allocated_processors" : 96,
      "names" : [
        {
          "name" : "Linux",
          "count" : 12
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 18.04.5 LTS",
          "count" : 12
        }
      ],
      "mem" : {
        "total" : "746.4gb",
        "total_in_bytes" : 801470939136,
        "free" : "9gb",
        "free_in_bytes" : 9671577600,
        "used" : "737.4gb",
        "used_in_bytes" : 791799361536,
        "free_percent" : 1,
        "used_percent" : 99
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 77
      },
      "open_file_descriptors" : {
        "min" : 605,
        "max" : 2579,
        "avg" : 1100
      }
    },
    "jvm" : {
      "max_uptime" : "2.9d",
      "max_uptime_in_millis" : 251925364,
      "versions" : [
        {
          "version" : "14.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "14.0.1+7",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 12
        }
      ],
      "mem" : {
        "heap_used" : "157.2gb",
        "heap_used_in_bytes" : 168887219840,
        "heap_max" : "372gb",
        "heap_max_in_bytes" : 399431958528
      },
      "threads" : 1036
    },
    "fs" : {
      "total" : "52.5tb",
      "total_in_bytes" : 57776236412928,
      "free" : "44.6tb",
      "free_in_bytes" : 49047599960064,
      "available" : "41.9tb",
      "available_in_bytes" : 46132190859264
    },
    "plugins" : [
      {
        "name" : "repository-s3",
        "version" : "7.8.1",
        "elasticsearch_version" : "7.8.1",
        "java_version" : "1.8",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 12
      },
      "http_types" : {
        "netty4" : 12
      }
    },
    "discovery_types" : {
      "zen" : 12
    },
    "packaging_types" : [
      {
        "flavor" : "oss",
        "type" : "deb",
        "count" : 12
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 0,
      "processor_stats" : { }
    }
  }
}
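A few ratios are easier to read than the raw counters in that output. This is a small sketch using only numbers copied from the response above. Which ratios are worth looking at is my own choice, and the heap figure is a cluster-wide snapshot, so it will not show the per-node spikes.

```python
# Values copied from the cluster stats response above.
heap_used = 168887219840       # nodes.jvm.mem.heap_used_in_bytes
heap_max = 399431958528        # nodes.jvm.mem.heap_max_in_bytes
query_cache_hits = 1668621     # indices.query_cache.hit_count
query_cache_total = 23234704   # indices.query_cache.total_count
total_shards = 992             # indices.shards.total
data_nodes = 12                # nodes.count.data

heap_used_pct = 100 * heap_used / heap_max
cache_hit_pct = 100 * query_cache_hits / query_cache_total
shards_per_node = total_shards / data_nodes

print(f"cluster-wide heap used: {heap_used_pct:.1f}%")
print(f"query cache hit rate:   {cache_hit_pct:.1f}%")
print(f"shards per data node:   {shards_per_node:.1f}")
```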

Looking through our GC logs I can see:

 Heap address: 0x0000001001000000, size: 31744 MB, Compressed Oops mode: Non-zero disjoint base: 0x0000001000000000, Oop shift amount: 3

Does this indicate that it is not using zero based compressed oops? Should we reduce the heap size?

I have set the test cluster jvm heap to 30.5g. I can see it is now using zero based compressed oops.

Heap address: 0x000000008d000000, size: 30512 MB, Compressed Oops mode: Zero based, Oop shift amount: 3
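Checking this on every node by hand gets tedious, so here is a minimal sketch that pulls the mode out of a GC log line. It assumes the line has the same shape as the two quoted above. The function name is made up, and other JDK versions may word the line differently.

```python
import re

# Shaped after the two log lines quoted above.
_OOPS = re.compile(
    r"Heap address: (0x[0-9a-fA-F]+), size: (\d+) MB, "
    r"Compressed Oops mode: ([^,]+)"
)


def compressed_oops_mode(log_line):
    """Parse a 'Heap address ... Compressed Oops mode ...' GC log line.

    Returns None when the line does not match the expected shape.
    """
    match = _OOPS.search(log_line)
    if match is None:
        return None
    address, size_mb, mode = match.groups()
    mode = mode.strip()
    return {
        "heap_address": int(address, 16),
        "size_mb": int(size_mb),
        "mode": mode,
        "zero_based": mode.startswith("Zero based"),
    }


# The two lines from our GC logs, before and after lowering the heap.
before = (
    "Heap address: 0x0000001001000000, size: 31744 MB, Compressed Oops mode: "
    "Non-zero disjoint base: 0x0000001000000000, Oop shift amount: 3"
)
after = (
    "Heap address: 0x000000008d000000, size: 30512 MB, "
    "Compressed Oops mode: Zero based, Oop shift amount: 3"
)

for line in (before, after):
    info = compressed_oops_mode(line)
    print(info["size_mb"], "MB ->", info["mode"], "| zero based:", info["zero_based"])
```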

Based on a few initial tests using long-running scrolls, I can still see heap spikes occurring, but the highest I have seen so far is 26g.

I suggest upgrading to 7.9.3 to get the benefits of https://github.com/elastic/elasticsearch/pull/58674, which should help avoid the CircuitBreakingExceptions you mentioned in your OP. I don't think any other tuning is generally recommended.

The graph of heap usage over time with G1GC typically looks quite different from the ones with CMS, but that in itself isn't anything to worry about.

Rather than adjusting jvm.options directly, you should put any changes into separate files in the jvm.options.d directory.
