Out Of Memory errors after G1GC migration

We have an Elasticsearch environment with 83 nodes and 3 masters running in Docker containers, which was recently migrated from the CMS garbage collector to G1GC.

The cluster version is 7.4.2 ("docker.elastic.co/elasticsearch/elasticsearch:7.4.2"), with 3440 indices and 13400 shards, and we followed all the documented guidelines: Install Elasticsearch with Docker | Elasticsearch Guide [7.x] | Elastic

We use these as env variables for the nodes:

es_environment:
  - "SERVICE_IGNORE=true"
  - "ES_JAVA_OPTS=-Xms32730m -Xmx32730m"
es_command: >-
  elasticsearch
      -E bootstrap.memory_lock=true
      -E cluster.routing.use_adaptive_replica_selection=true
      -E http.compression=true
      -E indices.breaker.fielddata.limit=90%
      -E indices.breaker.request.limit=90%
      -E indices.breaker.total.limit=90%
      -E logger.level=INFO
      -E network.host={{ es_bind_ip }},{{ es_public_ip }}
      -E node.data=true
      -E node.ingest=true
      -E node.master=false
      -E search.remote.connect=false

After migrating to G1GC, OOM events have started to happen, 3 and sometimes 4 times a week, on different nodes and with different indices and shards.

This is the "jvm.options" file that we use:

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms1g
-Xmx1g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
# 10-:-XX:-UseConcMarkSweepGC
# 10-:-XX:-UseCMSInitiatingOccupancyOnly
# 10-:-XX:+UseG1GC
# 10-:-XX:G1ReservePercent=25
# 10-:-XX:InitiatingHeapOccupancyPercent=30
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

## DNS cache policy
# cache ttl in seconds for positive DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.ttl; set to -1 to cache forever
-Des.networkaddress.cache.ttl=60
# cache ttl in seconds for negative DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.negative.ttl; set to -1 to cache
# forever
-Des.networkaddress.cache.negative.ttl=10

## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dio.netty.allocator.numDirectArenas=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch needs to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locales
9-:-Djava.locale.providers=COMPAT

Can you please give us any insight into missing or wrongly set parameters, or any others that could help us prevent this G1GC behavior?

Thank you!

Welcome to our community! :smiley: We aren't all guys though :slight_smile:

What is the output from the _cluster/stats?pretty&human API?
7.4 is EOL, can you upgrade?

Hi @warkolm you are totally right.

{
  "_nodes" : {
    "total" : 82,
    "successful" : 82,
    "failed" : 0
  },
  "cluster_name" : "es-research-cloud",
  "cluster_uuid" : "Tnbn6gyVRUWU4p-m--4gIA",
  "timestamp" : 1630887322520,
  "status" : "yellow",
  "indices" : {
    "count" : 3302,
    "shards" : {
      "total" : 12892,
      "primaries" : 6474,
      "replication" : 0.991350015446401,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 120,
          "avg" : 3.9043004239854633
        },
        "primaries" : {
          "min" : 1,
          "max" : 60,
          "avg" : 1.9606299212598426
        },
        "replication" : {
          "min" : 0.0,
          "max" : 2.0,
          "avg" : 0.9933121340601655
        }
      }
    },
    "docs" : {
      "count" : 190691346275,
      "deleted" : 14123071412
    },
    "store" : {
      "size" : "257.4tb",
      "size_in_bytes" : 283081033649221
    },
    "fielddata" : {
      "memory_size" : "771.5gb",
      "memory_size_in_bytes" : 828408349128,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "245.2gb",
      "memory_size_in_bytes" : 263360361663,
      "total_count" : 52790536308,
      "hit_count" : 2448634476,
      "miss_count" : 50341901832,
      "cache_size" : 2190101,
      "cache_count" : 891705001,
      "evictions" : 889514900
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 311551,
      "memory" : "223.3gb",
      "memory_in_bytes" : 239813767943,
      "terms_memory" : "112.7gb",
      "terms_memory_in_bytes" : 121034671162,
      "stored_fields_memory" : "71.2gb",
      "stored_fields_memory_in_bytes" : 76453497472,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "64.8mb",
      "norms_memory_in_bytes" : 67994304,
      "points_memory" : "38gb",
      "points_memory_in_bytes" : 40891775859,
      "doc_values_memory" : "1.2gb",
      "doc_values_memory_in_bytes" : 1365829146,
      "index_writer_memory" : "3.2gb",
      "index_writer_memory_in_bytes" : 3466346404,
      "version_map_memory" : "206.7mb",
      "version_map_memory_in_bytes" : 216829898,
      "fixed_bit_set" : "79.6mb",
      "fixed_bit_set_memory_in_bytes" : 83486560,
      "max_unsafe_auto_id_timestamp" : 1630887238442,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 82,
      "coordinating_only" : 0,
      "data" : 79,
      "ingest" : 79,
      "master" : 3,
      "ml" : 82,
      "voting_only" : 0
    },
    "versions" : [
      "7.4.2"
    ],
    "os" : {
      "available_processors" : 1968,
      "allocated_processors" : 1968,
      "names" : [
        {
          "name" : "Linux",
          "count" : 82
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 82
        }
      ],
      "mem" : {
        "total" : "10tb",
        "total_in_bytes" : 11076833697792,
        "free" : "477.7gb",
        "free_in_bytes" : 512974430208,
        "used" : "9.6tb",
        "used_in_bytes" : 10563859267584,
        "free_percent" : 5,
        "used_percent" : 95
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 1274
      },
      "open_file_descriptors" : {
        "min" : 2355,
        "max" : 6472,
        "avg" : 5451
      }
    },
    "jvm" : {
      "max_uptime" : "10.4d",
      "max_uptime_in_millis" : 903373624,
      "versions" : [
        {
          "version" : "13.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "13.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 82
        }
      ],
      "mem" : {
        "heap_used" : "1.8tb",
        "heap_used_in_bytes" : 2004458201224,
        "heap_max" : "2.5tb",
        "heap_max_in_bytes" : 2814747082752
      },
      "threads" : 18192
    },
    "fs" : {
      "total" : "404.9tb",
      "total_in_bytes" : 445241552904192,
      "free" : "145.4tb",
      "free_in_bytes" : 159914219859968,
      "available" : "144.6tb",
      "available_in_bytes" : 159032375832576
    },
    "plugins" : [
      {
        "name" : "repository-s3",
        "version" : "7.4.2",
        "elasticsearch_version" : "7.4.2",
        "java_version" : "1.8",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : {
        "security4" : 82
      },
      "http_types" : {
        "security4" : 82
      }
    },
    "discovery_types" : {
      "zen" : 82
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 82
      }
    ]
  }
}
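As a quick cross-check on the stats above (a hedged sketch; the values are copied straight from the JSON), dividing heap_max by the node count shows that every node is running the same ~32 GB heap:

```python
# Values copied from the _cluster/stats output above.
nodes = 82
heap_max_in_bytes = 2814747082752  # "heap_max" : "2.5tb"

per_node = heap_max_in_bytes // nodes
print(per_node, round(per_node / 1024**3, 2))  # 34326183936 bytes, ~31.97 GiB per node
```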

Unfortunately, we cannot upgrade this cluster yet, so we have to come up with something that solves this issue while keeping the same version.

Can you check the logs from a node, any node, from when it started up and see if it mentions compressed OOPs?

Why have you overridden the circuit breaker limits the way you have? That looks like a likely cause of problems, apart from setting the heap size too large to benefit from compressed pointers.


Hey @warkolm, here are the init logs from the docker image that we use:

A note: I ran one local init test just to capture these logs.

{"type": "server", "timestamp": "2021-09-15T10:17:42,545Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "docker-cluster", "node.name": "16797c8261c9", "message": "using [1] data paths, mounts [[/ (overlay)]], net usable_space [229.1gb], net total_space [250.9gb], types [overlay]" }

{"type": "server", "timestamp": "2021-09-15T10:17:42,548Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "docker-cluster", "node.name": "16797c8261c9", "message": "heap size [1gb], compressed ordinary object pointers [true]" }

{"type": "server", "timestamp": "2021-09-15T10:17:42,551Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "16797c8261c9", "message": "node name [16797c8261c9], node ID [Dh7JZ40YRz-CsoGgG9M8TA], cluster name [docker-cluster]" }

{"type": "server", "timestamp": "2021-09-15T10:17:42,551Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "16797c8261c9", "message": "version[7.4.2], pid[1], build[default/docker/2f90bbf7b93631e52bafb59b3b049cb44ec25e96/2019-10-28T20:40:44.881551Z], OS[Linux/4.19.128-microsoft-standard/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13.0.1/13.0.1+9]" }

{"type": "server", "timestamp": "2021-09-15T10:17:42,552Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "16797c8261c9", "message": "JVM home [/usr/share/elasticsearch/jdk]" }

{"type": "server", "timestamp": "2021-09-15T10:17:42,552Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "16797c8261c9", "message": "JVM arguments [-Xms1g, -Xmx1g, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch-9366315355621092329, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Djava.locale.providers=COMPAT, -Des.cgroups.hierarchy.override=/, -Dio.netty.allocator.type=unpooled, -XX:MaxDirectMemorySize=536870912, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=docker, -Des.bundled_jdk=true]" }

We are using compressed OOPs, as you can see here: heap size [1gb], compressed ordinary object pointers [true]

This is the only parameter that differs from the CMS GC config: "-Dio.netty.allocator.type=unpooled"

When we use CMS as the default GC, this parameter automatically changes to "-Dio.netty.allocator.type=pooled", and I don't know why.

Is your heap size only 1 GB?

Hey @Christian_Dahlqvist, first of all, thanks for your replies :smiley:

No, as I mentioned, I did a test by starting one small container just to capture the logs @warkolm was asking for. I did that because in our production systems we don't have INFO logs.

Heap size is 32730m, as set in ES_JAVA_OPTS (-Xms32730m -Xmx32730m).

That sounds too large to benefit from compressed pointers.


Hey @Christian_Dahlqvist, I got your point.

But we are right at the edge of compressed OOPs usage, and this heap size still gets the compressed OOPs benefits, as you can see here:

[root@es-dbs-055 elasticsearch]# /usr/share/elasticsearch/jdk/bin/java -Xmx32731m -XX:+PrintFlagsFinal 2> /dev/null | grep UseCompressedOops
     bool UseCompressedOops                        = true
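For context, compressed oops only work while the heap fits below 32 GiB (with the JVM's default 8-byte object alignment; the exact practical cutoff varies by JVM build and options). A quick sketch shows the 32731m value tested above sits just under that ceiling:

```python
MB = 1024 * 1024
xmx = 32731 * MB               # the -Xmx32731m heap tested above
oops_ceiling = 32 * 1024 * MB  # 32 GiB: hard upper bound for compressed oops
print(xmx < oops_ceiling, (oops_ceiling - xmx) // MB)  # True, 37 MiB of headroom
```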

@warkolm found evidence in the logs that confirms compressed OOPs usage:

heap size [31.9gb], compressed ordinary object pointers [true]

Any thoughts about: "-Dio.netty.allocator.type=unpooled" ?

@Christian_Dahlqvist Now that we know compressed OOPs are in use, could the memory pressure be related to our circuit breaker limits?

-E indices.breaker.fielddata.limit=90%
-E indices.breaker.request.limit=90%
-E indices.breaker.total.limit=90%

@Christian_Dahlqvist here is an example of the breakers:

"breakers" : {
        "request" : {
          "limit_size_in_bytes" : 30893565542,
          "limit_size" : "28.7gb",
          "estimated_size_in_bytes" : 0,
          "estimated_size" : "0b",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "fielddata" : {
          "limit_size_in_bytes" : 30893565542,
          "limit_size" : "28.7gb",
          "estimated_size_in_bytes" : 13034207228,
          "estimated_size" : "12.1gb",
          "overhead" : 1.03,
          "tripped" : 0
        },
        "in_flight_requests" : {
          "limit_size_in_bytes" : 34326183936,
          "limit_size" : "31.9gb",
          "estimated_size_in_bytes" : 254461,
          "estimated_size" : "248.4kb",
          "overhead" : 2.0,
          "tripped" : 0
        },
        "accounting" : {
          "limit_size_in_bytes" : 34326183936,
          "limit_size" : "31.9gb",
          "estimated_size_in_bytes" : 3943244512,
          "estimated_size" : "3.6gb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "parent" : {
          "limit_size_in_bytes" : 32609874739,
          "limit_size" : "30.3gb",
          "estimated_size_in_bytes" : 28285924640,
          "estimated_size" : "26.3gb",
          "overhead" : 1.0,
          "tripped" : 119
        }
      }
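The limits in that output can be reproduced from the heap size with simple percentage arithmetic (a hedged sketch; the heap value is taken from the in_flight_requests limit, which defaults to 100% of the heap). Notably, the parent breaker sits at 95%, the 7.x real-memory default, not at the configured indices.breaker.total.limit=90%:

```python
heap = 34326183936  # in_flight_requests limit above = 100% of the heap

# request and fielddata breakers: the configured 90% override
assert heap * 90 // 100 == 30893565542  # "28.7gb" in the output
# parent breaker: 95%, the 7.x default when the real-memory breaker is active
assert heap * 95 // 100 == 32609874739  # "30.3gb" in the output
print("breaker limits reproduced")
```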