Heap usage holds steady at max and GC does not run. Need to force restart the cluster

Cluster configuration: 3 Nodes

GET _nodes/_all/jvm
Elasticsearch version: 7.10.1
Java version: 15.0.1
gc_collectors: [ "G1 Young Generation", "G1 Old Generation"]
heap_max_in_bytes: 30064771072 (30 GB)

Circuit breaker configs: (defaults)
GET _cluster/settings?include_defaults=true
total.limit: 95%
use_real_memory: true

--
Problem:

Our cluster blocks/back-pressures indexing (write) requests, and clients error out with ConnectionTimeout exceptions. This happens a couple of times every day with no obvious pattern other than only write clients fail (readers seem unaffected), and the only solution we have is to restart the nodes. If we do not restart them, the cluster remains in this "partially available" state and indexers keep failing.

We inspected GC logs and it seems the garbage collector (G1) runs very rarely, and when it does it runs quick (spends 300 ms in 1000 ms observation periods). However, the cluster's memory usage is always close to maximum (30 GB, with minimal oscillations); at some point the heap usage holds completely flat at maximum and we need to restart the cluster.

Each node has 64 GB of memory, and we have allocated 28 GB for JVM (per recommendations of keeping it under 50% of capacity and under 32 GBs).

--
Other info:

  • We originally encountered this error last month, and we thought it was a matter of it being under-resourced for the workloads so we doubled the JVM capacity from 14GB to 28 GB, but it made no difference.
  • CPU usage is low, between 10 - 20%.
  • No circuit breakers kicking in
  • No major slow queries in slow query logs
  • 841 total shards (well under the recommended maximum of sum(nodes.max_heap) * 20) and close to 2 TBs of data.

--

Our current hypothesis is that the GC is not acting on time, but unsure what exactly is the issue.

Thanks for the help in advance

What is the full output of the cluster stats API?

I see you are running a very old version that has been EOL for quite some time. I would recommend you upgrade to at least 7.17.

@Christian_Dahlqvist here is the full output of cluster stats API

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "aname",
  "cluster_uuid" : "anid",
  "timestamp" : 1685970609683,
  "status" : "green",
  "indices" : {
    "count" : 403,
    "shards" : {
      "total" : 897,
      "primaries" : 433,
      "replication" : 1.071593533487298,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 15,
          "avg" : 2.225806451612903
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 1.0744416873449132
        },
        "replication" : {
          "min" : 1.0,
          "max" : 2.0,
          "avg" : 1.0173697270471465
        }
      }
    },
    "docs" : {
      "count" : 13374569927,
      "deleted" : 34218233
    },
    "store" : {
      "size_in_bytes" : 7868171841856,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 37041304,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 66597175,
      "total_count" : 38320018,
      "hit_count" : 5671389,
      "miss_count" : 32648629,
      "cache_size" : 366666,
      "cache_count" : 367134,
      "evictions" : 468
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 15607,
      "memory_in_bytes" : 641781036,
      "terms_memory_in_bytes" : 485175136,
      "stored_fields_memory_in_bytes" : 11308760,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 65740288,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 79556852,
      "index_writer_memory_in_bytes" : 3589432458,
      "version_map_memory_in_bytes" : 37316230,
      "fixed_bit_set_memory_in_bytes" : 612932576,
      "max_unsafe_auto_id_timestamp" : 1685882997431,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 37,
          "index_count" : 7
        },
        {
          "name" : "boolean",
          "count" : 2538,
          "index_count" : 228
        },
        {
          "name" : "byte",
          "count" : 28,
          "index_count" : 28
        },
        {
          "name" : "constant_keyword",
          "count" : 3,
          "index_count" : 1
        },
        {
          "name" : "date",
          "count" : 2555,
          "index_count" : 400
        },
        {
          "name" : "date_nanos",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "date_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "double",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "double_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "flattened",
          "count" : 27,
          "index_count" : 3
        },
        {
          "name" : "float",
          "count" : 1200,
          "index_count" : 145
        },
        {
          "name" : "float_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "geo_point",
          "count" : 50,
          "index_count" : 8
        },
        {
          "name" : "geo_shape",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "half_float",
          "count" : 9,
          "index_count" : 3
        },
        {
          "name" : "integer",
          "count" : 138,
          "index_count" : 26
        },
        {
          "name" : "integer_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "ip",
          "count" : 93,
          "index_count" : 9
        },
        {
          "name" : "ip_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "keyword",
          "count" : 35869,
          "index_count" : 402
        },
        {
          "name" : "long",
          "count" : 9306,
          "index_count" : 309
        },
        {
          "name" : "long_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "nested",
          "count" : 97,
          "index_count" : 30
        },
        {
          "name" : "object",
          "count" : 15208,
          "index_count" : 289
        },
        {
          "name" : "shape",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "short",
          "count" : 56,
          "index_count" : 29
        },
        {
          "name" : "text",
          "count" : 31010,
          "index_count" : 397
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "pattern_capture",
          "count" : 2,
          "index_count" : 2
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 2,
          "index_count" : 2
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "uax_url_email",
          "count" : 2,
          "index_count" : 2
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 2,
          "index_count" : 2
        },
        {
          "name" : "unique",
          "count" : 2,
          "index_count" : 2
        }
      ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "coordinating_only" : 0,
      "data" : 3,
      "data_cold" : 3,
      "data_content" : 3,
      "data_hot" : 3,
      "data_warm" : 3,
      "ingest" : 3,
      "master" : 3,
      "ml" : 3,
      "remote_cluster_client" : 3,
      "transform" : 3,
      "voting_only" : 0
    },
    "versions" : [
      "7.10.1"
    ],
    "os" : {
      "available_processors" : 48,
      "allocated_processors" : 48,
      "names" : [
        {
          "name" : "Linux",
          "count" : 3
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Amazon Linux 2",
          "count" : 3
        }
      ],
      "mem" : {
        "total_in_bytes" : 199429971968,
        "free_in_bytes" : 1609449472,
        "used_in_bytes" : 197820522496,
        "free_percent" : 1,
        "used_percent" : 99
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 25
      },
      "open_file_descriptors" : {
        "min" : 4737,
        "max" : 5159,
        "avg" : 4901
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 224947021,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 59089259664,
        "heap_max_in_bytes" : 90194313216
      },
      "threads" : 771
    },
    "fs" : {
      "total_in_bytes" : 10305942859776,
      "free_in_bytes" : 2418963689472,
      "available_in_bytes" : 2418963689472
    },
    "plugins" : [
      {
        "name" : "repository-s3",
        "version" : "7.10.1",
        "elasticsearch_version" : "7.10.1",
        "java_version" : "1.8",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : {
        "security4" : 3
      },
      "http_types" : {
        "security4" : 3
      }
    },
    "discovery_types" : {
      "zen" : 3
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 3
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 12,
      "processor_stats" : {
        "conditional" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}

I do not see anything immediately wrong there. Do you have any non-default cluster settings that could impact heap usage? What kind of load is the cluster under?

Hey Christian, thank you for the quick reply. Looking at our configuration, we noticed we had none of the GC configurations set up.

GET /_nodes/jvm?pretty showed

   "input_arguments" : [
          "-Xshare:auto",
          "-Des.networkaddress.cache.ttl=60",
          "-Des.networkaddress.cache.negative.ttl=10",
          "-XX:+AlwaysPreTouch",
          "-Xss1m",
          "-Djava.awt.headless=true",
          "-Dfile.encoding=UTF-8",
          "-Djna.nosys=true",
          "-XX:-OmitStackTraceInFastThrow",
          "-XX:+ShowCodeDetailsInExceptionMessages",
          "-Dio.netty.noUnsafe=true",
          "-Dio.netty.noKeySetOptimization=true",
          "-Dio.netty.recycler.maxCapacityPerThread=0",
          "-Dio.netty.allocator.numDirectArenas=0",
          "-Dlog4j.shutdownHookEnabled=false",
          "-Dlog4j2.disable.jmx=true",
          "-Djava.locale.providers=SPI,COMPAT",
          "-Xms28g",
          "-Xmx28g"
]

We went ahead and copied the default configs from elasticsearch/jvm.options at 7.10 · elastic/elasticsearch · GitHub

Our JVM configs are now the following, and the cluster seems to be doing much better.

   "input_arguments" : [
          "-Xshare:auto",
          "-Des.networkaddress.cache.ttl=60",
          "-Des.networkaddress.cache.negative.ttl=10",
          "-XX:+AlwaysPreTouch",
          "-Xss1m",
          "-Djava.awt.headless=true",
          "-Dfile.encoding=UTF-8",
          "-Djna.nosys=true",
          "-XX:-OmitStackTraceInFastThrow",
          "-XX:+ShowCodeDetailsInExceptionMessages",
          "-Dio.netty.noUnsafe=true",
          "-Dio.netty.noKeySetOptimization=true",
          "-Dio.netty.recycler.maxCapacityPerThread=0",
          "-Dio.netty.allocator.numDirectArenas=0",
          "-Dlog4j.shutdownHookEnabled=false",
          "-Dlog4j2.disable.jmx=true",
          "-Djava.locale.providers=SPI,COMPAT",
          "-Xms28g",
          "-Xmx28g",
          "-XX:+UseG1GC",
          "-XX:G1ReservePercent=25",
          "-XX:InitiatingHeapOccupancyPercent=30",
          "-XX:+HeapDumpOnOutOfMemoryError",
          "-XX:+HeapDumpBeforeFullGC",
          "-XX:HeapDumpPath=/var/log/elasticsearch",
          "-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m",
          "-XX:MaxDirectMemorySize=15032385536",
          "-Des.path.home=/usr/share/elasticsearch",
          "-Des.path.conf=/etc/elasticsearch",
          "-Des.distribution.flavor=default",
          "-Des.distribution.type=rpm",
          "-Des.bundled_jdk=true"
        ]

Looking at the convo in Garbage Collection Not Working - #5 by danielmitterdorfer

I imagine our cluster was running with a GC implicitly chosen by the JVM, which was clearly not running as needed.

Again, thanks for the help!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.