Master node is not able to reclaim heap memory via garbage collection

The active master node in my Elasticsearch cluster is not able to reclaim heap memory through garbage collection. All of the standby master nodes are fine.
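
For reference, this is roughly how the heap growth can be observed from the outside; master01:9200 below is a placeholder for the active master's HTTP endpoint:

# Poll per-node heap usage every 30 seconds; heap.percent on the active master
# climbs continuously, while the standby masters stay flat.
while true; do
  curl -s 'http://master01:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,heap.current,heap.max'
  sleep 30
done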

  • Symptoms: Heap usage keeps increasing and is never reclaimed

  • ES Version: 7.16.3

  • Nodes: 3 master nodes, 12 data nodes

  • Things I tried:

    • Changed the collector from G1GC to CMS, then rolled back
    • Failed the master role over to another node
    • Updated the kernel
    • Reinstalled the OS
    • Took a heap dump (see below)
  • Node specs
    CPU: 4 vcores
    OS: CentOS7 (kernel: 3.10.0-1160.49.1.el7.x86_64)
    System memory: 8GB (heap 4GB)
    JVM:
    openjdk version "11.0.15" 2022-04-19
    OpenJDK Runtime Environment Temurin-11.0.15+10 (build 11.0.15+10)
    OpenJDK 64-Bit Server VM Temurin-11.0.15+10 (build 11.0.15+10, mixed mode)

  • Settings
    elasticsearch.yml

cluster.name: cluster-name
node.name: cluster-name-master03
node.master: true
node.data: false
network.bind_host: 0.0.0.0
network.publish_host: _eth0_
cluster.initial_master_nodes: ["ip_of_master1:9300","ip_of_master2:9300","ip_of_master3:9300"]
discovery.seed_hosts: ["ip_of_master1:9300","ip_of_master2:9300","ip_of_master3:9300"]
http.cors.enabled: true
http.cors.allow-origin: "*"
bootstrap.system_call_filter: false
thread_pool.write.queue_size: 10000
thread_pool.search.queue_size: 10000
thread_pool.search.max_queue_size: 10000
thread_pool.search.min_queue_size: 10000
cluster.routing.allocation.awareness.attributes: rack_id

jvm.options

-Xms4g
-Xmx4g

-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

-Djava.io.tmpdir=${ES_TMPDIR}

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch

-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

java options

java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -Xms4g -Xmx4g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-2487676337507585022 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Xms4g -Xmx4g -XX:MaxDirectMemorySize=2147483648 -XX:G1HeapRegionSize=4m -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
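
As a sanity check on the GC side, the unified gc.log configured above can be scanned for full collections and the usual G1 trouble signs; a rough sketch:

# Count full GC pauses; with a real retention problem they reclaim little or nothing.
grep -c 'Pause Full' /var/log/elasticsearch/gc.log

# G1 warning signs: evacuation failures and humongous-region churn.
grep -E 'to-space exhausted|Humongous regions' /var/log/elasticsearch/gc.log | tail -n 20
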
  • Additional info
    • Shards: total 1540, primary 711
    • Total data (including replicas): 540GiB
    • Some indices are closed for operational purposes
    • Snapshotting is running in the background
    • _nodes/hot_threads => shows nothing on the master (exact calls below)
    • _cat/tasks => shows nothing on the master
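
For completeness, those two checks were along these lines (host name again a placeholder for the active master):

# Hot threads on the elected master only; it comes back essentially idle.
curl -s 'http://master01:9200/_nodes/_master/hot_threads?threads=5'

# Cluster-wide task list; nothing long-running is attributed to the master node.
curl -s 'http://master01:9200/_cat/tasks?v'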

I decided to take a heap dump, and after analyzing it with Eclipse MAT I found the following:

Image 1. High retained heap held by a few classes

The MAT report says that a ConcurrentHashMap$Node[] referenced by org.elasticsearch.gateway.GatewayAllocator occupies 95.99% of the heap bytes, which looks like it holds all the data that cannot be garbage collected.

One instance of java.util.concurrent.ConcurrentHashMap$Node[] loaded by <system class loader> occupies 2,133,307,840 (95.99%) bytes. The instance is referenced by org.elasticsearch.gateway.GatewayAllocator @ 0x7031af4f0 , loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x700000000. The memory is accumulated in one instance of java.util.concurrent.ConcurrentHashMap$Node[], loaded by <system class loader>, which occupies 2,133,307,840 (95.99%) bytes.

Keywords

java.util.concurrent.ConcurrentHashMap$Node[]
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x700000000
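
For what it's worth, the MAT finding can be cross-checked on the live process with a class histogram (pid file path taken from the startup command above); a sketch:

# -histo:live forces a full GC first, so whatever remains in the histogram is
# genuinely reachable; ConcurrentHashMap$Node dominating here matches the MAT suspect.
sudo -u elasticsearch jmap -histo:live $(cat /var/run/elasticsearch/elasticsearch.pid) | head -n 20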

Is there anything I can do? Thanks in advance.

Why do you have these settings on your master nodes? You should send indexing and search requests directly to the data nodes and make sure the dedicated master nodes do not serve any user requests so they can focus on managing the cluster.

Thanks for your suggestion. Those are set for deployment simplicity, but I see that they could be removed with a simple if statement in our Ansible playbook.

That said, I don't think this is the cause of the heap filling up, because the master nodes have been removed from the load balancer and therefore don't receive any user requests.
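
That can be backed up from the nodes themselves: HTTP connection stats for the master-eligible nodes should show next to no open client connections. Something along these lines (host name is a placeholder):

# HTTP stats for master-eligible nodes only; http.current_open staying near zero
# (aside from monitoring) confirms no client traffic reaches the dedicated masters.
curl -s 'http://master01:9200/_nodes/master:true/stats/http?filter_path=nodes.*.name,nodes.*.http'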
