Master node is not able to collect garbage memory

equisde · December 5, 2023, 9:40am

The active master node in my Elasticsearch cluster is not able to reclaim heap memory through garbage collection. The standby master nodes are fine.

Symptoms: Heap usage keeps increasing and is never reclaimed
ES Version: 7.16.3
Nodes: 3 master nodes, 12 data nodes
Things I tried:
- Switched from G1GC to CMS, then rolled back
- Failed over the master role to another node
- Updated the kernel
- Reinstalled the OS
- Took a heap dump (see below)
Node specs
CPU: 4 vcores
OS: CentOS7 (kernel: 3.10.0-1160.49.1.el7.x86_64)
System memory: 8GB (heap 4GB)
JVM:
openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment Temurin-11.0.15+10 (build 11.0.15+10)
OpenJDK 64-Bit Server VM Temurin-11.0.15+10 (build 11.0.15+10, mixed mode)
Settings
elasticsearch.yml

cluster.name: cluster-name
node.name: cluster-name-master03
node.master: true
node.data: false
network.bind_host: 0.0.0.0
network.publish_host: _eth0_
cluster.initial_master_nodes: ["ip_of_master1:9300","ip_of_master2:9300","ip_of_master3:9300"]
discovery.seed_hosts: ["ip_of_master1:9300","ip_of_master2:9300","ip_of_master3:9300"]
http.cors.enabled: true
http.cors.allow-origin: "*"
bootstrap.system_call_filter: false
thread_pool.write.queue_size: 10000
thread_pool.search.queue_size: 10000
thread_pool.search.max_queue_size: 10000
thread_pool.search.min_queue_size: 10000
cluster.routing.allocation.awareness.attributes: rack_id

jvm.options

-Xms4g
-Xmx4g

-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

-Djava.io.tmpdir=${ES_TMPDIR}

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch

-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
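
Since the `9-:-Xlog` line above already writes a GC log, one way to confirm the "heap keeps increasing" symptom is to track the heap occupancy left after each G1 pause. The snippet below is only a rough sketch: it assumes pause lines contain a `usedBefore->usedAfter(capacity)` triple in megabytes, which is what JDK 11 unified logging normally prints. The sample lines are made up for illustration and are not taken from this cluster.

```python
import re

# Matches the heap transition printed on G1 pause lines, e.g. "1500M->1200M(4096M)".
HEAP_RE = re.compile(r"(\d+)M->(\d+)M\((\d+)M\)")


def heap_after_gc(lines):
    """Return the heap occupancy in MB remaining after each logged GC pause."""
    after = []
    for line in lines:
        if "Pause" not in line:
            continue  # skip region details, safepoint lines, etc.
        match = HEAP_RE.search(line)
        if match:
            after.append(int(match.group(2)))
    return after


# Illustrative lines only (hypothetical, not from the poster's gc.log).
sample = [
    "[2023-12-05T09:00:00.000+0000][2487][gc] GC(10) Pause Young (Normal) (G1 Evacuation Pause) 1500M->1200M(4096M) 10.123ms",
    "[2023-12-05T09:00:00.000+0000][2487][gc,heap] GC(10) Eden regions: 100->0(120)",
    "[2023-12-05T09:10:00.000+0000][2487][gc] GC(11) Pause Young (Normal) (G1 Evacuation Pause) 1900M->1600M(4096M) 11.456ms",
    "[2023-12-05T09:20:00.000+0000][2487][gc] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 2300M->2000M(4096M) 12.789ms",
]

floors = heap_after_gc(sample)
print(floors)
```

If the post-GC values keep climbing across many pauses, the live set itself is growing, which would point at retained objects rather than at the collector settings.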

java options

java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -Xms4g -Xmx4g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-2487676337507585022 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Xms4g -Xmx4g -XX:MaxDirectMemorySize=2147483648 -XX:G1HeapRegionSize=4m -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet

Additional info
- Shards: total 1540, primary 711
- Total data (including replicas): 540GiB
- Some indices are closed for operational purposes
- Snapshotting is running in background
- _nodes/hot_threads => shows nothing on master
- _cat/tasks => shows nothing on master
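
Besides hot threads and tasks, the node stats API reports heap usage and GC counters, which can be sampled over time on the elected master. A minimal sketch follows, assuming the cluster is reachable without authentication at a hypothetical `http://localhost:9200`. The function names are made up for illustration, and the field names are the ones the 7.x `_nodes/stats/jvm` response normally contains.

```python
import json
import urllib.request


def master_heap_summary(stats):
    """Extract heap and old-generation GC counters from a node stats response body."""
    summaries = []
    for node in stats["nodes"].values():
        jvm = node["jvm"]
        old = jvm["gc"]["collectors"]["old"]
        summaries.append({
            "name": node["name"],
            "heap_used_percent": jvm["mem"]["heap_used_percent"],
            "old_gc_count": old["collection_count"],
            "old_gc_millis": old["collection_time_in_millis"],
        })
    return summaries


def fetch_master_stats(base_url="http://localhost:9200"):
    """Fetch JVM stats for the elected master only (base_url is an assumption)."""
    url = f"{base_url}/_nodes/_master/stats/jvm"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

Calling `master_heap_summary(fetch_master_stats())` every few minutes and comparing `heap_used_percent` against `old_gc_count` would show whether old-generation collections run but free nothing.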

I took a heap dump and analyzed it with Eclipse MAT (Memory Analyzer Tool). This is what I found:

Image1. High retained heap by some classes

The MAT report says that a ConcurrentHashMap$Node[] referenced by org.elasticsearch.gateway.GatewayAllocator occupies 95.99% of the bytes, which looks like all the data that cannot be garbage collected.

One instance of java.util.concurrent.ConcurrentHashMap$Node[] loaded by <system class loader> occupies 2,133,307,840 (95.99%) bytes. The instance is referenced by org.elasticsearch.gateway.GatewayAllocator @ 0x7031af4f0 , loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x700000000. The memory is accumulated in one instance of java.util.concurrent.ConcurrentHashMap$Node[], loaded by <system class loader>, which occupies 2,133,307,840 (95.99%) bytes.

Keywords

java.util.concurrent.ConcurrentHashMap$Node[]
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x700000000
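
To put the MAT figures in proportion to the configured heap, the arithmetic can be sketched as below. The byte count and percentage are taken from the report above and the 4 GiB limit from `-Xmx4g`. Treating the 95.99% as a share of the total heap that MAT analyzed is an assumption.

```python
GIB = 1024 ** 3

retained_bytes = 2_133_307_840    # size of the ConcurrentHashMap$Node[] per the MAT report
retained_share = 0.9599           # 95.99% per the MAT report
configured_heap_bytes = 4 * GIB   # -Xmx4g from jvm.options

# Assumption: the percentage is relative to the total heap MAT analyzed.
implied_total_bytes = retained_bytes / retained_share

print(f"retained by the map : {retained_bytes / GIB:.2f} GiB")
print(f"implied dump total  : {implied_total_bytes / GIB:.2f} GiB")
print(f"share of -Xmx       : {retained_bytes / configured_heap_bytes:.0%}")
```

Under that assumption, the single map holds roughly 2 GiB, about half of the 4 GiB heap, and nearly everything else in the dump is small by comparison.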

Is there anything I can do? Thanks in advance.

Christian_Dahlqvist · December 5, 2023, 10:26am

Why do you have these settings on your master nodes? You should send indexing and search requests directly to the data nodes and make sure the dedicated master nodes do not serve any user requests so they can focus on managing the cluster.

equisde · December 5, 2023, 10:46am

Thanks for your suggestion. Those settings are there to keep the deployment simple, but I see that they could be removed with a simple if statement in our Ansible playbook.

However, I don't think they are the cause of the heap filling up, because the master nodes have been removed from the load balancer and therefore don't receive any requests.

