The elected (active) master node in my Elasticsearch cluster cannot reclaim heap memory through garbage collection. All the other standby master-eligible nodes are fine.
Symptoms: heap usage keeps growing and is never reclaimed by GC
ES Version: 7.16.3
Nodes: 3 master nodes, 12 data nodes
Things I tried:
- Switched from G1GC to CMS, then rolled back
- Failed the master role over to another node
- Updated the kernel
- Reinstalled the OS
- Took a heap dump (see the analysis below; the dump command is sketched after this list)
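For reference, the heap dump of the elected master was taken with jmap, roughly along these lines (the PID and output path here are placeholders):

jmap -dump:live,format=b,file=/tmp/es-master.hprof <elasticsearch-pid>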
Node specs
CPU: 4 vcores
OS: CentOS 7 (kernel: 3.10.0-1160.49.1.el7.x86_64)
System memory: 8 GB (ES heap: 4 GB)
JVM:
openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment Temurin-11.0.15+10 (build 11.0.15+10)
OpenJDK 64-Bit Server VM Temurin-11.0.15+10 (build 11.0.15+10, mixed mode)
Settings
elasticsearch.yml
cluster.name: cluster-name
node.name: cluster-name-master03
node.master: true
node.data: false
network.bind_host: 0.0.0.0
network.publish_host: _eth0_
cluster.initial_master_nodes: ["ip_of_master1:9300","ip_of_master2:9300","ip_of_master3:9300"]
discovery.seed_hosts: ["ip_of_master1:9300","ip_of_master2:9300","ip_of_master3:9300"]
http.cors.enabled: true
http.cors.allow-origin: "*"
bootstrap.system_call_filter: false
thread_pool.write.queue_size: 10000
thread_pool.search.queue_size: 10000
thread_pool.search.max_queue_size: 10000
thread_pool.search.min_queue_size: 10000
cluster.routing.allocation.awareness.attributes: rack_id
jvm.options
-Xms4g
-Xmx4g
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-Djava.io.tmpdir=${ES_TMPDIR}
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
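With this logging configuration the growth is visible in the post-GC heap figures; a quick way to eyeball it (path as configured above; the figures in the comment are illustrative):

grep -h 'Pause Young' /var/log/elasticsearch/gc.log* | tail -n 5
# each pause line ends with before->after(total), e.g. 3900M->3896M(4096M);
# on the elected master the 'after' value keeps climbing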
Full java command line
java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -Xms4g -Xmx4g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-2487676337507585022 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Xms4g -Xmx4g -XX:MaxDirectMemorySize=2147483648 -XX:G1HeapRegionSize=4m -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
Additional info
- Shards: total 1540, primary 711
- Total data (including replicas): 540GiB
- Some indices are closed for operational purposes
- A snapshot is running in the background
- _nodes/hot_threads => shows nothing on the master
- _cat/tasks => shows nothing on the master (the exact calls are sketched after this list)
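For completeness, the two checks above were along these lines (host and port are placeholders; the node name matches elasticsearch.yml):

curl -s 'http://localhost:9200/_nodes/cluster-name-master03/hot_threads'
curl -s 'http://localhost:9200/_cat/tasks?v'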
I decided to dump the heap, and after analyzing it with Eclipse MAT I found this:
Image 1. High retained heap held by a few classes
The MAT leak-suspects report says that a ConcurrentHashMap$Node[] referenced by org.elasticsearch.gateway.GatewayAllocator occupies 95.99% of the heap bytes, which looks like all the data that cannot be garbage collected:
One instance of java.util.concurrent.ConcurrentHashMap$Node[] loaded by <system class loader> occupies 2,133,307,840 (95.99%) bytes. The instance is referenced by org.elasticsearch.gateway.GatewayAllocator @ 0x7031af4f0 , loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x700000000. The memory is accumulated in one instance of java.util.concurrent.ConcurrentHashMap$Node[], loaded by <system class loader>, which occupies 2,133,307,840 (95.99%) bytes.
Keywords
java.util.concurrent.ConcurrentHashMap$Node[]
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x700000000
Is there anything I can do? Thanks in advance.