Hello again, I'm opening a new topic about our problem.
We have an ES cluster with 3 nodes, each with 128 GB RAM and a 31 GB heap. Here is our jvm.options:
-Xms31g
-Xmx31g
## GC configuration
#-XX:+UseConcMarkSweepGC
#-XX:CMSInitiatingOccupancyFraction=90
#-XX:+UseCMSInitiatingOccupancyOnly
## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
10-13:-XX:-UseConcMarkSweepGC
10-13:-XX:-UseCMSInitiatingOccupancyOnly
11-:-XX:+UseG1GC
11-:-XX:G1ReservePercent=25
11-:-XX:InitiatingHeapOccupancyPercent=30
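For completeness, the collector that is actually active can be checked like this (a minimal sketch, assuming the default HTTP port on localhost; <es-pid> is a placeholder for the Elasticsearch process id):
# Which garbage collectors each node reports via the node info API
curl -s 'localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc_collectors&pretty'
# Or ask the JVM on the node directly for the options it was started with
jcmd <es-pid> VM.command_line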
Sometimes we get warnings about GC overhead:
[2020-12-25T10:28:29,054][DEBUG][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][young][4544][185] duration [624ms], collections [1]/[1s], total [624ms]/[35.3s], memory [22.9gb]->[7.9gb]/[31gb], all_pools {[young] [15gb]->[48mb]/[0b]}{[old] [7.7gb]->[7.7gb]/[31gb]}{[survivor] [122.4mb]->[160.5mb]/[0b]}
[2020-12-25T10:28:29,055][WARN ][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][4544] overhead, spent [624ms] collecting in the last [1s]
[2020-12-25T10:28:34,442][INFO ][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][4549] overhead, spent [325ms] collecting in the last [1.2s]
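For context, the cumulative young/old collection counts and times per node can be pulled like this (a minimal sketch, assuming the default HTTP port on localhost; the filter_path is only there to trim the output):
# Cumulative GC collection counts and times per node (young vs old collectors)
curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc.collectors&pretty'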
Around the same time we also get leader/follower check errors. This one is from h1-es01:
[2020-12-25T10:29:26,373][DEBUG][o.e.c.c.LeaderChecker ] [h1-es01] 1 consecutive failures (limit [cluster.fault_detection.leader_check.retry_count] is 3) with leader [{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{ciFEpbFAQyyUlwd-Lv4Kxw}{h1-es02ip}{h1-es02ip:9300}{dimr}]
org.elasticsearch.transport.RemoteTransportException: [h1-es02][h1-es02ip:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr}] has been removed from the cluster
And this is from h1-es02 at the same time:
[2020-12-25T10:29:24,384][DEBUG][o.e.c.c.FollowersChecker ] [h1-es02] FollowerChecker{discoveryNode={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr}, failureCountSinceLastSuccess=3, [cluster.fault_detection.follower_check.retry_count]=3} failed too many times
org.elasticsearch.transport.ReceiveTimeoutTransportException: [h1-es01][h1-es01ip:9300][internal:coordination/fault_detection/follower_check] request_id [329372] timed out after [10006ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1074) [elasticsearch-7.9.1.jar:7.9.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
[2020-12-25T10:29:24,385][DEBUG][o.e.c.c.FollowersChecker ] [h1-es02] FollowerChecker{discoveryNode={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr}, failureCountSinceLastSuccess=3, [cluster.fault_detection.follower_check.retry_count]=3} marking node as faulty
[2020-12-25T10:29:24,385][DEBUG][o.e.c.s.MasterService ] [h1-es02] executing cluster state update for [node-left[{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr} reason: followers check retry count exceeded]]
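The settings referenced in these log lines are the fault detection settings; in 7.9 they default to a 1s interval, a 10s timeout and 3 retries for both the leader and follower checks, which matches the [10006ms] timeout and the retry count of 3 above. They are static settings configured in elasticsearch.yml; the effective values can be double-checked like this (a sketch, assuming the default HTTP port on localhost):
# Effective fault detection settings (include_defaults also returns the built-in defaults)
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep cluster.fault_detection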
The heap graph collected by Zabbix for h1-es01 looks like this:
In region 1 we are not writing to the cluster, and in region 2 we are writing to the cluster (the regions may not be divided perfectly in the picture).
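To line the Zabbix graph up with what the JVM itself reports, heap usage can also be sampled directly from the cluster, for example (a rough sketch, assuming the default HTTP port on localhost; the 60 second interval is arbitrary):
# Print heap usage of every node once a minute to compare with regions 1 and 2
while true; do
  date
  curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'
  sleep 60
done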
Please help us understand what is going on. Or is our cluster simply too weak for this workload?
This can happen on any of the nodes, usually 1-3 times per day.