Hi, we have an issue when performing an upgrade on one of our clusters. We are upgrading Elasticsearch from version 6.8.1 to 7.7.0 (we already have other clusters on this version).
The cluster has 3 master nodes, 5 data nodes and 2 ingest (client) nodes. After we upgraded (rolling upgrade) 3 of the data nodes, the cluster became unstable: the upgraded nodes started to leave and reconnect, which results in the cluster health flapping between YELLOW and GREEN. From the logs on the elected master node, we can see that the reason is a CircuitBreakingException:
[2021-08-18T14:55:03,155][INFO ][o.e.c.r.a.AllocationService] [master-es11-master2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[fb_profile_v1_2020-11][0]] ...]).
[2021-08-18T14:56:18,004][WARN ][o.e.c.r.a.AllocationService] [master-es11-master2] failing shard [failed shard, shard [fb_profile_v1_2020-11][0], node[15PSvp8ZTeieJrlSMz45zQ], [R], s[STARTED], a[id=HeYOA60wRkyfPc1YQH_qKw], message [failed to perform indices:data/write/bulk[s] on replica [fb_profile_v1_2020-11][0], node[15PSvp8ZTeieJrlSMz45zQ], [R], s[STARTED], a[id=HeYOA60wRkyfPc1YQH_qKw]], failure [RemoteTransportException[[es11-live6][10.64.10.249:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [15699093646/14.6gb], which is larger than the limit of [15627557273/14.5gb], real usage: [15699091456/14.6gb], new bytes reserved: [2190/2.1kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=2190/2.1kb, accounting=26768996/25.5mb]]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [es11-live6][10.64.10.249:9300][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [15699093646/14.6gb], which is larger than the limit of [15627557273/14.5gb], real usage: [15699091456/14.6gb], new bytes reserved: [2190/2.1kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=2190/2.1kb, accounting=26768996/25.5mb]
and
[2021-08-18T14:56:18,059][WARN ][o.e.g.G.InternalReplicaShardAllocator] [master-es11-master2] [fb_profile_v1_2020-11][0]: failed to list shard for shard_store on node [15PSvp8ZTeieJrlSMz45zQ]
org.elasticsearch.action.FailedNodeException: Failed node [15PSvp8ZTeieJrlSMz45zQ]
I think the second one is responsible for the nodes leaving the cluster, because the master node cannot contact them while the circuit breaker is tripping on the data node.
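If I read the numbers correctly, the 14.5gb limit in the first trace is roughly 95% of our 15687M heap, which I believe is the default parent breaker limit in 7.x when it tracks real heap usage, so the breaker seems to trip on total heap usage rather than on the size of the request itself. One way to confirm this should be the breaker section of the standard nodes stats API (the parent entry shows the limit, the estimated usage and the tripped count):
GET _nodes/stats/breaker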
In the /_nodes API response we see:
"type": "failed_node_exception",
"reason": "Failed node [3g9aBFGkTQWpXduA02aOEQ]",
"node_id": "3g9aBFGkTQWpXduA02aOEQ",
"caused_by": {
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<transport_request>] would be [16540220516/15.4gb], which is larger than the limit of [15794910003/14.7gb], real usage: [16540218400/15.4gb], new bytes reserved: [2116/2kb], usages [request=0/0b, fielddata=5446/5.3kb, in_flight_requests=2116/2kb, accounting=4157692/3.9mb]",
"bytes_wanted": 16540220516,
"bytes_limit": 15794910003
When I check the heap size it looks stable, most of the time below 70%. This behavior still occurs after the cluster is completely upgraded: it looks stable for a couple of hours, but after that we see that shards become unassigned and one of the nodes goes missing (the elasticsearch process is still running on the node, but it does not respond to the master); after a few minutes the node rejoins the cluster.
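For reference, this is roughly how we watch the heap (plain _cat API; heap.percent and heap.max are standard columns):
GET _cat/nodes?v&h=name,heap.percent,heap.max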
Thread pool output after the node rejoins the cluster:
es11-live1 analyze 0 0 0
es11-live1 ccr 0 0 0
es11-live1 fetch_shard_started 0 0 0
es11-live1 fetch_shard_store 0 0 0
es11-live1 flush 0 0 0
es11-live1 force_merge 0 0 0
es11-live1 generic 1 0 0
es11-live1 get 0 0 0
es11-live1 listener 0 0 0
es11-live1 management 1 0 0
es11-live1 ml_datafeed 0 0 0
es11-live1 ml_job_comms 0 0 0
es11-live1 ml_utility 0 0 0
es11-live1 refresh 0 0 0
es11-live1 rollup_indexing 0 0 0
es11-live1 search 1 0 0
es11-live1 search_throttled 0 0 0
es11-live1 security-token-key 0 0 0
es11-live1 snapshot 0 0 0
es11-live1 transform_indexing 0 0 0
es11-live1 warmer 0 0 0
es11-live1 watcher 0 0 0
es11-live1 write 0 0 0
We have no idea what's causing this and would appreciate any help (we can provide more info if needed). I know that a CircuitBreakingException can happen sometimes, but why does it reject the master's health check and disconnect the node from the cluster?
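As a temporary workaround we are considering turning off the real-memory accounting of the parent breaker in elasticsearch.yml (assuming we understand the setting correctly, this goes back to the pre-7.x accounting with a 70% default limit), but we would prefer to understand the root cause first:
indices.breaker.total.use_real_memory: false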
Data node spec: 8 CPUs, 30 GB RAM, JVM heap size set to 15687M.
jvm.options:
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
-Djava.io.tmpdir=${ES_TMPDIR}
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
We use only minimal configuration changes (elasticsearch.yml):
cluster.name: es11
gateway.expected_nodes: 3
gateway.recover_after_time: 5m
gateway.recover_after_nodes: 2
gateway.recover_after_master_nodes: 3
discovery.seed_hosts: [ "es11-master1", "es11-master2", "es11-master3" ]
cluster.initial_master_nodes: [ "es11-master1", "es11-master2", "es11-master3" ]
node.name: es11-live1
node.master: false
node.data: true
node.attr.tag: live
path.data: /srv/elasticsearch
path.logs: /var/log/elasticsearch
action.destructive_requires_name: true
bootstrap.memory_lock: false
reindex.remote.whitelist: 0.0.0.0:9200
transport.compress: true
network.host: 10.64.10.135
http.host: 0.0.0.0
http.port: 9200
http.publish_host: 10.64.10.135
indices.recovery.max_bytes_per_sec: 200mb
indices.memory.index_buffer_size: 50%
cluster.routing.allocation.node_initial_primaries_recoveries: 20
cluster.routing.allocation.disk.threshold_enabled: false
We are also looking at "indices.memory.index_buffer_size: 50%", an option we tuned many versions ago (around v2, maybe). Could this be the reason? We did not touch it for the new version, and from the docs we see the default is 10%.
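If it could contribute, our plan would simply be to drop it back to the documented default (or remove the line entirely), e.g.:
indices.memory.index_buffer_size: 10%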
Thanks
Jiri