Hello
I'm running a 7.5.2 cluster. Today, while snapshots were being taken (although I'm not sure snapshotting is what triggered the issue), one of the nodes logged a CircuitBreakingException and was marked as failed for several minutes (the Elasticsearch process itself never stopped running), which turned the cluster yellow and broke the snapshot.
The logs:
[2020-02-25T00:08:51,031][WARN ][o.e.c.r.a.AllocationService] [elastic-logs3-p-master-2] failing shard [failed shard, shard [index-name-2020-02-25][1], node[HccOLzpQS2CKfsTNkZ9s0Q], [R], s[STARTED], a[id=nHmzdFniRDW-_5RCZtnFlg], message [failed to perform indices:data/write/bulk[s] on replica [index-name-2020-02-25][1], node[HccOLzpQS2CKfsTNkZ9s0Q], [R], s[STARTED], a[id=nHmzdFniRDW-_5RCZtnFlg]], failure [RemoteTransportException[[elastic-logs3-p-hotdata-1][172.30.1.168:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [30613028660/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30612997120/28.5gb], new bytes reserved: [31540/30.8kb], usages [request=0/0b, fielddata=53395/52.1kb, in_flight_requests=31540/30.8kb, accounting=542812105/517.6mb]]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [elastic-logs3-p-hotdata-1][172.30.1.168:9300][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [30613028660/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30612997120/28.5gb], new bytes reserved: [31540/30.8kb], usages [request=0/0b, fielddata=53395/52.1kb, in_flight_requests=31540/30.8kb, accounting=542812105/517.6mb]
(...)
[2020-02-25T00:08:51,053][WARN ][o.e.g.G.InternalReplicaShardAllocator] [elastic-logs3-p-master-2] [index-name-2020-02-25][1]: failed to list shard for shard_store on node [HccOLzpQS2CKfsTNkZ9s0Q]
org.elasticsearch.action.FailedNodeException: Failed node [HccOLzpQS2CKfsTNkZ9s0Q]
(...)
Caused by: org.elasticsearch.transport.RemoteTransportException: [elastic-logs3-p-hotdata-1][172.30.1.168:9300][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [30612997414/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30612997120/28.5gb], new bytes reserved: [294/294b], usages [request=0/0b, fielddata=53395/52.1kb, in_flight_requests=294/294b, accounting=542812105/517.6mb]
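For reference, the numbers in the exception line up with the 7.x real-memory parent breaker default of 95% of the heap. A quick sanity check in plain Python, with the values copied from the log above (the variable names are mine):

```python
# Sanity check: in 7.x, with indices.breaker.total.use_real_memory enabled
# (the default), the parent breaker limit defaults to 95% of the heap.
heap_bytes = 30 * 1024**3                     # -Xmx30g
parent_limit = int(heap_bytes * 0.95)
print(parent_limit)                           # 30601641984 -> the "28.5gb" limit in the log

real_usage = 30612997120                      # "real usage" from the exception
new_bytes = 31540                             # "new bytes reserved" (the 30.8kb bulk request)
print(real_usage + new_bytes)                 # 30613028660 -> "would be [30613028660/28.5gb]"
print(real_usage + new_bytes > parent_limit)  # True -> breaker trips
```

So the heap was already almost entirely full, and the 30.8kb transport request was just the straw that tripped the breaker.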
The circuit breakers' values:
"breakers" : {
"request" : {
"limit_size_in_bytes" : 19327352832,
"limit_size" : "18gb",
"estimated_size_in_bytes" : 0,
"estimated_size" : "0b",
"overhead" : 1.0,
"tripped" : 0
},
"fielddata" : {
"limit_size_in_bytes" : 12884901888,
"limit_size" : "12gb",
"estimated_size_in_bytes" : 34224,
"estimated_size" : "33.4kb",
"overhead" : 1.03,
"tripped" : 0
},
"in_flight_requests" : {
"limit_size_in_bytes" : 32212254720,
"limit_size" : "30gb",
"estimated_size_in_bytes" : 4139189,
"estimated_size" : "3.9mb",
"overhead" : 2.0,
"tripped" : 0
},
"accounting" : {
"limit_size_in_bytes" : 32212254720,
"limit_size" : "30gb",
"estimated_size_in_bytes" : 326626657,
"estimated_size" : "311.4mb",
"overhead" : 1.0,
"tripped" : 0
},
"parent" : {
"limit_size_in_bytes" : 30601641984,
"limit_size" : "28.5gb",
"estimated_size_in_bytes" : 20485356536,
"estimated_size" : "19gb",
"overhead" : 1.0,
"tripped" : 2454
}
},
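(For completeness, the stats above come from the nodes stats API. A minimal sketch of how to pull just the breaker section, assuming the cluster is reachable on localhost:9200 without authentication:)

```python
import requests  # third-party; pip install requests

# GET _nodes/stats/breaker returns only the circuit-breaker section per node.
resp = requests.get("http://localhost:9200/_nodes/stats/breaker")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    parent = node["breakers"]["parent"]
    print(node["name"], parent["estimated_size"], "of", parent["limit_size"],
          "- tripped", parent["tripped"], "times")
```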
The heap-usage spike is visible in the monitoring graph at around 01:09 browser time (00:09 VM time).
It is my understanding that CircuitBreakingException exists precisely to prevent heap-related failures. So if the breaker tripped to protect the node from failing, why did the node become unavailable anyway?
The other question is: what steps should I take to prevent the issue from occurring again? Increasing the heap would seem obvious, but less so considering it's already set to 30GB on a VM with 64GB of RAM (an Elasticsearch-dedicated VM, with no other load).
Thanks!