Hi,
I have a 2-node Elasticsearch cluster; each node has 6GB RAM and 2 CPUs, and I've assigned a max heap of 2GB to Elasticsearch. There's only a tiny amount of data in there, as I'm still trying to get it set up properly.
At least once a day, one of the indices gets marked yellow, and I can see that at least one shard isn't being assigned because a circuit breaker has tripped. I'll post just the relevant part of the output of /_cluster/allocation/explain?pretty, as it seems to have all the info:
"index" : "logstash-2019.08.16",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2019-08-16T00:11:19.260Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [1S5LcKcjT2mCPSo31Onlfg]: failed recovery, failure RecoveryFailedException[[logstash-2019.08.16][0]: Recovery failed from {infra-elk01-inst-a}{BBu_2gYXQn2KXe8izUYs1Q}{OJMhGvE-ThCtBpqZ0FVGjQ}{<node01>}{<node01>:9300}{dim}{ml.machine_memory=6085251072, ml.max_open_jobs=20, xpack.installed=true} into {infra-elk02-inst-a}{1S5LcKcjT2mCPSo31Onlfg}{bpmtxcHPS96u0HxqU4qiSw}{<node02>}{<node02>:9300}{dim}{ml.machine_memory=6085246976, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[infra-elk01-inst-a][<:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[infra-elk02-inst-a][<node02>:9300][internal:index/shard/recovery/prepare_translog]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [2101155290/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2101154736/1.9gb], new bytes reserved: [554/554b], usages [request=0/0b, fielddata=1928/1.8kb, in_flight_requests=554/554b, accounting=411900/402.2kb]]; ",
"last_allocation_status" : "no_attempt"
I've also noticed that heap usage climbs so high that even checking the node status trips a circuit breaker:
root@infra-elk01.sac1:~# curl -u jparker -s 'http://localhost:9200/_cat/nodes?v'
Enter host password for user 'jparker':
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
<node01-ip> 87 95 13 0.16 0.23 0.18 dim * infra-elk01-inst-a
<node02-ip> 13 92 3 0.14 0.15 0.20 dim - infra-elk02-inst-a
root@infra-elk01.sac1:~# curl -u jparker -s 'http://localhost:9200/_cat/nodes?v'
Enter host password for user 'jparker':
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [2092117976/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2092117976/1.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1804/1.7kb, in_flight_requests=0/0b, accounting=421354/411.4kb]","bytes_wanted":2092117976,"bytes_limit":2040109465,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [2092117976/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2092117976/1.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1804/1.7kb, in_flight_requests=0/0b, accounting=421354/411.4kb]","bytes_wanted":2092117976,"bytes_limit":2040109465,"durability":"PERMANENT"},"status":429}
If I force reassignment with /_cluster/reroute?retry_failed=true, everything goes green again.
There's only a very tiny amount of data on this cluster (currently just 5 indices, each only a few MB in size), so I don't understand how it can use this much heap.
I can add more heap if that will resolve the issue, but I'm concerned the same thing will keep happening.
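For reference, if I do bump the heap, my plan is to set it in jvm.options (assuming the default package layout at /etc/elasticsearch/jvm.options), keeping min and max equal as recommended:

```
# /etc/elasticsearch/jvm.options — raise min and max heap together
-Xms4g
-Xmx4g
```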
Here's the typical heap usage when it's running normally:
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
<node01-ip> 17 95 5 0.27 0.15 0.14 dim * infra-elk01-inst-a
<node02-ip> 14 92 5 0.62 0.20 0.16 dim - infra-elk02-inst-a
Any ideas?
Edit: I managed to get monitoring of the cluster working and noticed this (on elk01 only): [monitoring screenshot]
Thanks,
Jeremy