We're using Elasticsearch 7.2 in production, and lately we've been seeing our cluster go yellow quite often even though none of the nodes has left the cluster.
Whenever the cluster has gone yellow, we've seen a sudden drop in indices.store.size_in_bytes on the problematic node. So far it has always been a single node that misbehaved. At the same time, a couple of requests were rejected with 429 (parent circuit breaker trips). I'm not sure whether a destabilized cluster is causing the circuit breaker to trip, or whether the tripping is what makes the node inaccessible. (Note that it doesn't look like all the data is being deleted; the store size just drops by 300GB or so.)
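One way to tell which of the two events comes first would be to poll the node stats API and compare consecutive samples. The following is only a minimal sketch, not something we run today: the base URL, the 100 GiB drop threshold, and the helper names (fetch_node_stats, snapshot, compare) are all assumptions made for illustration. It reads indices.store.size_in_bytes and the parent breaker's tripped counter from GET /_nodes/stats/indices,breaker.

```python
import json
import urllib.request


def fetch_node_stats(base_url="http://localhost:9200"):
    """Fetch store and breaker stats for all nodes (base_url is an assumption)."""
    url = f"{base_url}/_nodes/stats/indices,breaker"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def snapshot(stats):
    """Reduce a node-stats response to {node_name: (store_bytes, parent_trips)}."""
    result = {}
    for node in stats["nodes"].values():
        result[node["name"]] = (
            node["indices"]["store"]["size_in_bytes"],
            node["breakers"]["parent"]["tripped"],
        )
    return result


def compare(prev, curr, drop_bytes=100 * 1024**3):
    """Report nodes whose store shrank by at least drop_bytes (hypothetical
    threshold) or whose parent breaker tripped between the two snapshots."""
    findings = []
    for name, (size, trips) in curr.items():
        if name not in prev:
            continue  # node was not present in the previous sample
        prev_size, prev_trips = prev[name]
        shrink = max(prev_size - size, 0)
        new_trips = trips - prev_trips
        if shrink >= drop_bytes or new_trips > 0:
            findings.append(
                {"node": name, "store_drop_bytes": shrink, "new_parent_trips": new_trips}
            )
    return findings
```

Calling compare(snapshot(previous), snapshot(fetch_node_stats())) every few seconds and logging the findings with a timestamp should show whether the breaker trips appear before or after the store size drops, although the polling interval limits how finely the two can be ordered.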
Regarding the cluster setup:
We have an 8-core, 64GB machine with a JVM heap size of 30GB. We have taken care of https://github.com/elastic/elasticsearch/pull/46169 as well. We use AWS NVMe SSDs (which means we lose data if the instance is stopped, but in this case the instance was up and the node never left).
I really doubt that our ingestion rate is the problem (around 8k updates per minute). Last week we ran our indexing job, which was ingesting around 1M per minute, and that didn't destabilize the cluster.
Our use case involves a lot of regular updates and periodic batched deletes.