Cluster goes yellow abruptly

We're using Elasticsearch 7.2 in production, and lately we've been observing our cluster going yellow quite often even though none of the nodes left the cluster!

Whenever the cluster has gone yellow, we've seen a sudden drop in indices.store.size_in_bytes on the problematic node. So far it has always been a single node that behaved badly. At the same time, there were a couple of 429 rejections (parent circuit breaker trips). I'm not sure whether the destabilized cluster is the cause of the circuit breaker tripping, or whether the circuit breaker tripping is the cause of the node becoming inaccessible (note that it doesn't look like all the data is being deleted; it just drops by 300 GB or so).
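
To narrow down which comes first, something like the following could be left running while the problem recurs. It's a minimal sketch (host and poll interval are assumptions, adjust for your cluster) that polls the stock cluster health and breaker stats APIs and prints the cluster status next to each node's parent breaker trip counter:

```python
# Minimal sketch: correlate cluster status with parent circuit-breaker trips.
#   GET _cluster/health        -> cluster status (green/yellow/red)
#   GET _nodes/stats/breaker   -> per-node breaker trip counters
import json
import time
import urllib.request

ES = "http://localhost:9200"  # assumption: adjust to your cluster address

def get(path):
    with urllib.request.urlopen(ES + path) as resp:
        return json.load(resp)

while True:
    health = get("/_cluster/health")
    breakers = get("/_nodes/stats/breaker")
    trips = {
        node["name"]: node["breakers"]["parent"]["tripped"]
        for node in breakers["nodes"].values()
    }
    print(time.strftime("%H:%M:%S"), health["status"], trips)
    time.sleep(30)  # assumption: 30s polling is frequent enough here
```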

Regarding the cluster setup:
We have an 8-core, 64 GB machine, and the JVM heap size is 30 GB. We have taken care of https://github.com/elastic/elasticsearch/pull/46169 as well. We make use of AWS NVMe SSDs (which means we lose the data if the instance is stopped, but in this case the instance was up and the node never left the cluster).
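
For reference, a quick way to confirm per node that the heap really is 30 GB and that compressed oops are still in use is the _nodes/jvm API; a small sketch (host is an assumption):

```python
# Sketch: confirm heap size and compressed-oops status per node via GET _nodes/jvm.
import json
import urllib.request

ES = "http://localhost:9200"  # assumption: adjust to your cluster address

with urllib.request.urlopen(ES + "/_nodes/jvm") as resp:
    nodes = json.load(resp)["nodes"]

for node in nodes.values():
    jvm = node["jvm"]
    heap_gb = jvm["mem"]["heap_max_in_bytes"] / 1024 ** 3
    oops = jvm.get("using_compressed_ordinary_object_pointers", "unknown")
    print(f'{node["name"]}: heap_max={heap_gb:.1f} GB, compressed oops={oops}')
```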

I really doubt that our ingestion rate is the problem (around 8k updates per minute). Last week we ran our indexing job, which was ingesting around 1M per minute, and that didn't destabilize the cluster.
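
For what it's worth, the actual rate can be measured rather than estimated. A rough sketch (host and sampling interval are assumptions) that samples the cluster-wide indexing counters twice and diffs them:

```python
# Sketch: measure index/delete ops per minute from GET _stats/indexing.
import json
import time
import urllib.request

ES = "http://localhost:9200"  # assumption: adjust to your cluster address
INTERVAL = 60                 # assumption: sample one minute apart

def indexing_totals():
    with urllib.request.urlopen(ES + "/_stats/indexing") as resp:
        stats = json.load(resp)["_all"]["total"]["indexing"]
    return stats["index_total"], stats["delete_total"]

idx0, del0 = indexing_totals()
time.sleep(INTERVAL)
idx1, del1 = indexing_totals()
print(f"index ops/min:  {(idx1 - idx0) * 60 / INTERVAL:.0f}")
print(f"delete ops/min: {(del1 - del0) * 60 / INTERVAL:.0f}")
```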

Our use case involves a lot of regular updates and periodic batched deletes.
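
For illustration only, a batched delete of the kind described could look like the sketch below; the index name, timestamp field, and retention window are placeholders rather than our real values, and _delete_by_query is just one way to do it:

```python
# Hedged sketch: a periodic batched delete via the standard _delete_by_query API.
import json
import urllib.request

ES = "http://localhost:9200"  # assumption: adjust to your cluster address
INDEX = "my-index"            # hypothetical index name
query = {"query": {"range": {"@timestamp": {"lt": "now-30d"}}}}  # hypothetical field/window

req = urllib.request.Request(
    f"{ES}/{INDEX}/_delete_by_query?conflicts=proceed",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```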

What do the logs from the master show around that time?

@warkolm I think it's a duplicate of "Shards getting marked as stale frequently causing cluster to go yellow".

I have been able to correlate it with the times when we are indexing/updating huge documents, around 5-10 MB each. In the GC logs I've been seeing humongous allocations.
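
To see how closely these line up with the big documents, a crude sketch that buckets GC-log lines mentioning "humongous" by minute (the log path is a placeholder, and it assumes lines start with a bracketed ISO timestamp, as with the default unified-logging decorators):

```python
# Sketch: count GC-log lines mentioning "humongous" per minute.
import re
from collections import Counter

GC_LOG = "/var/log/elasticsearch/gc.log"  # hypothetical path

per_minute = Counter()
with open(GC_LOG) as f:
    for line in f:
        if "humongous" in line.lower():
            # assumes a prefix like [2019-10-01T12:34:56.789+0000]
            m = re.match(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})", line)
            if m:
                per_minute[m.group(1)] += 1

for minute, count in sorted(per_minute.items()):
    print(minute, count)
```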

https://github.com/elastic/elasticsearch/pull/46169 is taken care of, but the IHOP is still adaptive. So isn't it possible that InitiatingHeapOccupancyPercent may grow back up to 70% and we'll face the same issue again?
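
One way to check whether adaptive IHOP is actually still in play is to list the G1/IHOP-related JVM arguments each node was started with (a sketch, host assumed); if -XX:-G1UseAdaptiveIHOP isn't among them, the configured InitiatingHeapOccupancyPercent is only the initial threshold and the collector can move it over time:

```python
# Sketch: list G1/IHOP-related JVM arguments per node via GET _nodes/jvm.
import json
import urllib.request

ES = "http://localhost:9200"  # assumption: adjust to your cluster address

with urllib.request.urlopen(ES + "/_nodes/jvm") as resp:
    nodes = json.load(resp)["nodes"]

for node in nodes.values():
    args = node["jvm"]["input_arguments"]
    relevant = [a for a in args if "G1" in a or "Occupancy" in a]
    adaptive_disabled = "-XX:-G1UseAdaptiveIHOP" in args
    print(node["name"], relevant, "adaptive IHOP explicitly disabled:", adaptive_disabled)
```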
