We are running Elasticsearch on Kubernetes via the ECK operator.
Every day we receive at least 4 alerts about the Elasticsearch cluster health going yellow.
Also, we use Argo CD to deploy it, and the health check script here argo-cd/health.lua at master · argoproj/argo-cd · GitHub sets the Argo CD application to degraded as soon as Elasticsearch goes to the yellow state (hence even more alerts).
I suspect this yellow state happens when a new index is created: the replica shards take a little time to be allocated and kaboom.
Is this normal? Maybe we could give the new index a little grace period before it goes to the yellow state?
I'm not sure I agree this is wrong. The green status means that the cluster has no unassigned shards; the yellow status means that the cluster has at least one unassigned shard, which can lead to data loss if a node fails.
Adding a delay before changing the state to yellow would probably create a lot of issues. What if someone removes a node thinking the cluster is green, when it was actually yellow but the state change was delayed, and it leads to data loss? I see no reason to change this behavior.
Also, is the cluster being yellow impacting anything in your Elastic Stack, or just the health check script you use in Argo CD?
If the impact is only in your Argo CD application becoming degraded because of the yellow status, then it is better and easier to fix it in your script: maybe add a delay there before changing it to degraded, or check it more times to confirm that the cluster is still yellow, for example check 3 times with a 5-second interval between them.
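To make the "check it a few times" idea concrete, here is a minimal sketch, assuming Python with the requests library and an Elasticsearch endpoint at http://localhost:9200 (URL, auth and thresholds are placeholders to adapt to your setup), that only treats the cluster as degraded if it stays non-green across several consecutive checks:

```python
import time

import requests

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster URL/auth


def cluster_status() -> str:
    """Return the current cluster health status: "green", "yellow" or "red"."""
    resp = requests.get(f"{ES_URL}/_cluster/health", timeout=10)
    resp.raise_for_status()
    return resp.json()["status"]


def is_degraded(checks: int = 3, interval: int = 5) -> bool:
    """Report degraded only if the cluster is non-green on every check.

    Polls the _cluster/health API `checks` times, `interval` seconds apart,
    so a brief yellow window (e.g. right after an index is created) does not
    raise an alert.
    """
    for _ in range(checks):
        if cluster_status() == "green":
            return False
        time.sleep(interval)
    return True


if __name__ == "__main__":
    print("degraded" if is_degraded() else "ok")
```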
I understand your sentiment as well. It "feels" like a false positive from a maintenance point of view.
It's especially noticeable when you add new data nodes.
Shard rebalancing usually takes hours, so you could have a yellow indicator for hours until the new nodes are fully integrated. Replica creation for new indices waits in the same queue as the other recovery operations.
You could tweak "cluster_concurrent_rebalance" & "node_concurrent_recoveries" to mitigate this, but I believe that with the default values new indices will still show yellow for a prolonged time when new data nodes are introduced.
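For reference, both knobs live under cluster.routing.allocation.* and can be changed at runtime through the cluster settings API. A minimal sketch, again assuming Python with requests against http://localhost:9200, and with purely illustrative values rather than recommendations:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster URL/auth

# Allow more concurrent rebalances cluster-wide and more concurrent
# recoveries per node. The values below are illustrative only; size them
# to what your disks and network can actually absorb.
settings = {
    "persistent": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 4,
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
    }
}

resp = requests.put(f"{ES_URL}/_cluster/settings", json=settings, timeout=10)
resp.raise_for_status()
print(resp.json())
```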
I also understand the reason for showing yellow, because the potential for failure is real.
The best solution, in my opinion, is for "normal" operations to temporarily use different "queues", so the system doesn't end up in such a condition. Just allow replica creation for new indices to exceed the per-node concurrency limit.
A warning loses its purpose if it's part of "normal" operation, and especially in this case it's avoidable. Just allow a fast path for handling index creation.
Yeah. If the yellow state is something that is bugging you, I'd just alert on the red state instead, although you might miss some interesting information in the future.
But I definitely agree that Argo CD should handle that situation, as the fact is that for a few seconds some shards are not allocated, which causes the yellow state.
I remember when I was doing some monitoring with Nagios, we were able to define after how many checks a warning was considered a real alert. I'd probably open a discussion on the Argo CD side.