Spurious yellow spikes in cluster health (because of new indices?)

We are running Elasticsearch on Kubernetes via the ECK operator.

Every day we receive at least 4 alerts about the Elasticsearch cluster health going yellow.

We also deploy it with Argo CD. The health check script (argo-cd/health.lua at master · argoproj/argo-cd · GitHub) marks the Argo CD application as degraded as soon as Elasticsearch goes yellow, which generates even more alerts.

I suspect the yellow state occurs when a new index is created: the replica shards take a little time to be allocated, and kaboom.

Is this normal? Could the new index be given a short grace period before the cluster goes yellow?

This is normal. As soon as you create an index, the cluster health changes to yellow until the replicas are allocated.

If for some reason it takes a little time for your cluster to allocate the replicas, the yellow status will persist during this time.

Thank you.

This produces "wrong" alerts about cluster health.

Since the index is still being created, what about allowing a little time before changing the cluster health color?

There is nothing worse than receiving "normal" alerts :slight_smile:

I'm not sure I agree that this is wrong. Green means that every primary and replica shard is assigned. Yellow means that all primaries are assigned but at least one replica shard is unassigned, which can lead to data loss if a node fails.
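To make the color rules concrete, here is a minimal Python sketch of how the status follows from shard assignment. The `(is_primary, is_assigned)` tuple is an illustrative assumption, not the format Elasticsearch returns, and `cluster_status` is a hypothetical helper.

```python
from typing import Iterable, Tuple

# Each shard copy is modeled as (is_primary, is_assigned).
# This shape is an assumption for illustration only.
Shard = Tuple[bool, bool]


def cluster_status(shards: Iterable[Shard]) -> str:
    """Derive a health color from shard assignment."""
    shards = list(shards)
    # Any unassigned primary means some data is unavailable: red.
    if any(is_primary and not assigned for is_primary, assigned in shards):
        return "red"
    # All primaries assigned, but at least one replica is missing: yellow.
    if any(not assigned for _, assigned in shards):
        return "yellow"
    return "green"


# A freshly created index: 1 primary assigned, 1 replica not yet allocated.
new_index = [(True, True), (False, False)]
print(cluster_status(new_index))  # yellow
```

This is why a new index briefly turns the cluster yellow: its primary is assigned first and the replica follows a moment later.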

Adding a delay before the state changes to yellow would probably cause a lot of issues. What if someone believes the cluster is green and removes a node, when the cluster was in fact yellow but the state change was delayed, and that leads to data loss? I see no reason to change this behavior.

Also, does the yellow state impact anything in your Elastic Stack, or just the health check script you use in Argo CD?

If the impact is only that Argo CD becomes degraded because of the yellow status, then it is better and easier to fix it in your script. You could add a delay before marking it degraded, or check several times to confirm that the cluster is still yellow, for example 3 checks with a 5-second interval between them.
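The "check 3 times with a 5-second interval" idea could look roughly like the sketch below. It is written in Python only to show the logic (Argo CD's own health checks are Lua scripts, so this would live in an external check, not in health.lua). `debounced_status` and `fetch_status` are hypothetical names; in practice `fetch_status` would read the `status` field from `GET _cluster/health`.

```python
import time
from typing import Callable


def debounced_status(
    fetch_status: Callable[[], str],
    checks: int = 3,
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Report yellow only if every one of `checks` consecutive reads is yellow."""
    for attempt in range(checks):
        status = fetch_status()
        if status != "yellow":
            # Green means the blip is over; red is reported immediately.
            return status
        if attempt < checks - 1:
            sleep(interval_seconds)
    return "yellow"


# Simulated reads: the replicas get allocated before the third check.
readings = iter(["yellow", "yellow", "green"])
print(debounced_status(lambda: next(readings), sleep=lambda _: None))  # green
```

Red is deliberately not debounced here, since an unassigned primary is a different class of problem from a replica that is still being allocated.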

I understand your sentiment as well. It "feels" like a false positive from a maintenance point of view.
It's especially noticeable when you add new data nodes.
Shard rebalancing usually takes hours, so you could have a yellow indicator for hours until the new nodes are fully integrated. Replica creation for new indices waits in the action queue just like other operations.
You could tweak "cluster.routing.allocation.cluster_concurrent_rebalance" and "cluster.routing.allocation.node_concurrent_recoveries" to eliminate this. But I believe the default values will cause new indices to show yellow for a prolonged time when new data nodes are introduced.
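For reference, a small Python sketch that builds a request body for `PUT _cluster/settings` with those two settings. The helper name `rebalance_settings_body` is hypothetical and the values are placeholders, not recommendations; suitable numbers depend on your hardware and cluster size.

```python
import json


def rebalance_settings_body(concurrent_rebalance: int, node_recoveries: int) -> str:
    """Build a JSON body for PUT _cluster/settings (values are up to the operator)."""
    body = {
        "persistent": {
            "cluster.routing.allocation.cluster_concurrent_rebalance": concurrent_rebalance,
            "cluster.routing.allocation.node_concurrent_recoveries": node_recoveries,
        }
    }
    return json.dumps(body, indent=2)


# Illustrative values only.
print(rebalance_settings_body(4, 4))
```

Raising these limits trades faster allocation for more recovery traffic on the nodes, so it is worth testing on a non-production cluster first.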

I also understand the reason for showing yellow, because the potential for failure is real.
The best solution, in my opinion, is for "normal" operations to have separate (temporary) "queues" so the system doesn't end up in this condition: just allow replica creation for new indices to exceed the per-node concurrency limit.
A warning loses its purpose if it's part of "normal" operation, especially when, as in this case, it's avoidable. Just give index creation a fast pass.

Yes, we have edited our alert rules to reflect this.

Our biggest problem is with Argo CD:

I cannot configure any kind of time window.

I do not use Argo CD and I'm not sure what this does, so you will need to check with the Argo CD community.

Can you provide more context about this?

At first glance, it treats a yellow cluster and a red cluster as the same, which they aren't.

Yeah. If the yellow state is bugging you, I'd alert only on the red state. You might miss some interesting information in the future, though.

But I definitely agree that Argo CD should handle this situation, since the fact is that for a few seconds some shards are not allocated, which causes the yellow state.

I remember that when I was doing monitoring with Nagios, we were able to define how many checks had to fail before a warning was treated as a real alert. I'd probably open a discussion on the Argo CD side.
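The Nagios-style behavior (in Nagios this is governed by `max_check_attempts`) can be sketched as a counter that a periodic check feeds. This is a simplified Python illustration, not Nagios's actual implementation; `AttemptCounter` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class AttemptCounter:
    """Alert only after `max_attempts` consecutive non-green checks."""

    max_attempts: int = 3
    failures: int = 0

    def observe(self, status: str) -> bool:
        """Record one periodic check; return True while in the alerting state."""
        if status == "green":
            self.failures = 0  # any green check resets the streak
            return False
        self.failures += 1
        return self.failures >= self.max_attempts


counter = AttemptCounter(max_attempts=3)
history = ["yellow", "green", "yellow", "yellow", "yellow"]
print([counter.observe(s) for s in history])
# [False, False, False, False, True]
```

A single yellow blip during index creation never reaches the threshold, while a cluster that stays yellow across several checks still raises an alert.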

