We are running Elasticsearch on Kubernetes via the ECK operator.
Every day we receive at least 4 alerts about the Elasticsearch cluster health going yellow.
Also, we use Argo CD to deploy it, and the health check script here argo-cd/health.lua at master · argoproj/argo-cd · GitHub sets the Argo CD application to degraded as soon as Elasticsearch goes to the yellow state (hence even more alerts).
I suspect this yellow state happens when a new index is created: the replica shards take a little time to be allocated and kaboom.
Is this normal? Maybe we could give the new index a little grace period before it goes to the yellow state?
I'm not sure I agree this is wrong. The green status means that the cluster has no unassigned shards; the yellow status means that the cluster has at least one unassigned shard, which can lead to data loss if a node fails.
Adding a delay before changing the state to yellow would probably create a lot of issues. What if someone removes a node thinking the cluster is green, when it was actually yellow but the state change was delayed, and it leads to data loss? I see no reason to change this behavior.
Also, is the cluster being yellow impacting anything in your Elastic Stack, or just the health check script you use in Argo CD?
If the impact is only in your Argo CD application becoming degraded because of the yellow status, then it is better and easier to fix it in your script: maybe add a delay there before changing it to degraded, or check it more times to confirm that the cluster is still yellow, for example check 3 times with a 5-second interval between them.
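To make the "check it a few times" idea concrete, here is a minimal sketch, assuming Python with the requests library and an Elasticsearch endpoint at http://localhost:9200 (URL, auth and thresholds are placeholders to adapt to your setup), that only treats the cluster as degraded if it stays non-green across several consecutive checks:

```python
import time

import requests

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster URL/auth


def cluster_status() -> str:
    """Return the current cluster health status: "green", "yellow" or "red"."""
    resp = requests.get(f"{ES_URL}/_cluster/health", timeout=10)
    resp.raise_for_status()
    return resp.json()["status"]


def is_degraded(checks: int = 3, interval: int = 5) -> bool:
    """Report degraded only if the cluster is non-green on every check.

    Polls the _cluster/health API `checks` times, `interval` seconds apart,
    so a brief yellow window (e.g. right after an index is created) does not
    raise an alert.
    """
    for _ in range(checks):
        if cluster_status() == "green":
            return False
        time.sleep(interval)
    return True


if __name__ == "__main__":
    print("degraded" if is_degraded() else "ok")
```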
I understand your sentiment as well. It "feels" like a false positive from a maintenance point of view.
It's especially noticeable when you add new data nodes.
Shard rebalancing usually takes hours, so you could have a yellow indicator for hours until the new nodes are fully integrated. Replica creation for new indices waits in the same queue as the other recovery operations.
You could tweak "cluster_concurrent_rebalance" & "node_concurrent_recoveries" to mitigate this, but I believe that with the default values new indices will still show yellow for a prolonged time when new data nodes are introduced.
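For reference, both knobs live under cluster.routing.allocation.* and can be changed at runtime through the cluster settings API. A minimal sketch, again assuming Python with requests against http://localhost:9200, and with purely illustrative values rather than recommendations:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster URL/auth

# Allow more concurrent rebalances cluster-wide and more concurrent
# recoveries per node. The values below are illustrative only; size them
# to what your disks and network can actually absorb.
settings = {
    "persistent": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 4,
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
    }
}

resp = requests.put(f"{ES_URL}/_cluster/settings", json=settings, timeout=10)
resp.raise_for_status()
print(resp.json())
```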
I also understand the reason for showing yellow, because the potential for failure is real.
The best solution, in my opinion, is for "normal" operations to temporarily use different "queues", so the system doesn't end up in such a condition. Just allow replica creation for new indices to exceed the per-node concurrency limit.
A warning loses its purpose if it's part of "normal" operation, and especially in this case it's avoidable. Just allow a fast path for handling index creation.
Yeah. If the yellow state is something that is bugging you, I'd just alert on the red state instead, although you might miss some interesting information in the future.
But I definitely agree that Argo CD should handle that situation, as the fact is that for a few seconds some shards are not allocated, which causes the yellow state.
I remember when I was doing some monitoring with Nagios, we were able to define after how many checks a warning was considered a real alert. I'd probably open a discussion on the Argo CD side.