Elasticsearch cluster health intermittently flaps between 'GREEN' and 'YELLOW'

Elasticsearch version = 5.6.8

We are running a 7-node cluster with zero replicas. The cluster health output looks like this:

{
  "cluster_name": "my_cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 7,
  "number_of_data_nodes": 7,
  "active_primary_shards": 3325,
  "active_shards": 3325,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}
The Elasticsearch cluster status changes from "green" to "yellow" intermittently. The other interesting thing I noticed is that shards are initializing during these intermittent changes, and the initialization correlates with the status changes. Is this due to the cluster running with zero replicas? What could cause this behavior?
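To see which shards are involved when the status changes, one option is to capture the output of `GET _cat/shards?format=json` while the cluster is yellow and pick out the shard copies that are not active. The following is a minimal Python sketch of that filtering step. The helper names (`inactive_shards`, `summarize`) and the sample index names are made up for illustration; only the `index`, `shard`, `prirep` and `state` fields of the cat output are assumed.

```python
from collections import Counter

# Shard states that still serve requests; anything else is treated as inactive.
ACTIVE_STATES = {"STARTED", "RELOCATING"}


def inactive_shards(rows):
    """Return (index, shard, prirep, state) for every shard copy that is not active.

    `rows` is expected to look like the parsed JSON of `_cat/shards?format=json`.
    """
    return [
        (r["index"], r["shard"], r["prirep"], r["state"])
        for r in rows
        if r["state"] not in ACTIVE_STATES
    ]


def summarize(rows):
    """Count inactive shard copies by (prirep, state), e.g. ("p", "INITIALIZING")."""
    return Counter((prirep, state) for _, _, prirep, state in inactive_shards(rows))


# Hypothetical sample rows, shaped like the cat API output.
sample = [
    {"index": "logs-a", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "logs-b", "shard": "0", "prirep": "p", "state": "INITIALIZING"},
    {"index": "logs-b", "shard": "1", "prirep": "r", "state": "UNASSIGNED"},
]

for entry in inactive_shards(sample):
    print(entry)
```

If the `prirep` column shows `p` for the initializing shards, the activity is on primaries; if it shows `r`, some index has replicas configured after all.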

What do your logs show?
Why are you running with no replicas?

We have a standby cluster, so we decided not to have replicas. Both the primary and the standby cluster show similar behavior. The logs do not have much information about the cluster state flapping. We haven't seen any out-of-memory errors related to JVM heap size either.

YELLOW is a funny colour for a cluster with no replicas, because it means that all the primaries are assigned but there are unassigned replicas.
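Given that reasoning, it may be worth confirming that every index really is configured with zero replicas. Here is a small Python sketch, assuming the input is the parsed JSON returned by `GET /_all/_settings`; the function name `indices_with_replicas` and the sample index names are hypothetical.

```python
def indices_with_replicas(settings):
    """Map index name -> replica count for every index whose count is not 0.

    `settings` is expected to look like the parsed response of the index
    settings API: {index_name: {"settings": {"index": {...}}}}.
    """
    found = {}
    for name, body in settings.items():
        # Setting values are returned as strings, e.g. "0" or "1".
        raw = body.get("settings", {}).get("index", {}).get("number_of_replicas")
        if raw is not None and int(raw) != 0:
            found[name] = int(raw)
    return found


# Hypothetical sample shaped like the settings API response.
sample_settings = {
    "logs-a": {"settings": {"index": {"number_of_replicas": "0", "number_of_shards": "5"}}},
    "logs-b": {"settings": {"index": {"number_of_replicas": "1", "number_of_shards": "5"}}},
}

print(indices_with_replicas(sample_settings))
```

An empty result would support the claim that no replicas are configured anywhere; any entry in the result points at an index that can legitimately turn the cluster yellow.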

Please could you share the actual log messages that you're seeing? Without them all we can do is speculate. The answer may be contained in messages that do not look relevant at first sight, so don't hold back.

Is there a way I can upload the logs to this forum? I don't see an option to attach a file.

You can paste them in, or use gist/pastebin/etc and link them.

Please format your code/logs/config using the </> button, or markdown style back ticks. It helps to make things easy to read which helps us help you :slight_smile:


I have shared the logs via Google Drive. Here is the link. Thanks in advance to all of you for your help.

The events where you see dropped metrics are where I found the cluster health flapping to yellow for a brief moment and then returning to green. There is currently no data or service loss.

I can't see any indication of changes in the cluster health in these logs, but there's only the one log file (from lca1-app1361.corp.rapidms.com). Was this the elected master at the time? If not, can we have the master's logs?

How are you observing the cluster's health?
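For reference, one simple way to observe the health is to poll `GET /_cluster/health` and record every status change together with the shard counters at that moment. The sketch below uses only the Python standard library. The function names, the polling interval and the base URL passed to `watch` are assumptions for illustration, not something taken from the cluster in question.

```python
import json
import time
import urllib.request


def fetch_health(base_url, timeout=5):
    """Fetch and parse the cluster health document from the given node URL."""
    url = base_url.rstrip("/") + "/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def status_changes(samples):
    """Return one record per status change in a sequence of (timestamp, health) pairs.

    Each record is (timestamp, old_status, new_status,
    initializing_shards, unassigned_shards).
    """
    changes = []
    previous = None
    for ts, health in samples:
        status = health["status"]
        if previous is not None and status != previous:
            changes.append(
                (ts, previous, status,
                 health["initializing_shards"], health["unassigned_shards"])
            )
        previous = status
    return changes


def watch(base_url, interval=1.0):
    """Poll forever and print a line whenever the status changes."""
    last = None
    while True:
        now = time.strftime("%Y-%m-%dT%H:%M:%S")
        health = fetch_health(base_url)
        sample = (now, health)
        if last is not None:
            for change in status_changes([last, sample]):
                print(change)
        last = sample
        time.sleep(interval)
```

Calling `watch("http://localhost:9200")` (the address is a placeholder) prints a line each time the status flips, which gives timestamps to line up against the master's logs. A short interval matters here because a brief yellow period can fall entirely between two polls.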

Sorry about the delay in response. I was out on vacation. I'll share the logs soon.
