3-node cluster health is always YELLOW

Hi,
I have a 3-node cluster (2 data nodes, all 3 master-eligible) with minimum_master_nodes: 2. I am using the defaults for shards and replicas (5 and 1 respectively), but my cluster health always shows YELLOW.
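
For reference, the relevant parts of my elasticsearch.yml look roughly like this (a sketch of the topology above; the exact file contents may differ):

# On the two data nodes (master-eligible and data, the defaults):
node.master: true
node.data: true

# On the third, dedicated master node:
node.master: true
node.data: false

# On all three nodes:
discovery.zen.minimum_master_nodes: 2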

To test production maintenance scenarios, I was taking one node down at a time. These were the scenarios:

When I take the 1st data node down, the cluster health turns RED.
When I bring the 1st data node back, the cluster health turns YELLOW.
When I take the 2nd data node down, the cluster health turns YELLOW.
When I take the 3rd (master-only) node down, the cluster health turns YELLOW.

There is no scenario where my Cluster health is GREEN.

Is there any issue with my configuration? Please advise.

Attaching the cluster health of the indices:

{
  "cluster_name": "production",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 2,
  "active_primary_shards": 6,
  "active_shards": 9,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 3,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 75,
  "indices": {
    ".kibana": {
      "status": "yellow",
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "active_primary_shards": 1,
      "active_shards": 1,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 1
    },
    "logindex-2016.08.30": {
      "status": "yellow",
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "active_primary_shards": 5,
      "active_shards": 8,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 2
    }
  }
}
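
For reference, index-level health like the above comes from the cluster health API with level=indices, e.g.:

curl -XGET 'http://localhost:9200/_cluster/health?level=indices&pretty'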

Thanks,
RK

Are you allowing the cluster to recover from taking these nodes down?

Hi Mark,

Thanks for your reply!

So if I understand correctly, the cluster may require some time (depending on the data it contains) to recover and turn GREEN/YELLOW. Am I correct?

But I have a question here: when I have 2 data nodes and I bring one of them down, the cluster should ideally turn YELLOW, not RED. In this scenario, does it also take time to turn YELLOW? I am concerned because the cluster will not serve any requests until it turns at least YELLOW.
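
If it helps when scripting these tests, I understand the health API can block until a given status is reached (it returns "timed_out": true if the status isn't reached in time; the 60s timeout here is just an example):

curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60s&pretty'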

I waited for more than 5 hours with all the nodes up and running, but the cluster still shows YELLOW with the following stats.

I see there is no data in the unassigned shards. So I'm assuming the shards get allocated only when data is present. Am I correct?

Thanks,
RK

Yes. It's eventual, not immediate.

Depends: do all your indices have replicas?

No, but let's come back to this.

Yes, it's the default setting (shards: 5, replicas: 1).

If you have one set of replicas and more than one node, and you remove one node, then it should go from RED to YELLOW reasonably quickly as it detects the node loss and compensates.
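
Promoting a surviving replica to primary, which is what takes you from RED back to YELLOW, should be near-immediate. Rebuilding the lost replicas elsewhere, which is what takes you from YELLOW to GREEN, can be deliberately delayed: on versions that support it, index.unassigned.node_left.delayed_timeout defaults to 1m. A sketch of shortening it across all indices while testing:

curl -XPUT 'http://localhost:9200/_all/_settings' -d '
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "30s"
  }
}'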

Are you saying this never happens?

Yes, the cluster is not changing from RED to YELLOW.

Currently I have the following data, with no new data coming in, and my cluster state is YELLOW.

When I bring down one of the nodes, the cluster state is RED with the following stats.

I see replica shards are not being allocated on this particular node, 10.227.198.99.

Is there any setting we need to specify explicitly for the allocation of replica shards?

Thanks,
RK

Please don't post pictures of text; they are difficult to read, and some people may not even be able to see them.

Before you remove a node, is your cluster green?

No, it's YELLOW.

Then you need to fix that.

Check _cat/recovery and _cat/allocation.
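
On 5.x and later there is also the allocation explain API, which reports why a particular shard is unassigned; with no request body it picks an arbitrary unassigned shard:

curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'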

GET _cat/allocation?v gives the following result:

shards disk.indices disk.used disk.avail disk.total disk.percent host          ip            node          
    11          2mb       5gb     93.2gb     98.3gb            5 10.227.205.24 10.227.205.24 ogawsl78491dv 
     0           0b       5gb     93.2gb     98.3gb            5 10.227.198.99 10.227.198.99 ogawsl78492dv 
    11                                                                                       UNASSIGNED  

GET _cat/recovery?v gives the following:

index               shard time type  stage source_host   target_host   repository snapshot files files_percent bytes bytes_percent total_files total_bytes translog translog_percent total_translog 
logindex-2016.08.31 0     124  store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        4           4364        0        100.0%           0              
logindex-2016.08.31 1     156  store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        7           7401        0        100.0%           0              
logindex-2016.08.31 2     129  store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        4           3768        0        100.0%           0              
logindex-2016.08.31 3     146  store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        4           4245        0        100.0%           0              
logindex-2016.08.31 4     20   store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        1           159         0        100.0%           0              
.kibana             0     25   store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        4           3254        0        100.0%           0              
logindex-2016.08.30 0     57   store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        29          435885      0        100.0%           0              
logindex-2016.08.30 1     42   store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        20          396800      0        100.0%           0              
logindex-2016.08.30 2     55   store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        26          431020      0        100.0%           0              
logindex-2016.08.30 3     29   store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        20          411695      0        100.0%           0              
logindex-2016.08.30 4     51   store done  10.227.205.24 10.227.205.24 n/a        n/a      0     100.0%        0     100.0%        34          421654      0        100.0%           0              

Any advice?

Thanks,
RK

Try removing the replicas and re-adding them.
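
Something like this, assuming every index should end up with one replica (this uses _all; substitute a specific index name if you prefer):

curl -XPUT 'http://localhost:9200/_all/_settings' -d '{"index": {"number_of_replicas": 0}}'
curl -XPUT 'http://localhost:9200/_all/_settings' -d '{"index": {"number_of_replicas": 1}}'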

With the command curl -XGET http://localhost:9200/_cat/shards?v, I found that none of the shards were getting allocated on Node2 when Node1 was down. So I made my 3rd node a data node as well, and it worked fine. From this I figured out that there was a problem with my 2nd data node's installation, so I uninstalled Elasticsearch on that node and reinstalled it, and now it works perfectly fine, as expected.

Thanks for your time and suggestions! They really helped.

Thanks,
RK