Cluster turns red after reboot

Hi

I have a two-node ES setup, as shown below:

[root@metrics-datastore-0 esutilities]# sh check_cluster.sh
{
  "cluster_name" : "metrics-datastore",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 3,
  "active_shards" : 3,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 75.0
}

When I restarted, it turned RED:

[root@metrics-datastore-0 esutilities]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
logs-2018.11.27.07 0 p UNASSIGNED CLUSTER_RECOVERED

Is there a way to fix it? Can you also explain how this could have happened and how to prevent it?

Even though the cluster is RED, it still accepts requests to the other indices.

Hi,

Have you read this article: red-elasticsearch-cluster-panic-no-longer?

What @cy_lir said, but the TL;DR is to run GET /_cluster/allocation/explain and share the output here if it's unclear. Please use the </> button to format output; it makes it much easier to read.
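
For example, something like this (adjust the host and port for your setup):

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'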

I am using ES 2.4, so the allocation explain API does not exist. Is there any other way we can debug this?

Ah, ok, then it's harder. The high-level issue is that there is no available on-disk copy of shard 0 of index logs-2018.11.27.07. Off the top of my head this will either be because:

  1. there is an on-disk copy, but it's corrupt
  2. the node holding the on-disk copy is no longer in the cluster.

I don't know 2.4 very well, but I think the first of these would result in lots of log messages; I'm not sure how to determine the second. The health output you quote mentions 2 nodes, of which one is a data node. Is this right, or should there be another node?
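
One thing you can do on 2.4 is list the nodes the elected master currently sees, and check that the data node which held that shard's copy is really in the cluster:

curl -XGET 'localhost:9200/_cat/nodes?v'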

Is there anything useful in the log?

We usually keep logging at INFO, so when the restart happens there are only INFO messages... and yes, of the two nodes one is data and one is master.

I also tested a similar scenario with 6 nodes (3 master and 3 data). When all of them were restarted, I ended up with two indices in RED:

[root@metrics-datastore-0 esutilities]# sh check_cluster.sh 
{
  "cluster_name" : "metrics-datastore",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 30,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 6,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 83.33333333333334
}

[root@metrics-datastore-0 esutilities]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
logs-2018.11.28.11 0 p UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.28.11 0 r UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.28.11 0 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.25 1 p UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED 

Is there a way to recover from this without data loss? I know that if only replicas were in this state, rerouting would help, but here the primary shards are also in CLUSTER_RECOVERED. Can we do something to recover?

In the 3-master-node case, what is the value of minimum_master_nodes in each node's configuration file?

It is configured as 2. minimum_master_nodes=2

Ok that's right for the larger cluster.
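
For reference, this is the single setting I mean in elasticsearch.yml on each master-eligible node; 2 is correct for three master-eligible nodes (floor(3/2) + 1), and 1 for the single-master setup:

discovery.zen.minimum_master_nodes: 2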

You seem to have multiple clusters with the same name. It is possible that nodes might be joining the wrong cluster when started. Does this effect still occur if you only run one cluster at a time?
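
If the two environments could ever see each other on the network, it's also worth giving each one a distinct cluster.name in elasticsearch.yml so nodes can't join the wrong cluster by accident. The names below are just examples:

# setup1's elasticsearch.yml
cluster.name: metrics-datastore-setup1

# setup2's elasticsearch.yml
cluster.name: metrics-datastore-setup2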

Can you reproduce this on a version that isn't past the end of its supported life, i.e. 5.6 or above?

I have only one cluster in my setup, but I am using version 2.4. Is this a known issue in this version? Upgrading to the latest version is a big task for us.

I am confused. This thread started out asking about a cluster called metrics-datastore with 1 master-eligible node and 1 data node, and then asked about a cluster with the same name with 3 master-eligible nodes and 3 data nodes. Are these the same cluster? If so, why the discrepancy in size?

Not really. I mean, if you do strange things to a cluster then yes this might lose data, but a properly managed cluster doesn't behave like this. As I said I am confused.

These two are separate setups: setup1 has one master node and one data node, and setup2 has 3 master and 3 data nodes.

In both setups, when they restart I end up with CLUSTER_RECOVERED unassigned shards. They are on separate networks and are separate installations.

I might be missing some configs. In setup1 minimum_master_nodes = 1 and in setup2 minimum_master_nodes = 2, and I also have:

gateway.expected_nodes: 1

Does this cause the issue?

You just shared your AWS keys. Please rotate them immediately.

I'll look in more detail later, but that's urgent.


This seems odd. You're telling it to try and find at least 2 (really 3) master-eligible nodes but only giving it one address to try. Perhaps you are expecting this name to resolve to multiple addresses and then for Elasticsearch to try them all, but this isn't how it works. I would try giving it the addresses of all three master-eligible nodes, or using one of the discovery plugins to discover the master-eligible nodes dynamically.
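
Concretely, something along these lines in elasticsearch.yml on every node of the 6-node cluster. The hostnames are made up, so substitute your real addresses or IPs, and the gateway values are only an assumption to tune for your setup:

discovery.zen.ping.unicast.hosts: ["metrics-master-0", "metrics-master-1", "metrics-master-2"]
discovery.zen.minimum_master_nodes: 2
# optional: wait for the whole cluster before starting state recovery
gateway.recover_after_nodes: 6
gateway.expected_nodes: 6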

OK, thanks for notifying me about the AWS keys. I've taken care of it.

I will try putting in the addresses instead of metrics-master.

I tried giving the service IPs, but it didn't work out:

[root@metrics-master-0 esutilities]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
metrics-2018.11-10 1 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.11 2 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.11 2 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.11 2 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.12 0 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.12 0 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.12 0 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.25 1 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED

Ok, I think adding logger.gateway: TRACE to the config file on every node will give a little bit more detail about what's going on.
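
That's literally one line in elasticsearch.yml on each node (remove it again once you've captured the logs, as TRACE is very noisy):

logger.gateway: TRACE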

If I understand right, your problem is that you have a green cluster with all shards assigned, but when you restart it, it has unassigned shards and reports red health. If so, I would like to see logs from all nodes, starting with a green cluster, then shutting everything down and starting it all back up again.
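
A rough sketch of the test I mean, assuming systemd manages Elasticsearch and the default log location; adjust both to your installation:

# 1. confirm the cluster is green
curl -XGET 'localhost:9200/_cluster/health?pretty'
# 2. stop Elasticsearch on every node, then start it again on every node
systemctl stop elasticsearch
systemctl start elasticsearch
# 3. re-check health and collect the cluster log (e.g. /var/log/elasticsearch/metrics-datastore.log) from each node
curl -XGET 'localhost:9200/_cluster/health?pretty'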

OK, I will share the same.