Cluster turns to red after reboot


(Yogesh BG) #1

Hi

I have a two-node ES setup, as below:

[root@metrics-datastore-0 esutilities]# sh check_cluster.sh
{
"cluster_name" : "metrics-datastore",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 1,
"active_primary_shards" : 3,
"active_shards" : 3,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 75.0
}

When I restarted, it turned RED:

[root@metrics-datastore-0 esutilities]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
logs-2018.11.27.07 0 p UNASSIGNED CLUSTER_RECOVERED

Is there a way to fix it? And can you explain how this could have happened, and how to prevent it?

Though the cluster is RED, it still accepts requests to the other indices.
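
(For what it's worth, _cat/indices shows per-index health, so something like curl -XGET localhost:9200/_cat/indices | grep red should confirm that only that one index is affected.)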


(cy_lir) #2

Hi,

Have you read this article: red-elasticsearch-cluster-panic-no-longer?


(David Turner) #3

What @cy_lir said, but the TL;DR is to run GET /_cluster/allocation/explain and share the output here if it's unclear. Please use the </> button to format the output; it makes it much easier to read.
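
For example, on 5.x and later, something like:

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'

With no request body it explains the first unassigned shard it finds, which is usually the one you want.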


(Yogesh BG) #4

I am using ES 2.4, so the allocation explain API does not exist. Is there any way we can debug this?


(David Turner) #5

Ah, ok, then it's harder. The high-level issue is that there is no available on-disk copy of shard 0 of index logs-2018.11.27.07. Off the top of my head this will either be because:

  1. there is an on-disk copy, but it's corrupt; or
  2. the node holding the on-disk copy is no longer in the cluster.

I don't know 2.4 very well, but I think the first of these will result in lots of log messages; I'm not sure how to determine the second. The health output you quoted mentions 2 nodes, of which one is a data node. Is this right, or should there be another node?

Is there anything useful in the log?
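
You can see which nodes the cluster currently contains with _cat/nodes:

curl -XGET 'localhost:9200/_cat/nodes?v'

Also, on 2.x I believe the closest substitute for allocation explain is the reroute API's dry-run explain mode. A sketch, where <data-node-name> is a placeholder for a real node name from the output above:

curl -XPOST 'localhost:9200/_cluster/reroute?explain&dry_run=true' -d '{
  "commands" : [
    { "allocate" : { "index" : "logs-2018.11.27.07", "shard" : 0, "node" : "<data-node-name>" } }
  ]
}'

The response should list each allocation decider's verdict for that shard, which may say why it isn't being assigned.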


(Yogesh BG) #6

We usually keep logging at INFO level, and when the restart happens it comes up with only INFO messages... And yes, of the two nodes one is a data node and one is a master.


(Yogesh BG) #7

I tested a similar scenario again with 6 nodes (3 master and 3 data). When all of them were restarted, two indices ended up RED:

[root@metrics-datastore-0 esutilities]# sh check_cluster.sh 
{
  "cluster_name" : "metrics-datastore",
  "status" : "red",
  "timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 3,
"active_primary_shards" : 10,
"active_shards" : 30,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 6,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 83.33333333333334
}

[root@metrics-datastore-0 esutilities]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
logs-2018.11.28.11 0 p UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.28.11 0 r UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.28.11 0 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.25 1 p UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED 

Is there a way to recover from this without data loss? I know that if only replicas were in this state, rerouting would help, but here primary shards are also in CLUSTER_RECOVERED. Can we do something to recover?
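
(By rerouting I mean something like this sketch, with a placeholder node name:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [
    { "allocate" : { "index" : "logs-2018.11.28.11", "shard" : 0, "node" : "<data-node-name>" } }
  ]
}'

My understanding is that doing this for a primary needs allow_primary: true, which creates an empty shard and loses its data, so I'd like to avoid that.)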


(David Turner) #8

In the 3-master-node case, what is the value of minimum_master_nodes in each node's configuration file?


(Yogesh BG) #9

It is configured as 2, i.e. minimum_master_nodes=2.
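
That is, each node's elasticsearch.yml contains (zen discovery, the 2.x default):

discovery.zen.minimum_master_nodes: 2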


(David Turner) #10

Ok that's right for the larger cluster.

You seem to have multiple clusters with the same name. It is possible that nodes might be joining the wrong cluster when started. Does this effect still occur if you only run one cluster at a time?

Can you reproduce this on a version that isn't past the end of its supported life, i.e. 5.6 or above?


(Yogesh BG) #11

I have only one cluster in my setup, but I am using version 2.4. Is this a known issue in this version? Upgrading to the latest version is a big task for us.


(David Turner) #12

I am confused. This thread started out asking about a cluster called metrics-datastore with 1 master-eligible node and 1 data node, and then asked about a cluster with the same name with 3 master-eligible nodes and 3 data nodes. Are these the same cluster? If so, why the discrepancy in size?

Not really. I mean, if you do strange things to a cluster then yes, this might lose data, but a properly managed cluster doesn't behave like this. As I said, I am confused.


(Yogesh BG) #13

These two are separate setups: setup 1 has one master node and one data node; setup 2 has 3 master and 3 data nodes.

In both setups, when they restart I end up with shards UNASSIGNED with reason CLUSTER_RECOVERED. They are on separate networks and are separate installations.

I might be missing some configs.

In setup 1, minimum_master_nodes = 1, and in setup 2, minimum_master_nodes = 2.


(Yogesh BG) #15

gateway.expected_nodes: 1

Does this cause the issue?
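
For context, my understanding of the related 2.x gateway settings in elasticsearch.yml is roughly this (values are illustrative for the 6-node setup, not a recommendation):

gateway.expected_nodes: 6        # start recovery as soon as this many nodes have joined
gateway.recover_after_nodes: 4   # otherwise require at least this many nodes...
gateway.recover_after_time: 5m   # ...and wait this long before recovering local shards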


(David Turner) #16

You just shared your AWS keys. Please rotate them immediately.

I'll look in more detail later, but that's urgent.


(David Turner) #17

This seems odd. You're telling it to try and find at least 2 (really 3) master-eligible nodes but only giving it one address to try. Perhaps you are expecting this name to resolve to multiple addresses and then for Elasticsearch to try them all, but this isn't how it works. I would try giving it the addresses of all three master-eligible nodes, or using one of the discovery plugins to discover the master-eligible nodes dynamically.
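
A sketch of what I mean, in each node's elasticsearch.yml; the host names are placeholders for your three master-eligible nodes:

discovery.zen.ping.unicast.hosts: ["metrics-master-0", "metrics-master-1", "metrics-master-2"]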


(Yogesh BG) #18

OK, thanks for notifying me about the AWS keys; I've taken care of it.

I will try putting the addresses instead of metrics-master.


(Yogesh BG) #19

I tried giving the service IPs, but it didn't work out:

[root@metrics-master-0 esutilities]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
metrics-2018.11-10 1 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.11 2 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.11 2 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.11 2 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.12 0 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.12 0 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.12 0 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.25 1 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED


(David Turner) #20

Ok, I think adding logger.gateway: TRACE to the config file on every node will give a little bit more detail about what's going on.

If I understand correctly, your problem is that you have a green cluster with all shards assigned, but when you restart it, it has unassigned shards and reports red health. If so, I would like to see logs from all nodes, covering the whole sequence: starting with a green cluster, shutting everything down, and starting it all back up again.
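
If editing the config file on every node is awkward, I believe 2.x can also change logger levels dynamically through the cluster settings API, something like:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : { "logger.gateway" : "TRACE" }
}'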


(Yogesh BG) #21

OK, I will share the same.