Cluster turns red after reboot

Hi

I have a two-node ES setup, as shown below:

[root@metrics-datastore-0 esutilities]# sh check_cluster.sh
{
  "cluster_name" : "metrics-datastore",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 3,
  "active_shards" : 3,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 75.0
}

When I restarted, it turned RED:

[root@metrics-datastore-0 esutilities]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
logs-2018.11.27.07 0 p UNASSIGNED CLUSTER_RECOVERED

Is there a way to fix it? Can you also explain how this could have happened and how to prevent it?

Even though the cluster is RED, it still accepts requests to the other indices.

Hi,

Have you read this article: red-elasticsearch-cluster-panic-no-longer?

What @cy_lir said, but the TL;DR is to run GET /_cluster/allocation/explain and share the output here if it's unclear. Please use the </> button to format output; it makes it much easier to read.
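
For example, something like this (adjust the host and port for your setup):

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'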

I am using ES 2.4, so the allocation explain API does not exist. Is there any other way we can debug this?

Ah, ok, then it's harder. The high-level issue is that there is no available on-disk copy of shard 0 of index logs-2018.11.27.07. Off the top of my head this will either be because:

  1. there is an on-disk copy, but it's corrupt
  2. the node holding the on-disk copy is no longer in the cluster.

I don't know 2.4 very well, but I think the first of these would result in lots of log messages; I'm not sure how to determine the second. The health output you quote mentions 2 nodes, of which one is a data node. Is this right, or should there be another node?
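
One thing you can do on 2.4 is list the nodes the elected master currently sees, and check that the data node which held that shard's copy is really in the cluster:

curl -XGET 'localhost:9200/_cat/nodes?v'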

Is there anything useful in the log?

We usually keep logging at INFO, so when the restart happens there are only INFO messages... and yes, of the two nodes one is data and one is master.

I also tested a similar scenario with 6 nodes (3 master and 3 data). When all of them were restarted, I ended up with two indices in RED:

[root@metrics-datastore-0 esutilities]# sh check_cluster.sh 
{
  "cluster_name" : "metrics-datastore",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 30,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 6,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 83.33333333333334
}

[root@metrics-datastore-0 esutilities]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
logs-2018.11.28.11 0 p UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.28.11 0 r UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.28.11 0 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.25 1 p UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED 

Is there a way to recover from this without data loss? I know that if only replicas were in this state, rerouting would help, but here the primary shards are also in CLUSTER_RECOVERED. Can we do something to recover?

In the 3-master-node case, what is the value of minimum_master_nodes in each node's configuration file?

It is configured as 2. minimum_master_nodes=2

Ok that's right for the larger cluster.
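
For reference, this is the single setting I mean in elasticsearch.yml on each master-eligible node; 2 is correct for three master-eligible nodes (floor(3/2) + 1), and 1 for the single-master setup:

discovery.zen.minimum_master_nodes: 2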

You seem to have multiple clusters with the same name. It is possible that nodes might be joining the wrong cluster when started. Does this effect still occur if you only run one cluster at a time?
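
If the two environments could ever see each other on the network, it's also worth giving each one a distinct cluster.name in elasticsearch.yml so nodes can't join the wrong cluster by accident. The names below are just examples:

# setup1's elasticsearch.yml
cluster.name: metrics-datastore-setup1

# setup2's elasticsearch.yml
cluster.name: metrics-datastore-setup2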

Can you reproduce this on a version that isn't past the end of its supported life, i.e. 5.6 or above?

I have only one cluster in my setup, but I am using version 2.4. Is this a known issue in this version? Upgrading to the latest version is a big task for us.

I am confused. This thread started out asking about a cluster called metrics-datastore with 1 master-eligible node and 1 data node, and then asked about a cluster with the same name with 3 master-eligible nodes and 3 data nodes. Are these the same cluster? If so, why the discrepancy in size?

Not really. I mean, if you do strange things to a cluster then yes this might lose data, but a properly managed cluster doesn't behave like this. As I said I am confused.

These two are separate setups: setup1 has one master node and one data node, and setup2 has 3 master and 3 data nodes.

In both setups, when they restart I end up with CLUSTER_RECOVERED unassigned shards. They are on separate networks and are separate installations.

I might be missing some configs. In setup1 minimum_master_nodes = 1 and in setup2 minimum_master_nodes = 2, and I also have:

gateway.expected_nodes: 1

Does this cause the issue?

You just shared your AWS keys. Please rotate them immediately.

I'll look in more detail later, but that's urgent.


This seems odd. You're telling it to try and find at least 2 (really 3) master-eligible nodes but only giving it one address to try. Perhaps you are expecting this name to resolve to multiple addresses and then for Elasticsearch to try them all, but this isn't how it works. I would try giving it the addresses of all three master-eligible nodes, or using one of the discovery plugins to discover the master-eligible nodes dynamically.
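
Concretely, something along these lines in elasticsearch.yml on every node of the 6-node cluster. The hostnames are made up, so substitute your real addresses or IPs, and the gateway values are only an assumption to tune for your setup:

discovery.zen.ping.unicast.hosts: ["metrics-master-0", "metrics-master-1", "metrics-master-2"]
discovery.zen.minimum_master_nodes: 2
# optional: wait for the whole cluster before starting state recovery
gateway.recover_after_nodes: 6
gateway.expected_nodes: 6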

OK, thanks for notifying me about the AWS keys. I've taken care of it.

I will try putting in the addresses instead of metrics-master.

I tried giving the service IPs, but it didn't work out:

[root@metrics-master-0 esutilities]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
metrics-2018.11-10 1 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.11 2 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.11 2 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.11 2 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.12 0 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.12 0 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.30.12 0 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.25 1 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.25 1 r UNASSIGNED CLUSTER_RECOVERED

Ok, I think adding logger.gateway: TRACE to the config file on every node will give a little bit more detail about what's going on.
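
That's literally one line in elasticsearch.yml on each node (remove it again once you've captured the logs, as TRACE is very noisy):

logger.gateway: TRACE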

If I understand right, your problem is that you have a green cluster with all shards assigned, but when you restart it, it has unassigned shards and reports red health. If so, I would like to see logs from all nodes, starting with a green cluster, then shutting everything down and starting it all back up again.
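
A rough sketch of the test I mean, assuming systemd manages Elasticsearch and the default log location; adjust both to your installation:

# 1. confirm the cluster is green
curl -XGET 'localhost:9200/_cluster/health?pretty'
# 2. stop Elasticsearch on every node, then start it again on every node
systemctl stop elasticsearch
systemctl start elasticsearch
# 3. re-check health and collect the cluster log (e.g. /var/log/elasticsearch/metrics-datastore.log) from each node
curl -XGET 'localhost:9200/_cluster/health?pretty'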

OK, I will share the same.