Cluster turns to red after reboot


(Yogesh BG) #22

Meanwhile, one more thing I can tell you:

When the cluster is in green state and any one node goes down, everything is fine. The cluster goes to yellow state and keeps working.

When that node comes back and rejoins the cluster, that also works fine.

We face this issue only when all nodes get restarted at the same time.


(Yogesh BG) #23

Below I have shared the logs from before and during the restart, for all 6 nodes:

https://drive.google.com/drive/folders/1pZml3vOOMdst1OJuCyl-DxdyLe4pYPqr?usp=sharing

When we changed the unicast address to include all 3 master node IPs, all the indices went to red state and were in NODE_LEFT state.

One thing I want to point out is that when the nodes restart, the IPs of the containers change.

We run Elasticsearch as Kubernetes containers.


(Yogesh BG) #24

One more thing I want to ask: will there be any issues if a node restarts and rebalancing is happening while active traffic continues, for example when we try to write to an index that is being recovered? (I want to know from the ES side. I know the client will fail to write, and that's okay, but is there any problem on the server side?)


(David Turner) #25

Looking at just one unassigned shard, [logs-2018.12.04.07][1], I see this:

[2018-12-04 08:14:03,184][TRACE][gateway                  ] [metrics-master-2] [[logs-2018.12.04.07][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2018-12-04T08:13:54.398Z]]] on node [{metrics-datastore-2}{qQ995p5ERmS0O5o7yK3VtA}{192.168.13.70}{192.168.13.70:9300}{max_local_storage_nodes=1, master=false}] has version [-1] of shard
[2018-12-04 08:14:03,184][TRACE][gateway                  ] [metrics-master-2] [[logs-2018.12.04.07][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2018-12-04T08:13:54.398Z]]] on node [{metrics-datastore-1}{C279DcEfRDeqr2wDgJF5bQ}{192.168.13.214}{192.168.13.214:9300}{max_local_storage_nodes=1, master=false}] has version [6] of shard
[2018-12-04 08:14:03,184][TRACE][gateway                  ] [metrics-master-2] [[logs-2018.12.04.07][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2018-12-04T08:13:54.398Z]]] on node [{metrics-datastore-0}{ZxDL21BbStCXGRD2GVieNA}{192.168.13.17}{192.168.13.17:9300}{max_local_storage_nodes=1, master=false}] has version [-1] of shard
[2018-12-04 08:14:03,184][TRACE][gateway                  ] [metrics-master-2] [logs-2018.12.04.07][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2018-12-04T08:13:54.398Z]] candidates for allocation: [[metrics-datastore-1] -> 6, ]
[2018-12-04 08:14:03,184][DEBUG][gateway                  ] [metrics-master-2] [logs-2018.12.04.07][1] found 1 allocations of [logs-2018.12.04.07][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2018-12-04T08:13:54.398Z]], highest version: [6]
[2018-12-04 08:14:03,184][DEBUG][gateway                  ] [metrics-master-2] [logs-2018.12.04.07][1]: not allocating, number_of_allocated_shards_found [1]

I think this tells us that the master is looking for at least two copies of this shard, but only found one, on metrics-datastore-1. There should have been copies on one of the other two data nodes, but it looks like they were unassigned before/during the shutdown of the cluster.
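
To make that reading of the log concrete, here is a rough Python sketch that pulls the per-node shard versions out of TRACE lines like the ones quoted above and compares the number of usable copies with a required count. The regular expression is only an assumption based on these few lines, not a documented log format. The helper names (`shard_copies`, `usable_copies`, `can_allocate_primary`) are made up for illustration, and treating version -1 as "no copy on that node" is inferred from the "candidates for allocation" line, which lists only metrics-datastore-1. The `required` value of 2 in the usage note reflects the interpretation given in this post, not a confirmed setting.

```python
import re

# Assumed pattern, derived only from the TRACE lines quoted in this thread.
# The optional backslashes tolerate copies of the log where braces are escaped.
NODE_VERSION = re.compile(
    r"on node \[\\?\{(?P<node>[^}\\]+)\\?\}.*?"
    r"has version \[(?P<version>-?\d+)\] of shard"
)


def shard_copies(log_lines):
    """Return {node_name: version} for every node that reported on the shard."""
    copies = {}
    for line in log_lines:
        match = NODE_VERSION.search(line)
        if match:
            copies[match.group("node")] = int(match.group("version"))
    return copies


def usable_copies(copies):
    """Keep only nodes with a non-negative version (assumed: -1 = no copy)."""
    return {node: version for node, version in copies.items() if version >= 0}


def can_allocate_primary(copies, required):
    """True if at least `required` usable copies were found (hypothetical rule)."""
    return len(usable_copies(copies)) >= required
```

Fed the three TRACE lines above, `shard_copies` would report one node at version 6 and two at -1, so `can_allocate_primary(copies, required=2)` comes out false, which matches the "not allocating, number_of_allocated_shards_found [1]" message.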

I think this is one of the many resilience issues fixed in later versions, perhaps "Make index creation resilient to index closing and full cluster crashes". The fix is really to upgrade.

Numerous things have been fixed in this area in more recent versions of Elasticsearch.


(Yogesh BG) #26

But in my setup the shards are always assigned and started before I restart. According to you, this could happen in a situation where a new index is being created and its shards are yet to be allocated.

Before restarting I always ensured the cluster was in green state, with 100% active shards and no unassigned shards.
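
For reference, a pre-restart check of this kind could be scripted roughly as below. This is only a sketch: it reads the cluster health API (`GET /_cluster/health`) and looks at the `status`, `unassigned_shards`, `initializing_shards` and `relocating_shards` fields of the response. The function names and the base URL are hypothetical, and a passing check only describes the cluster before the restart, not what the nodes will report afterwards.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url):
    """Fetch the cluster health document from a node's HTTP endpoint.

    base_url is an assumed address such as "http://localhost:9200".
    """
    with urlopen(base_url.rstrip("/") + "/_cluster/health") as resp:
        return json.load(resp)


def safe_to_restart(health):
    """True only if the cluster is green and no shard is unassigned or moving."""
    return (
        health.get("status") == "green"
        and health.get("unassigned_shards") == 0
        and health.get("initializing_shards") == 0
        and health.get("relocating_shards") == 0
    )
```

For example, `safe_to_restart(fetch_health("http://localhost:9200"))` could be run against any node just before the restart, with the address adjusted to the actual cluster.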


(David Turner) #27

And yet according to the logs they're not there when the nodes come back after the restart. Apart from upgrading I don't really know what else to suggest. Maybe someone else can help.


(Yogesh BG) #28

Thanks for the help and for your patience.

We are trying to figure out what could be the root cause, if we cannot fix it using ES 2.4.

The last question I have is: we run ES nodes as Kubernetes containers. When they restart, the IP of each container changes, and so do the unicast addresses, which we update as part of the pod restarts. So when the ES cluster comes up, it comes up with new unicast IPs.

We made a setup in AWS without using Kubernetes containers: a 6-node system. We created some indices and restarted all the nodes together. Here the IPs don't change; they are Elastic IPs.

So could this matter in any case? This is the final suspect we have.


(David Turner) #30

I believe that the addresses of the nodes don't matter to Elasticsearch here - they definitely don't matter to more recent versions.


(Yogesh BG) #31

Okay thank you very much