Shards unassigned after some nodes went down


We have an ES cluster consisting of 6 nodes, 3 in one data center and 3 in another. All of them are master-eligible, but only 4 are data nodes. We have 3 indices, each with 5 primary shards and 1 replica. During a disaster recovery scenario one data center went down, and afterwards the Elasticsearch cluster went to status RED with the reason:

"cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster"

I checked `GET _cat/shards?v=true&s=prirep`

And got:

index           shard prirep state      docs store ip          node
firstIndex  2     p      STARTED       0  283b xx.xx.xx.xx datanode2-datacenterB
firstIndex  1     p      STARTED       0  283b xx.xx.xx.xx datanode1-datacenterB
firstIndex  3     p      UNASSIGNED                        
firstIndex  4     p      UNASSIGNED                        
firstIndex  0     p      UNASSIGNED                        
secondIndex 2     p      STARTED    3375 2.8mb xx.xx.xx.xx datanode1-datacenterB
secondIndex 3     p      STARTED    3416 2.2mb xx.xx.xx.xx datanode1-datacenterB
secondIndex 1     p      STARTED    3411 3.2mb xx.xx.xx.xx datanode2-datacenterB
secondIndex 4     p      UNASSIGNED                        
secondIndex 0     p      STARTED    3512 2.9mb xx.xx.xx.xx datanode1-datacenterB
thirdIndex  2     p      STARTED    4688 1.3mb xx.xx.xx.xx datanode1-datacenterB
thirdIndex  1     p      STARTED    4745 1.4mb xx.xx.xx.xx datanode2-datacenterB
thirdIndex  4     p      UNASSIGNED                        
thirdIndex  3     p      UNASSIGNED                        
thirdIndex  0     p      STARTED    4845 1.4mb xx.xx.xx.xx datanode2-datacenterB
firstIndex  2     r      STARTED       0  283b xx.xx.xx.xx datanode1-datacenterB
firstIndex  1     r      STARTED       0  283b xx.xx.xx.xx datanode2-datacenterB
firstIndex  3     r      UNASSIGNED                        
firstIndex  4     r      UNASSIGNED                        
firstIndex  0     r      UNASSIGNED                        
secondIndex 2     r      STARTED    3375 2.8mb xx.xx.xx.xx datanode2-datacenterB
secondIndex 3     r      STARTED    3416 2.2mb xx.xx.xx.xx datanode2-datacenterB
secondIndex 1     r      STARTED    3411 3.2mb xx.xx.xx.xx datanode1-datacenterB
secondIndex 4     r      UNASSIGNED                        
secondIndex 0     r      STARTED    3512 2.9mb xx.xx.xx.xx datanode2-datacenterB
thirdIndex  2     r      STARTED    4688 1.3mb xx.xx.xx.xx datanode2-datacenterB
thirdIndex  1     r      STARTED    4745 1.4mb xx.xx.xx.xx datanode1-datacenterB
thirdIndex  4     r      UNASSIGNED                        
thirdIndex  3     r      UNASSIGNED                        
thirdIndex  0     r      STARTED    4845 1.4mb xx.xx.xx.xx  datanode1-datacenterB

Could anyone suggest what I can do in this situation? I'm not sure if I should add more replicas, or maybe make the number of primary shards equal to the total number of nodes in the cluster.
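For what it's worth, the cluster allocation explain API can tell you exactly why a given shard copy is unassigned. A sketch (the index and shard values here are taken from the `_cat/shards` output above):

```
GET _cluster/allocation/explain
{
  "index": "firstIndex",
  "shard": 3,
  "primary": true
}
```

The response includes a `can_allocate` decision and per-node explanations, which in a case like this should point at the in-sync copies that were lost with the downed data center.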


If you are not already, you should use shard allocation awareness to make sure each shard gets its primary allocated to one DC and its replica to the other. With this you can get by with a single replica.
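A minimal sketch of what that setup looks like, assuming you tag each node with a custom attribute (the attribute name `zone` and the values `datacenterA`/`datacenterB` are just illustrative):

```
# elasticsearch.yml on each node in data center A
node.attr.zone: datacenterA

# elasticsearch.yml on each node in data center B
node.attr.zone: datacenterB

# on all nodes
cluster.routing.allocation.awareness.attributes: zone
```

With this in place, the allocator avoids putting a shard's primary and replica in the same zone while the other zone has capacity.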

Also be aware that Elasticsearch cannot support symmetric high availability across only 2 zones. If your cluster continued to be operational when you lost half the master-eligible nodes, it may very well be misconfigured, which could also lead to data loss. Which Elasticsearch version are you using? How are the nodes configured (especially minimum_master_nodes)?

Hi Christian,

Thank you for your reply :slight_smile:

I will look into the Shard Allocation feature today, thank you :slight_smile:

We're using version 7.5.2. The configuration is as follows:
Each node has respective names: es1-Zone(A/B) es2-Zone(A/B) es3-Zone(A/B)
cluster.initial_master_nodes: es1-ZoneA,es2-ZoneA,es3-ZoneA
For discovery hosts, each node in one data center sees itself, all the other nodes in that data center, and one node in the other data center (a master-dedicated node). It was master-dedicated, but while testing this sharding issue I changed those master-only nodes to be data-eligible as well, because I thought that with more data nodes the shards could be reassigned to them. Now I cannot change them back to master-only because data is already stored on them.

Correct me if I'm wrong, but I thought that when there is an even number of master-eligible nodes in a cluster, Elasticsearch removes one of them from the voting configuration while still keeping track of it?
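For reference, the current voting configuration can be inspected from the cluster state; a sketch (the filter path assumes a 7.x cluster):

```
GET _cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
```

The response lists the node IDs currently in the voting configuration, so you can see whether one of the six master-eligible nodes has been excluded.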

How far apart are these datacenters?

Geographically? 3-4 states apart.

If you were in Australia, that'd be the entire width of the country, which is not supported due to latency concerns.

In a cluster all nodes need to see each other and be able to communicate. Distributing a cluster across data centers far apart is not supported nor recommended as it will cause performance and stability problems.

Unless you have shard allocation awareness, it is possible for both the primary and replica of a specific shard to be allocated to the same DC, which naturally impacts resiliency.
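A sketch of forced awareness, which goes one step further (the attribute name `zone` and its values are assumptions here, not settings from your cluster):

```
# elasticsearch.yml
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: datacenterA,datacenterB
```

With forced awareness, if one data center is lost, replicas that would otherwise all pile up in the surviving data center stay unassigned until the other zone comes back, preventing the remaining nodes from being overloaded.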

Thank you @Christian_Dahlqvist, the solution with shard awareness attributes worked :slight_smile:

I'm aware that our cluster config may be far from ideal, but it was a requirement for me to configure ES across both of our data centers to increase resiliency and be ready for disaster recovery scenarios. Unfortunately, these are the only 2 data centers I can deploy to, and I cannot control how far apart they are.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.