I have an Elasticsearch 6.8.23 cluster (we are upgrading to 7.x soon) with 9 data nodes and 3,000 shards, where all of the replica shards have been allocated to two of the nodes. I have removed all cluster.routing.allocation settings, but the shards stay on these two nodes. I have also tried moving shards, and all of the decisions return YES except awareness, which returns NO with the reason:
there are too many copies of the shard allocated to nodes with attribute [fault_domain], there are [2] total configured shard copies for this shard id and [3] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]
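To make the arithmetic in that message concrete, here is a small sketch of how I read it. It assumes the upper bound is the ceiling of the configured copies divided by the number of distinct attribute values; awareness_upper_bound and awareness_allows are illustrative names of my own, not Elasticsearch APIs.

```python
import math


def awareness_upper_bound(total_copies: int, attribute_values: int) -> int:
    """Assumed bound: ceil(total copies / distinct attribute values)."""
    return math.ceil(total_copies / attribute_values)


def awareness_allows(copies_on_value: int, total_copies: int, attribute_values: int) -> bool:
    """True when the per-value copy count stays within the assumed bound."""
    return copies_on_value <= awareness_upper_bound(total_copies, attribute_values)


# Numbers taken from the message: 2 configured copies, 3 fault_domain values,
# and 2 copies that would sit on the same value after the move.
bound = awareness_upper_bound(2, 3)
allowed = awareness_allows(2, 2, 3)
print(bound, allowed)
```

Under that assumption the bound is 1, so a second copy on the same fault_domain value is rejected, which is consistent with the NO decision I am seeing.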
No awareness settings are set in the cluster, and we aren't starting the nodes with a fault_domain attribute.
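For anyone who wants to double-check the same thing, this is roughly how the node attributes and awareness settings could be inspected from saved API responses. It is only a sketch: it assumes the JSON comes from GET _nodes and from GET _cluster/settings?include_defaults=true&flat_settings=true, the helper names are my own, and the sample data below is made up rather than taken from my cluster.

```python
from collections import defaultdict


def nodes_by_attribute(nodes_info: dict, attribute: str) -> dict:
    """Group node names by the value of one node attribute."""
    groups = defaultdict(list)
    for node in nodes_info.get("nodes", {}).values():
        value = node.get("attributes", {}).get(attribute)
        if value is not None:
            groups[value].append(node.get("name"))
    return dict(groups)


def awareness_settings(cluster_settings: dict) -> dict:
    """Collect awareness keys from every section of a flat settings response."""
    found = {}
    for section in ("persistent", "transient", "defaults"):
        for key, value in cluster_settings.get(section, {}).items():
            if key.startswith("cluster.routing.allocation.awareness."):
                found[f"{section}:{key}"] = value
    return found


# Hypothetical sample responses (not real cluster output).
sample_nodes = {
    "nodes": {
        "id-a": {"name": "data-1", "attributes": {"fault_domain": "0"}},
        "id-b": {"name": "data-2", "attributes": {"fault_domain": "1"}},
        "id-c": {"name": "data-3", "attributes": {}},
    }
}
sample_settings = {
    "persistent": {},
    "transient": {},
    "defaults": {"cluster.routing.allocation.awareness.attributes": ["fault_domain"]},
}

print(nodes_by_attribute(sample_nodes, "fault_domain"))
print(awareness_settings(sample_settings))
```

If either helper returns anything for the real responses, the attribute or the awareness setting is coming from somewhere other than the settings I removed.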
I have also tried setting cluster.routing.allocation.exclude to exclude the two nodes, but the shards do not move to other nodes.
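For clarity, the exclusion was along these lines. This sketch only builds the request body for PUT _cluster/settings; it assumes filtering by the built-in _name attribute, and the node names are placeholders rather than my real ones.

```python
import json


def exclude_by_name_body(node_names, transient=True):
    """Build a cluster settings body that excludes nodes by name."""
    scope = "transient" if transient else "persistent"
    return {scope: {"cluster.routing.allocation.exclude._name": ",".join(node_names)}}


# Hypothetical node names standing in for the two nodes holding the replicas.
body = exclude_by_name_body(["data-node-1", "data-node-2"])
print(json.dumps(body, indent=2))
```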
Is there something I'm missing? I'm concerned that all of the replicas are on these two servers and not spread across the other nodes.