Already deleted indices come back as dangling whenever a node restarts

Hi,
I am running an 8-node ES cluster on version 7.4.2.
Every time a node in the cluster restarts, the cluster state goes red because a lot of dangling indices that were already deleted try to get restored onto the cluster.
The only solution is to manually delete all those indices again to make the cluster green.
How do I solve this permanently?

Did you try with a more recent version? 7.10.0?

Do you manually stop the node?

No, actually the ES nodes are running on AWS spot machines, so they sometimes get replaced, i.e. they go down and come back up.

A similar thing happened again: 4000 unassigned shards due to dangling indices.
_cluster/allocation/explain output:

"shard" : 5,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "DANGLING_INDEX_IMPORTED",
    "at" : "2020-11-27T08:45:39.982Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster"

What is the output of:

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.

For now I have fixed the cluster by deleting those indices manually.
If these outputs will help, I can manually mark an instance down and cause the cluster to go red again. Please let me know.

I assume you have a number of stable nodes that act as master nodes, and that only part of the cluster is on spot instances?

I have an 8-node cluster in which 7 are master+data nodes and only 1 is a master-only node.
Basically all are master-eligible nodes.
Also, all 8 nodes are on spot instances.

Having nodes come and leave the cluster like that can probably be problematic. If you ever lost more than half of the master-eligible nodes at once you would be in serious trouble. The few times I have seen spot instances used, it has, as far as I can remember, usually involved keeping a small set of master-eligible nodes on non-spot instances and dedicated data nodes on spot instances.
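
As a rough sketch of that layout on 7.x (not your exact config, and the counts are just an example), the non-spot nodes would be dedicated master-eligible nodes and the spot nodes data-only, e.g. in elasticsearch.yml:

# on the small non-spot nodes (dedicated master-eligible)
node.master: true
node.data: false
node.ingest: false

# on the spot nodes (data only)
node.master: false
node.data: true
node.ingest: true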

Got your point, but how can I fix this now? It is a permanent issue whenever any node restarts.
How can it be solved permanently? Is there any way I can delete the state from the cluster that keeps trying to bring those deleted shards back?
Or is there any setting to switch off dangling index reassignment in version 7.4?

Yes please.

Please share your application logs from when you delete the dangling indices.

That can help identify which node is holding the problematic indices.

You should choose an odd number of master nodes (like 1, 3, 5, 7, ...) for the cluster.

You have eight master nodes in the cluster; that is not recommended.

I assume you have changed a node role, for example from master node to master/data node. That is what causes the dangling indices to be created.

This is not true; 3 is the recommended number but if you don't want to follow that advice then it doesn't really matter whether the number you pick is even or odd.

I think there's something wrong in the OP's orchestration. Deleting an index will delete all the files on disk, but since these are spot instances that seem to get randomly resurrected I suspect they're reverting to an older state.

I recommend first confirming that the index really is being deleted from disk. You can use GET _cat/indices to get the index UUID, and then use find to verify that all directories named after that UUID are gone after the DELETE. For instance, here's me deleting an index called indexname with UUID 8M4Kpm5ERuC7q-4hGMeYBA:

$ curl 'http://localhost:9200/_cat/indices/indexname?v'
health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   indexname 8M4Kpm5ERuC7q-4hGMeYBA   1   1          0            0       208b           208b
$ find elasticsearch-7.10.0/data-0 -name 8M4Kpm5ERuC7q-4hGMeYBA
elasticsearch-7.10.0/data-0/nodes/0/indices/8M4Kpm5ERuC7q-4hGMeYBA
$ curl -XDELETE 'http://localhost:9200/indexname'
{"acknowledged":true}
$ find elasticsearch-7.10.0/data-0 -name 8M4Kpm5ERuC7q-4hGMeYBA
$ # didn't find anything

You'll need to check that on all 8 nodes. If the directory is being deleted then Elasticsearch will never create it again, so if it comes back again it's not Elasticsearch's doing.
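
If it helps, here is a rough way to run that check on every node at once, assuming you have SSH access and the data path is /var/lib/elasticsearch (the host names are placeholders; adjust both to your setup):

$ # UUID is the one reported by _cat/indices before the DELETE
$ for host in node1 node2 node3 node4 node5 node6 node7 node8; do
    echo "== $host =="
    ssh "$host" 'find /var/lib/elasticsearch -type d -name 8M4Kpm5ERuC7q-4hGMeYBA'
  done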

Also, if you upgrade to a more recent version you can list and delete dangling indices directly via APIs (see List dangling indices API | Elasticsearch Guide [8.11] | Elastic). This isn't a permanent fix, nor does it explain what's going on, but it is at least an improvement.
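
For reference, those APIs look roughly like this on a recent enough version (the UUID below is whatever the list call reports; deleting requires explicitly accepting data loss):

$ curl 'http://localhost:9200/_dangling?pretty'
$ curl -XDELETE 'http://localhost:9200/_dangling/8M4Kpm5ERuC7q-4hGMeYBA?accept_data_loss=true'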

IMPORTANT EDIT: don't infer from this that you can delete anything from the data directory yourself -- deleting an index is more than just deleting this one folder, and you should never even consider modifying the contents of the data directory by hand. What I suggest above is just observing that Elasticsearch really does delete the directory.


If we choose an odd number of master nodes then we can avoid split brain, using the minimum master nodes parameter: n/2+1 (where n is the number of master nodes).

That's why I suggested this.
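
As a worked example of that formula with the 8 master-eligible nodes from this thread (it only applies to versions before 7.0):

# pre-7.0 elasticsearch.yml only; 7.x ignores this setting
# quorum for 8 master-eligible nodes: 8/2 + 1 = 5
discovery.zen.minimum_master_nodes: 5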

I assume one node still has the index metadata, and that is why the dangling indices get created.

I have faced this myself.

This does not require an odd number of master nodes.

The discovery.zen.minimum_master_nodes setting has no effect in this version.
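
On 7.x the elected master maintains the voting configuration itself. If you want to inspect it, something like this shows the currently committed voting configuration:

$ curl 'http://localhost:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination&pretty'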

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.