Dangling Index after running Curator


(James Barwick) #1

I have a situation where deleted indexes are not removed from all nodes. As a result, when ES is restarted, the leftover shard is imported as a dangling index, which ends up UNASSIGNED, puts the cluster into "red" status, and then alarm bells start going off and things stop working.

I run filebeat/metricbeat and Curator. Curator deletes all "metricbeat" indexes older than 30 days and runs every night. My cluster stays green for days (not months, and this is a problem).

Then, with it already being July, ES may restart for some reason (a config update or some other cycle), and I get UNASSIGNED metricbeat indexes from January.

I then run Curator again, or manually delete these old indexes, and the cluster goes back to green. (As mentioned in other tickets, Curator won't necessarily see them if they are UNASSIGNED.)

How can I run a program (which I am happy to run on all nodes) that deletes from disk all index shards that don't have a corresponding index in ES, or that do have an index but are unassigned due to dangling_import? Any idea how to go about this? (I need to do both; I need to clean up the disk since ES is not deleting the files.)

(By the way, I saw the other topic regarding this from Feb 21, but this question isn't about Curator; it is about how I manually go about deleting these dangling indexes from the disk. They DO exist. If there is no fix for ES during index DELETE, then we need a workaround. So, how do we build the workaround?)


(Christian Dahlqvist) #2

It is strange that they do not get deleted across the cluster. What is the output of the cluster health and cat nodes APIs?


(James Barwick) #3
root@monitor:~# curl -XGET localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "perx-application",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 1373,
  "active_shards" : 2850,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

and

root@monitor:~# curl -XGET localhost:9200/_cat/nodes
172.16.15.182 55 83 10 1.02 1.02 0.81 mdi - monitor3
172.16.15.217 52 98  8 0.52 1.06 1.77 mdi - monitor1
172.16.16.190 56 97  8 0.46 0.56 0.66 mdi - monitor2
172.16.15.14   8 95  3 0.01 0.09 0.08 i   - monitor
172.16.16.173 56 98 19 0.87 1.14 1.17 mdi * monitor4

So, everything is green. But I am pretty sure those 'deleted' shards are still on the disk. They just keep coming back.

So, what I want to do is examine the disk, find a shard file, figure out whether it has an index in ES, and if not, delete it. I'd like to just manually delete all dangling indexes, but I don't know how to match the hashed directory name with an index name.
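The "hash" in the directory name is actually the index UUID, and `GET /_cat/indices?h=index,uuid` returns both columns, so the mapping can be done with a tiny lookup. A sketch, assuming the cat output has been saved to a file (the `uuid_to_index` helper name is my own, not an ES API):

```shell
#!/usr/bin/env bash
# uuid_to_index UUID CAT_FILE
# Given a directory name (an index UUID) and a file containing the output of
#   curl -s 'localhost:9200/_cat/indices?h=index,uuid'
# print the matching index name. No output means no live index claims that
# UUID, i.e. the directory is a dangling-index candidate.
uuid_to_index() {
  local uuid="$1" cat_file="$2"
  awk -v u="$uuid" '$2 == u { print $1 }' "$cat_file"
}
```

Running it over every directory under the node's `indices/` path gives you the name for each, and flags the nameless ones for cleanup.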


(Christian Dahlqvist) #4

As you have 4 master-eligible nodes, do you have minimum_master_nodes set to 3 in order to avoid split brain scenarios according to these guidelines?


(James Barwick) #5

Here is the relevant config on ALL nodes (configured by Chef)...

# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: [ monitor, monitor1, monitor2, monitor3, monitor4 ]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 3
#
# For more information, see the documentation at:
#

(Christian Dahlqvist) #6

That is good and rules that out as a potential cause. Which version of Elasticsearch are you running?


(andy_zhou) #7

I think you may have had one data node offline for several days.
You could test this:
restart all nodes and see whether the deleted indices come back or not.
Delete the indices with curl and test again.
Also check whether the indices are daily (xxx-2018.07.10) or monthly (xxx-2018.07).


(James Barwick) #8

I will test again, but it won't be today!!


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.