I have two corrupted indexes that just won't delete

The other day, my servers all ran out of disk space at roughly the same time, which corrupted three indexes. One index I was able to recover, and curl 'n7-z01-0a2a29a5.iaas.starwave.com:9200/_cat/indices?v' shows it is green. However, two indexes are red, and I can't seem to get rid of them.

I tried:

[MGMTPROD\silvj170@n7mmadm02 ~]$ curl -XDELETE http://n7-z01-0a2a2723.iaas.starwave.com:9200/fnd-logstash-2015.06.01
{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}[MGMTPROD\silvj170@n7mmadm02 ~]$ 

I tried:

curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true' 
{"_shards":{"total":4760,"successful":4678,"failed":2,"failures":[{"index":"fnd-logstash-2015.05.27","shard":8,"reason":"BroadcastShardOperationFailedException[[fnd-logstash-2015.05.27][8] ]; nested: RemoteTransportException[[Solarr][inet[/]][indices:admin/optimize[s]]]; nested: FlushNotAllowedEngineException[[fnd-logstash-2015.05.27][8] recovery is in progress, flush [COMMIT_TRANSLOG] is not allowed]; "},{"index":"fnd-logstash-2015.05.27","shard":9,"reason":"BroadcastShardOperationFailedException[[fnd-logstash-2015.05.27][9] ]; nested: RemoteTransportException[[Turner D. Century][inet[/]][indices:admin/optimize[s]]]; nested: FlushNotAllowedEngineException[[fnd-logstash-2015.05.27][9] recovery is in progress, flush [COMMIT_TRANSLOG] is not allowed]; "}]}

I tried stopping all of the daemons at the same time and deleting the index files. When I restarted the daemons, the directories and some (but not all) of the files came back into existence!


[root@n7-z01-0a2a29a5 ~]# ls -l /data/apps/prod/elasticsearch/data/semfs_fnd_es/nodes/0/indices/fnd-logstash-2015.05.29/
total 4
drwxr-xr-x 2 elastic elastic 4096 Jun  3 14:53 _state
[root@n7-z01-0a2a29a5 ~]# ls -l /data/apps/prod/elasticsearch/data/semfs_fnd_es/nodes/0/indices/fnd-logstash-2015.05.29/_state
total 4
-rw-r--r-- 1 elastic elastic 1044 Jun  3 14:53 state-6
[root@n7-z01-0a2a29a5 ~]# 

Because these two indexes are red, the whole cluster is red. I ran tcpdump on all of the nodes in the cluster, and I see traffic moving very fast on port 9300. I also see lots of ESTABLISHED connections on port 9300, so the nodes appear to be talking to each other.

This used to work before my indexes got corrupted, so I think my configuration is okay. What do I do next?

Many thanks,


MasterNotDiscoveredException would indicate that not all of your nodes are part of the cluster.

What's the output from _cat/nodes?
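If the node lists disagree from host to host, that points to a node that hasn't joined the cluster. A quick way to gather each node's view is to loop over the hosts and hit _cat/nodes on each one (a sketch; the hostnames are taken from this thread and are assumptions — substitute your own):

```shell
#!/bin/sh
# Sketch: collect the _cat/nodes view from several hosts so they can be
# compared side by side. Hostnames below are assumed from this thread.
hosts="n7-z01-0a2a29a4 n7-z01-0a2a29a5 n7-z01-0a2a29a8"
for h in $hosts; do
  echo "== $h =="
  # --max-time keeps the loop moving if a node is hung; a failed request
  # is reported instead of silently skipped
  curl -s --max-time 5 "http://$h.iaas.starwave.com:9200/_cat/nodes" \
    || echo "(no response)"
done
```

A node that returns an error, or a shorter list than its peers, is the one worth investigating.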

Mark, thank you for responding, I appreciate that.

[root@n7-z01-0a2a29a4 log]# curl -XGET 'http://localhost:9200/_cat/nodes?'
n7-z01-0a2a29a8 19 16 0.00 d m Talisman          
n7-z01-0a2a2723   9 16 0.00 c - Gosamyr/tribe1    
n7-z01-0a2a2722  32 19 0.00 c - Atalon/tribe1     
n7-z01-0a2a29a2 16 15 0.29 d m Elias Bogan       
n7-z01-0a2a29a6 29 21 0.46 d * Sultan            
n7-z01-0a2a29a9 22 17 0.00 d m Annihilus         
n7-z01-0a2a29a3 22 18 0.05 d m Virgo             
n7-z01-0a2a29a7 27 18 1.15 d m Solarr            
n7-z01-0a2a29a4 21 16 0.00 d m Turner D. Century 
n7-z01-0a2a29a5 18 16 0.00 d m Aries             
[root@n7-z01-0a2a29a4 log]# 

One of the nodes is missing, and I'm at a loss to explain that: according to the initctl status elasticsearch command, it is running. Furthermore, when I run netstat -pant, I see lots and lots of connections on TCP port 9300 to a java process. I ran tcpdump on it and see a moderate number of packets going by.


I ran the curl command on the missing node, and I see the rest of the cluster:

[root@n7-z01-0a2a29a8 ~]# curl -XGET 'http://localhost:9200/_cat/nodes?'
n7-z01-0a2a29a8 21 16 0.02 d m Talisman          
n7-z01-0a2a2723  13 16 0.00 c - Gosamyr/tribe1    
n7-z01-0a2a2722  23 19 0.00 c - Atalon/tribe1     
n7-z01-0a2a29a2 19 15 0.12 d m Elias Bogan       
n7-z01-0a2a29a6 30 21 0.33 d * Sultan            
n7-z01-0a2a29a9 20 17 0.04 d m Annihilus         
n7-z01-0a2a29a3 23 18 0.00 d m Virgo             
n7-z01-0a2a29a7 26 18 1.40 d m Solarr            
n7-z01-0a2a29a4 21 16 0.54 d m Turner D. Century 
n7-z01-0a2a29a5 21 16 0.15 d m Aries             
[root@n7-z01-0a2a29a8 ~]# 


Try running the delete against the n7-z01-0a2a29a8 node.

That worked! Thank you. Now, may I impose on you to explain why it worked?

[root@n7-z01-0a2a29a8 ~]# curl -XDELETE http://localhost:9200/fnd-logstash-2015.06.01
{"acknowledged":true}[root@n7-z01-0a2a29a8 ~]# curl -XDELETE http://localhost:9200/fnd-logstash-2015.05.29
{"acknowledged":true}[root@n7-z01-0a2a29a8 ~]# curl 'localhost:9200/_cat/indices?v'
health status index                   pri rep docs.count docs.deleted store.size pri.store.size 
green  open   fnd-logstash-2015.06.02  20   1  306035762            0    596.3gb        298.1gb 
green  open   fnd-logstash-2015.05.27  20   1  281710048            0    564.6gb        282.3gb 
green  open   fnd-logstash-2015.05.31  20   1  307661198            0    635.9gb        317.9gb 
green  open   fnd-logstash-2015.05.28  20   1  278991438            0    582.6gb        291.3gb 
green  open   fnd-logstash-2015.05.30  20   1  200358182            0    430.6gb        215.3gb 
[root@n7-z01-0a2a29a8 ~]# 

My cluster is working, my sitescope monitor is green. How did you know that the missing node was important?

Again, thank you.


The error implied that the node you sent the delete to wasn't part of the cluster, so any requests routed through it would have failed.
I just grabbed a node at random from the ones listed in _cat/nodes, as those are definitely going to be part of the cluster.
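One way to confirm which nodes are actually joined is to ask each one which master it has elected via _cat/master; a node that returns an error (or names a different master) isn't properly part of the cluster, and requests sent to it will fail with MasterNotDiscoveredException. A sketch, with hostnames assumed from this thread:

```shell
#!/bin/sh
# Sketch: ask each node which master it has elected. On a healthy cluster
# every node should name the same master; an error or a different answer
# marks a node that hasn't joined. Hostnames are assumed from this thread.
hosts="n7-z01-0a2a29a4 n7-z01-0a2a29a8 n7-z01-0a2a2723"
for h in $hosts; do
  printf '%s: ' "$h"
  curl -s --max-time 5 "http://$h.iaas.starwave.com:9200/_cat/master" \
    || echo "no response"
done
```

Any node that answers with the expected master is a safe place to send index deletes.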