I have two corrupted indexes that just won't delete

The other day, my servers all ran out of disk space at roughly the same time. That corrupted three indexes. One index I was able to recover, and according to curl 'n7-z01-0a2a29a5.iaas.starwave.com:9200/_cat/indices?v' it is now green. However, two indexes are red, and I can't seem to get rid of them.

I tried:

[MGMTPROD\silvj170@n7mmadm02 ~]$ curl -XDELETE http://n7-z01-0a2a2723.iaas.starwave.com:9200/fnd-logstash-2015.06.01
{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}[MGMTPROD\silvj170@n7mmadm02 ~]$ 

I tried:

curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true' 
{"_shards":{"total":4760,"successful":4678,"failed":2,"failures":[{"index":"fnd-logstash-2015.05.27","shard":8,"reason":"BroadcastShardOperationFailedException[[fnd-logstash-2015.05.27][8] ]; nested: RemoteTransportException[[Solarr][inet[/10.42.41.167:9300]][indices:admin/optimize[s]]]; nested: FlushNotAllowedEngineException[[fnd-logstash-2015.05.27][8] recovery is in progress, flush [COMMIT_TRANSLOG] is not allowed]; "},{"index":"fnd-logstash-2015.05.27","shard":9,"reason":"BroadcastShardOperationFailedException[[fnd-logstash-2015.05.27][9] ]; nested: RemoteTransportException[[Turner D. Century][inet[/10.42.41.164:9300]][indices:admin/optimize[s]]]; nested: FlushNotAllowedEngineException[[fnd-logstash-2015.05.27][9] recovery is in progress, flush [COMMIT_TRANSLOG] is not allowed]; "}]}

I tried stopping all of the daemons at the same time and deleting the index files on disk. Then I restarted the daemons, and the directories and some (but not all) of the files came back into existence!

[root@n7-z01-0a2a29a5 ~]# ls -l /data/apps/prod/elasticsearch/data/semfs_fnd_es/nodes/0/indices/fnd-logstash-2015.05.29/
total 4
drwxr-xr-x 2 elastic elastic 4096 Jun  3 14:53 _state
[root@n7-z01-0a2a29a5 ~]# ls -l /data/apps/prod/elasticsearch/data/semfs_fnd_es/nodes/0/indices/fnd-logstash-2015.05.29/_state
total 4
-rw-r--r-- 1 elastic elastic 1044 Jun  3 14:53 state-6
[root@n7-z01-0a2a29a5 ~]# 

Because these two indexes are red, the whole cluster is red. I ran tcpdump on all of the nodes in the cluster, and I see a high volume of traffic on port 9300. I also see lots of ESTABLISHED connections on port 9300.
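
For what it's worth, the specific shards that are keeping the cluster red should be visible with the _cat/shards API; something like this (any node in the cluster should answer):

# List only the shards that are not STARTED, i.e. the unassigned or initializing ones
curl 'http://localhost:9200/_cat/shards?v' | grep -v STARTED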

This used to work before my indexes got corrupted, so I think my configuration is okay. What do I do next?

Many thanks,

Jeff

MasterNotDiscoveredException would indicate that not all your nodes are part of the cluster?

What's the output from _cat/nodes?
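
Something like this should do it (the v parameter just adds the column headers):

curl 'http://localhost:9200/_cat/nodes?v'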

Mark, thank you for responding, I appreciate that.

[root@n7-z01-0a2a29a4 log]# curl -XGET 'http://localhost:9200/_cat/nodes?'
n7-z01-0a2a29a8 10.42.41.168 19 16 0.00 d m Talisman          
n7-z01-0a2a2723 10.42.39.35   9 16 0.00 c - Gosamyr/tribe1    
n7-z01-0a2a2722 10.42.39.34  32 19 0.00 c - Atalon/tribe1     
n7-z01-0a2a29a2 10.42.41.162 16 15 0.29 d m Elias Bogan       
n7-z01-0a2a29a6 10.42.41.166 29 21 0.46 d * Sultan            
n7-z01-0a2a29a9 10.42.41.169 22 17 0.00 d m Annihilus         
n7-z01-0a2a29a3 10.42.41.163 22 18 0.05 d m Virgo             
n7-z01-0a2a29a7 10.42.41.167 27 18 1.15 d m Solarr            
n7-z01-0a2a29a4 10.42.41.164 21 16 0.00 d m Turner D. Century 
n7-z01-0a2a29a5 10.42.41.165 18 16 0.00 d m Aries             
[root@n7-z01-0a2a29a4 log]# 

One of the nodes is missing, 10.42.41.168. I'm at a loss to explain that; according to initctl status elasticsearch, the daemon is running. Furthermore, when I run netstat -pant, I see lots and lots of connections on TCP port 9300 to a java process. I ran tcpdump on that node and I see a moderate number of packets going by.

Jeff

I ran the curl command on the missing node, and I see the rest of the cluster:

[root@n7-z01-0a2a29a8 ~]# curl -XGET 'http://localhost:9200/_cat/nodes?'
n7-z01-0a2a29a8 10.42.41.168 21 16 0.02 d m Talisman          
n7-z01-0a2a2723 10.42.39.35  13 16 0.00 c - Gosamyr/tribe1    
n7-z01-0a2a2722 10.42.39.34  23 19 0.00 c - Atalon/tribe1     
n7-z01-0a2a29a2 10.42.41.162 19 15 0.12 d m Elias Bogan       
n7-z01-0a2a29a6 10.42.41.166 30 21 0.33 d * Sultan            
n7-z01-0a2a29a9 10.42.41.169 20 17 0.04 d m Annihilus         
n7-z01-0a2a29a3 10.42.41.163 23 18 0.00 d m Virgo             
n7-z01-0a2a29a7 10.42.41.167 26 18 1.40 d m Solarr            
n7-z01-0a2a29a4 10.42.41.164 21 16 0.54 d m Turner D. Century 
n7-z01-0a2a29a5 10.42.41.165 21 16 0.15 d m Aries             
[root@n7-z01-0a2a29a8 ~]# 

Jeff

Try running the delete against the n7-z01-0a2a29a8 node.

That worked! Thank you. Now, may I impose on you to explain why it worked?

[root@n7-z01-0a2a29a8 ~]# curl -XDELETE http://localhost:9200/fnd-logstash-2015.06.01
{"acknowledged":true}[root@n7-z01-0a2a29a8 ~]# curl -XDELETE http://localhost:9200/fnd-logstash-2015.05.29
{"acknowledged":true}[root@n7-z01-0a2a29a8 ~]# curl 'localhost:9200/_cat/indices?v'
health status index                   pri rep docs.count docs.deleted store.size pri.store.size 
green  open   fnd-logstash-2015.06.02  20   1  306035762            0    596.3gb        298.1gb 
green  open   fnd-logstash-2015.05.27  20   1  281710048            0    564.6gb        282.3gb 
green  open   fnd-logstash-2015.05.31  20   1  307661198            0    635.9gb        317.9gb 
green  open   fnd-logstash-2015.05.28  20   1  278991438            0    582.6gb        291.3gb 
green  open   fnd-logstash-2015.05.30  20   1  200358182            0    430.6gb        215.3gb 
[root@n7-z01-0a2a29a8 ~]# 

My cluster is working, and my SiteScope monitor is green. How did you know that the missing node was important?

Again, thank you.

Jeff

The error implied that the node you sent the request to wasn't part of the cluster, so any delete request sent to it would have failed.
I just grabbed a node at random from the ones listed in _cat/nodes, as those are definitely going to be part of the cluster.
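
If you ever need to confirm that sort of thing again, one rough way (assuming the hostnames from this thread resolve from wherever you run it) is to ask every node which master it currently sees; a node that has dropped out of the cluster will either name a different master or return the same MasterNotDiscoveredException:

# Ask each data node which master it believes in; hostnames are the ones from this thread
for h in n7-z01-0a2a29a2 n7-z01-0a2a29a3 n7-z01-0a2a29a4 n7-z01-0a2a29a5 \
         n7-z01-0a2a29a6 n7-z01-0a2a29a7 n7-z01-0a2a29a8 n7-z01-0a2a29a9; do
  echo -n "$h: "
  curl -s "http://$h.iaas.starwave.com:9200/_cat/master"
done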