I have two corrupted indexes that just won't delete


(Jeff Silverman) #1

The other day, my servers all ran out of disk space at roughly the same time, which corrupted three indexes. I was able to recover one of them: curl 'n7-z01-0a2a29a5.iaas.starwave.com:9200/_cat/indices?v' now reports it as green. However, two indexes are still red, and I can't seem to get rid of them.

I tried:

[MGMTPROD\silvj170@n7mmadm02 ~]$ curl -XDELETE http://n7-z01-0a2a2723.iaas.starwave.com:9200/fnd-logstash-2015.06.01
{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}
[MGMTPROD\silvj170@n7mmadm02 ~]$ 

I also tried:

curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true' 
{"_shards":{"total":4760,"successful":4678,"failed":2,"failures":[{"index":"fnd-logstash-2015.05.27","shard":8,"reason":"BroadcastShardOperationFailedException[[fnd-logstash-2015.05.27][8] ]; nested: RemoteTransportException[[Solarr][inet[/10.42.41.167:9300]][indices:admin/optimize[s]]]; nested: FlushNotAllowedEngineException[[fnd-logstash-2015.05.27][8] recovery is in progress, flush [COMMIT_TRANSLOG] is not allowed]; "},{"index":"fnd-logstash-2015.05.27","shard":9,"reason":"BroadcastShardOperationFailedException[[fnd-logstash-2015.05.27][9] ]; nested: RemoteTransportException[[Turner D. Century][inet[/10.42.41.164:9300]][indices:admin/optimize[s]]]; nested: FlushNotAllowedEngineException[[fnd-logstash-2015.05.27][9] recovery is in progress, flush [COMMIT_TRANSLOG] is not allowed]; "}]}

I tried stopping all of the daemons at the same time and deleting the index files. When I restarted the daemons, the directories and some (but not all) of the files came back into existence!

[root@n7-z01-0a2a29a5 ~]# ls -l /data/apps/prod/elasticsearch/data/semfs_fnd_es/nodes/0/indices/fnd-logstash-2015.05.29/
total 4
drwxr-xr-x 2 elastic elastic 4096 Jun  3 14:53 _state
[root@n7-z01-0a2a29a5 ~]# ls -l /data/apps/prod/elasticsearch/data/semfs_fnd_es/nodes/0/indices/fnd-logstash-2015.05.29/_state
total 4
-rw-r--r-- 1 elastic elastic 1044 Jun  3 14:53 state-6
[root@n7-z01-0a2a29a5 ~]# 

Because these two indexes are red, the whole cluster is red. I ran tcpdump on all of the nodes in the cluster, and I see traffic moving very fast on port 9300. I also see lots of ESTABLISHED connections on port 9300.

This used to work before my indexes got corrupted, so I think my configuration is okay. What do I do next?

Many thanks,

Jeff


(Mark Walkom) #2

MasterNotDiscoveredException would indicate that not all your nodes are part of the cluster?

What's the output from _cat/nodes?


(Jeff Silverman) #3

Mark, thank you for responding, I appreciate that.

[root@n7-z01-0a2a29a4 log]# curl -XGET 'http://localhost:9200/_cat/nodes?'
n7-z01-0a2a29a8 10.42.41.168 19 16 0.00 d m Talisman          
n7-z01-0a2a2723 10.42.39.35   9 16 0.00 c - Gosamyr/tribe1    
n7-z01-0a2a2722 10.42.39.34  32 19 0.00 c - Atalon/tribe1     
n7-z01-0a2a29a2 10.42.41.162 16 15 0.29 d m Elias Bogan       
n7-z01-0a2a29a6 10.42.41.166 29 21 0.46 d * Sultan            
n7-z01-0a2a29a9 10.42.41.169 22 17 0.00 d m Annihilus         
n7-z01-0a2a29a3 10.42.41.163 22 18 0.05 d m Virgo             
n7-z01-0a2a29a7 10.42.41.167 27 18 1.15 d m Solarr            
n7-z01-0a2a29a4 10.42.41.164 21 16 0.00 d m Turner D. Century 
n7-z01-0a2a29a5 10.42.41.165 18 16 0.00 d m Aries             
[root@n7-z01-0a2a29a4 log]# 

One of the nodes is missing, 10.42.41.168. I'm at a loss to explain that: according to the initctl status elasticsearch command, it is running. Furthermore, when I run netstat -pant, I see lots and lots of connections on TCP port 9300 to a java process. I ran tcpdump on it and I see a moderate number of packets going by.

Jeff


(Jeff Silverman) #4

I ran the curl command on the missing node, and I see the rest of the cluster:

[root@n7-z01-0a2a29a8 ~]# curl -XGET 'http://localhost:9200/_cat/nodes?'
n7-z01-0a2a29a8 10.42.41.168 21 16 0.02 d m Talisman          
n7-z01-0a2a2723 10.42.39.35  13 16 0.00 c - Gosamyr/tribe1    
n7-z01-0a2a2722 10.42.39.34  23 19 0.00 c - Atalon/tribe1     
n7-z01-0a2a29a2 10.42.41.162 19 15 0.12 d m Elias Bogan       
n7-z01-0a2a29a6 10.42.41.166 30 21 0.33 d * Sultan            
n7-z01-0a2a29a9 10.42.41.169 20 17 0.04 d m Annihilus         
n7-z01-0a2a29a3 10.42.41.163 23 18 0.00 d m Virgo             
n7-z01-0a2a29a7 10.42.41.167 26 18 1.40 d m Solarr            
n7-z01-0a2a29a4 10.42.41.164 21 16 0.54 d m Turner D. Century 
n7-z01-0a2a29a5 10.42.41.165 21 16 0.15 d m Aries             
[root@n7-z01-0a2a29a8 ~]# 

Jeff


(Mark Walkom) #5

Try running the delete against the n7-z01-0a2a29a8 node.


(Jeff Silverman) #6

That worked! Thank you. Now, may I impose on you to explain why it worked?

[root@n7-z01-0a2a29a8 ~]# curl -XDELETE http://localhost:9200/fnd-logstash-2015.06.01
{"acknowledged":true}
[root@n7-z01-0a2a29a8 ~]# curl -XDELETE http://localhost:9200/fnd-logstash-2015.05.29
{"acknowledged":true}
[root@n7-z01-0a2a29a8 ~]# curl 'localhost:9200/_cat/indices?v'
health status index                   pri rep docs.count docs.deleted store.size pri.store.size 
green  open   fnd-logstash-2015.06.02  20   1  306035762            0    596.3gb        298.1gb 
green  open   fnd-logstash-2015.05.27  20   1  281710048            0    564.6gb        282.3gb 
green  open   fnd-logstash-2015.05.31  20   1  307661198            0    635.9gb        317.9gb 
green  open   fnd-logstash-2015.05.28  20   1  278991438            0    582.6gb        291.3gb 
green  open   fnd-logstash-2015.05.30  20   1  200358182            0    430.6gb        215.3gb 
[root@n7-z01-0a2a29a8 ~]# 

My cluster is working, my sitescope monitor is green. How did you know that the missing node was important?

Again, thank you.

Jeff


(Mark Walkom) #7

The error implied that the node you sent the request to wasn't part of the cluster, so any delete request sent to it would have failed.
I just grabbed a node at random from the ones listed in _cat/nodes, as they are definitely going to be part of the cluster.
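
For anyone who wants to script the same check, here is a minimal sketch. It assumes the `_cat/nodes` column layout shown earlier in this thread (host, ip, heap, ram, load, node.role, master, name) and no `?v` header row. The helper names `data_node_hosts` and `elected_master_host` are hypothetical. Skipping rows with role `c` is also an assumption, based only on the observation that the node that returned MasterNotDiscoveredException (n7-z01-0a2a2723) is listed above as a client/tribe node.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of Elasticsearch.

# Read `_cat/nodes` text output on stdin and print the host name (column 1)
# of every node whose node.role column (column 6) is "d" (data node).
# Rows with role "c" (client), such as the two tribe nodes above, are skipped.
data_node_hosts() {
    awk '$6 == "d" { print $1 }'
}

# Print the host name of the elected master: the row whose master
# column (column 7) is "*".
elected_master_host() {
    awk '$7 == "*" { print $1 }'
}

# Example use against a live cluster (commented out, nothing is contacted here):
#   target=$(curl -s 'http://localhost:9200/_cat/nodes' | data_node_hosts | head -n 1)
#   curl -XDELETE "http://${target}:9200/fnd-logstash-2015.06.01"
```

The short host names printed by `_cat/nodes` may need a domain suffix before they resolve from an admin box, so treat the commented-out lines as an outline rather than a ready-made command.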

