I was running an expensive query. While it ran, I saw heap usage reach 100% and Kibana returned "Gateway timeout" error messages. After that, the cluster stayed unresponsive to any query for quite some time, so I tried rebooting to recover the service. However, recovery gets stuck at 136 of 273 shards. No matter how many times I reboot, it always stops at 136 active shards. I suspect there is a corrupt shard that it is unable to recover.
I cannot find anything in the logs that tells me which shard is failing to recover.
I have a 3-node setup. All nodes show log entries like these:
[2016-01-05 16:09:00,154][WARN ][index.translog ] [ec-dyl09026app02] [logstash-retailash-webserver-2015.12.29][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-retailash-webserver-2015.12.29/1/translog/translog-2495597384046063181.tlog
[2016-01-05 16:09:00,503][WARN ][index.translog ] [ec-dyl09026app02] [logstash-retailash-webserver-2015.12.29][2] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-retailash-webserver-2015.12.29/2/translog/translog-4130159341579703850.tlog
[2016-01-05 16:09:01,212][WARN ][index.translog ] [ec-dyl09026app02] [logstash-vasfulfilmenthelpdesk-helpdesklogs-2015.12.30][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-vasfulfilmenthelpdesk-helpdesklogs-2015.12.30/1/translog/translog-3148610471848122626.tlog
[2016-01-05 16:09:02,036][WARN ][index.translog ] [ec-dyl09026app02] [logstash-vasfulfilmenthelpdesk-helpdesklogs-2015.12.30][4] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-vasfulfilmenthelpdesk-helpdesklogs-2015.12.30/4/translog/translog-1595459646755510808.tlog
[2016-01-05 16:09:02,640][WARN ][index.translog ] [ec-dyl09026app02] [logstash-haproxy-vasf-2015.12.30][0] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-haproxy-vasf-2015.12.30/0/translog/translog-2130238499883333182.tlog
Another node:
[2016-01-05 16:09:28,610][WARN ][index.translog ] [ec-dyl09026app04] [logstash-haproxy-vasf-2015.12.30][4] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-haproxy-vasf-2015.12.30/4/translog/translog-9186999961065966729.tlog
[2016-01-05 16:09:29,158][WARN ][index.translog ] [ec-dyl09026app04] [logstash-haproxy-vasf-2015.12.30][2] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-haproxy-vasf-2015.12.30/2/translog/translog-2691222564772812976.tlog
[2016-01-05 16:09:29,891][WARN ][index.translog ] [ec-dyl09026app04] [.kibana][0] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/.kibana/0/translog/translog-2453105138916241867.tlog
Another node:
[2016-01-05 16:07:49,405][WARN ][index.translog ] [ec-dyl09026app03] [logstash-retailash-webserver-2016.01.05][4] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-retailash-webserver-2016.01.05/4/translog/translog-7426401796978042194.tlog
[2016-01-05 16:07:49,555][WARN ][index.translog ] [ec-dyl09026app03] [logstash-retailash-webserver-2016.01.05][3] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-retailash-webserver-2016.01.05/3/translog/translog-3697426850331880928.tlog
[2016-01-05 16:07:50,279][WARN ][index.translog ] [ec-dyl09026app03] [logstash-vasfulfilmenthelpdesk-helpdesklogs-2016.01.05][4] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-vasfulfilmenthelpdesk-helpdesklogs-2016.01.05/4/translog/translog-7172549014729073989.tlog
[2016-01-05 16:07:50,499][WARN ][index.translog ] [ec-dyl09026app03] [logstash-vasfulfilmenthelpdesk-helpdesklogs-2016.01.05][2] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/logstash-vasfulfilmenthelpdesk-helpdesklogs-2016.01.05/2/translog/translog-3235522010474983188.tlog
I can see that all of these indices are open and in yellow state.
There's one in red:
red open topbeat-2016.01.05 5 1
It looks like this index is what is preventing the cluster from recovering.
Is there a way to investigate further what is wrong with it?
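For reference, here is a minimal sketch of how the stuck shards of that index could be listed. It assumes a node is reachable over HTTP at http://localhost:9200 (my assumption, adjust to your host) and uses the `_cat/shards` API with the `index`, `shard`, `prirep`, `state` and `node` columns. The helper names `fetch_cat_shards` and `non_started_shards` are made up for illustration. It only reports which shards are not in the STARTED state; it does not explain why.

```python
import urllib.request


def non_started_shards(cat_shards_text):
    """Return (index, shard, prirep, state) for every shard that is not STARTED.

    Expects the plain-text output of _cat/shards where the first four
    columns are index, shard, prirep and state. Unassigned shards have
    no node column, so only the first four fields are read.
    """
    problems = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        index, shard, prirep, state = fields[:4]
        if state != "STARTED":
            problems.append((index, shard, prirep, state))
    return problems


def fetch_cat_shards(base_url, index):
    """Fetch the _cat/shards listing for one index as text."""
    url = "{}/_cat/shards/{}?h=index,shard,prirep,state,node".format(base_url, index)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Assumed address; replace with one of the cluster nodes.
    text = fetch_cat_shards("http://localhost:9200", "topbeat-2016.01.05")
    for row in non_started_shards(text):
        print(*row)
```

Any primary (`p`) shard listed as UNASSIGNED or stuck in INITIALIZING would be a candidate for the one blocking recovery.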