Trying to recover from 'red' after full disk

Somehow our drive filled up and caused our Elasticsearch cluster to go red. I've read most of the other topics on this and none of them really explain how to fix it.

Notable errors I'm seeing are:

[2016-06-14 10:15:19,598][INFO ][gateway ] [Schemer] recovered [13] indices into cluster_state
[2016-06-14 10:15:22,827][WARN ][indices.cluster ] [Schemer] [[atp][2]] marking and sending shard failed due to [failed recovery] [atp][[atp][2]] IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: [segments_1h6h, write.lock, _1vhc.fdt, _1vhc.fdx, _1vhc.fnm, _1vhc.nvd, _1vhc.nvm, _1vhc.si, _1vhc_2.liv, _1vhc_Lucene50_0.doc, _1vhc_Lucene50_0.dvd, _1vhc_Lucene50_0.dvm, _1vhc_Lucene50_0.pos, _1vhc_Lucene50_0.tim, _1vhc_Lucene50_0.tip, _2tdm.fdt, _2tdm.fdx, _2tdm.fnm, _2tdm.nvd, _2tdm.nvm, _2tdm.si, _2tdm_33.liv, _2tdm_Lucene50_0.doc, _2tdm_Lucene50_0.dvd, _2tdm_Lucene50_0.dvm, _2tdm_Lucene50_0.pos, _2tdm_Lucene50_0.tim, _2tdm_Lucene50_0.tip, _4ecu.fdt, _4ecu.fdx, _4ecu.fnm, _4ecu.nvd, _4ecu.nvm, _4ecu.si, _4ecu_78.liv, _4ecu_Lucene50_0.doc, _4ecu_Lucene50_0.dvd, _4ecu_Lucene50_0.dvm, _4ecu_Lucene50_0.pos, _4ecu_Lucene50_0.tim, _4ecu_Lucene50_0.tip, _4mi3.cfe, _4mi3.cfs, _4mi3.si, _4mi3_1o.liv, _4rgz.cfe, _4rgz.cfs, _4rgz.si, _4rgz_3h.liv, _5180.cfe, _5180.cfs, _5180.si, _5180_2a.liv, _5932.cfe, _5932.cfs, _5932.si, _5932_1v.liv, _5c2u.cfe, _5c2u.cfs, _5c2u.si, _5c34.cfe, _5c34.cfs, _5c34.si, _5c3e.cfe, _5c3e.cfs, _5c3e.si, _5c3o.cfe, _5c3o.cfs, _5c3o.si, _5c3y.cfe, _5c3y.cfs, _5c3y.si, _5c3z.cfe, _5c3z.cfs, _5c3z.si, _5c40.cfe, _5c40.cfs, _5c40.si, _5c41.cfe, _5c41.cfs, _5c41.si, _5c42.cfe, _5c43.cfe, _5c43.cfs, _5c43.si, _l7j.fdt, _l7j.fdx, _l7j.fnm, _l7j.nvd, _l7j.nvm, _l7j.si, _l7j_5u.liv, _l7j_Lucene50_0.doc, _l7j_Lucene50_0.dvd, _l7j_Lucene50_0.dvm, _l7j_Lucene50_0.pos, _l7j_Lucene50_0.tim, _l7j_Lucene50_0.tip]]; nested: NoSuchFileException[E:\elasticsearch-2.0.0\data\elasticsearch\nodes\0\indices\atp\2\index\_5c42.si];

And then lots of

[atp][[atp][0]] IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't,

We are running 2.3 on Windows. I have no idea what I need to do at this point. I don't know how to attach logs to this post; it says I can only upload images.

Lucas

You may need to manually delete the E:\elasticsearch-2.0.0\data\elasticsearch\nodes\0\indices\atp\2 directory, but you may lose data if you do that.
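
If you do go that route, stop the node first. A rough sketch, assuming you run Elasticsearch from that install directory in a console rather than as a Windows service:

rem stop the node first (Ctrl+C in its console window, or stop the Windows service if you installed one)
rem remove the broken shard copy, then start the node again so it re-runs recovery
rmdir /s /q "E:\elasticsearch-2.0.0\data\elasticsearch\nodes\0\indices\atp\2"
E:\elasticsearch-2.0.0\bin\elasticsearch.bat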

It is getting better. It is still red, but we just have this error now (albeit over and over):

UnavailableShardsException[[atp][0] primary shard is not active

Is there a way to activate this primary shard? Or what needs to be done?

The shard is probably lost; you may have to issue a force reroute.

Ok. I've read about that. How do you do that?
I'm using Postman to do the calls.

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/cluster-reroute.html is the best resource :slight_smile:
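
With Postman that is a POST to the node's reroute endpoint (assuming it listens on the default localhost:9200), with the commands array as the JSON body, i.e. a skeleton like:

POST http://localhost:9200/_cluster/reroute
{ "commands" : [ { ... } ] }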

Yeah, I've seen that, but I have no idea what I'm doing. Am I moving? Am I allocating? Both?
Also what indexes/nodes? How do you determine any of this?
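
You can get both from the cat APIs (again assuming the default localhost:9200):

GET http://localhost:9200/_cat/nodes?v
GET http://localhost:9200/_cat/shards/atp?v

_cat/nodes lists the node names to use in reroute commands, and _cat/shards shows which copies of each atp shard are STARTED versus UNASSIGNED, which tells you which shard numbers still need a home.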

Well, I tried this:

{ "commands" : [ { "move" : { "index" : "atp", "shard" : 0, "from_node" : "0", "to_node" : "1" } } ] }'

And it said...

"root_cause": [ { "type": "unavailable_shards_exception", "reason": "[atp][2] primary shard is not active Timeout: [1m], request: [index {[atp][_cluster][reroute -d], source[{\r\n \"commands\" : [ {\r\n \"move\" :\r\n {\r\n \"index\" : \"atp\", \"shard\" : 0,\r\n \"from_node\" : \"0\", \"to_node\" : \"1\"\r\n }\r\n }\r\n ]\r\n}']}]" }

We deleted the '2' shard directory earlier.

I feel like there should be some general doc to go through: if you are in a red state and see error X, here is what you need to do...

You may need to restart node 0 unfortunately.
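
A plain "move" won't help here either, because it needs a started shard copy to move; when the primary copy is gone you force an empty primary to be allocated instead, accepting that whatever was in that shard is lost. A rough sketch for 2.x, using the node name that shows up in your logs ([Schemer]) and the default port:

POST http://localhost:9200/_cluster/reroute
{
  "commands" : [
    {
      "allocate" : {
        "index" : "atp",
        "shard" : 0,
        "node" : "Schemer",
        "allow_primary" : true
      }
    }
  ]
}

Double-check the node name against _cat/nodes before sending it, since allow_primary is irreversible for the data in that shard.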