Trying to recover from 'red' after full disk

Somehow our drive filled up and caused our Elasticsearch cluster to go red. I've read most of the other topics on this and none of them really explain how to fix it.

Notable errors I'm seeing are:

[2016-06-14 10:15:19,598][INFO ][gateway ] [Schemer] recovered [13] indices into cluster_state
[2016-06-14 10:15:22,827][WARN ][indices.cluster ] [Schemer] [[atp][2]] marking and sending shard failed due to [failed recovery] [atp][[atp][2]] IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: [segments_1h6h, write.lock, _1vhc.fdt, _1vhc.fdx, _1vhc.fnm, _1vhc.nvd, _1vhc.nvm, _1vhc.si, _1vhc_2.liv, _1vhc_Lucene50_0.doc, _1vhc_Lucene50_0.dvd, _1vhc_Lucene50_0.dvm, _1vhc_Lucene50_0.pos, _1vhc_Lucene50_0.tim, _1vhc_Lucene50_0.tip, _2tdm.fdt, _2tdm.fdx, _2tdm.fnm, _2tdm.nvd, _2tdm.nvm, _2tdm.si, _2tdm_33.liv, _2tdm_Lucene50_0.doc, _2tdm_Lucene50_0.dvd, _2tdm_Lucene50_0.dvm, _2tdm_Lucene50_0.pos, _2tdm_Lucene50_0.tim, _2tdm_Lucene50_0.tip, _4ecu.fdt, _4ecu.fdx, _4ecu.fnm, _4ecu.nvd, _4ecu.nvm, _4ecu.si, _4ecu_78.liv, _4ecu_Lucene50_0.doc, _4ecu_Lucene50_0.dvd, _4ecu_Lucene50_0.dvm, _4ecu_Lucene50_0.pos, _4ecu_Lucene50_0.tim, _4ecu_Lucene50_0.tip, _4mi3.cfe, _4mi3.cfs, _4mi3.si, _4mi3_1o.liv, _4rgz.cfe, _4rgz.cfs, _4rgz.si, _4rgz_3h.liv, _5180.cfe, _5180.cfs, _5180.si, _5180_2a.liv, _5932.cfe, _5932.cfs, _5932.si, _5932_1v.liv, _5c2u.cfe, _5c2u.cfs, _5c2u.si, _5c34.cfe, _5c34.cfs, _5c34.si, _5c3e.cfe, _5c3e.cfs, _5c3e.si, _5c3o.cfe, _5c3o.cfs, _5c3o.si, _5c3y.cfe, _5c3y.cfs, _5c3y.si, _5c3z.cfe, _5c3z.cfs, _5c3z.si, _5c40.cfe, _5c40.cfs, _5c40.si, _5c41.cfe, _5c41.cfs, _5c41.si, _5c42.cfe, _5c43.cfe, _5c43.cfs, _5c43.si, _l7j.fdt, _l7j.fdx, _l7j.fnm, _l7j.nvd, _l7j.nvm, _l7j.si, _l7j_5u.liv, _l7j_Lucene50_0.doc, _l7j_Lucene50_0.dvd, _l7j_Lucene50_0.dvm, _l7j_Lucene50_0.pos, _l7j_Lucene50_0.tim, _l7j_Lucene50_0.tip]]; nested: NoSuchFileException[E:\elasticsearch-2.0.0\data\elasticsearch\nodes\0\indices\atp\2\index\_5c42.si];

And then lots of

[atp][[atp][0]] IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't,

We are running 2.3 on Windows. I have no idea what I need to do at this point. I don't know how to attach logs to this post; it says I can only upload images.

Lucas

You may need to manually delete the E:\elasticsearch-2.0.0\data\elasticsearch\nodes\0\indices\atp\2 directory, but you may lose data if you do that.
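
If you do go that route, stop the node first. A rough sketch, assuming you run Elasticsearch from that install directory in a console rather than as a Windows service:

rem stop the node first (Ctrl+C in its console window, or stop the Windows service if you installed one)
rem remove the broken shard copy, then start the node again so it re-runs recovery
rmdir /s /q "E:\elasticsearch-2.0.0\data\elasticsearch\nodes\0\indices\atp\2"
E:\elasticsearch-2.0.0\bin\elasticsearch.bat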

It is getting better. It is still red, but we just have this error now (albeit over and over):

UnavailableShardsException[[atp][0] primary shard is not active

Is there a way to activate this primary shard? Or what needs to be done?

The shard is probably lost; you may have to issue a force reroute.

Ok. I've read about that. How do you do that?
I'm using Postman to do the calls.

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/cluster-reroute.html is the best resource :slight_smile:
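
With Postman that is a POST to the node's reroute endpoint (assuming it listens on the default localhost:9200), with the commands array as the JSON body, i.e. a skeleton like:

POST http://localhost:9200/_cluster/reroute
{ "commands" : [ { ... } ] }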

Yeah, I've seen that, but I have no idea what I'm doing. Am I moving? Am I allocating? Both?
Also what indexes/nodes? How do you determine any of this?
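
You can get both from the cat APIs (again assuming the default localhost:9200):

GET http://localhost:9200/_cat/nodes?v
GET http://localhost:9200/_cat/shards/atp?v

_cat/nodes lists the node names to use in reroute commands, and _cat/shards shows which copies of each atp shard are STARTED versus UNASSIGNED, which tells you which shard numbers still need a home.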

Well, I tried this:

{ "commands" : [ { "move" : { "index" : "atp", "shard" : 0, "from_node" : "0", "to_node" : "1" } } ] }'

And it said...

"root_cause": [ { "type": "unavailable_shards_exception", "reason": "[atp][2] primary shard is not active Timeout: [1m], request: [index {[atp][_cluster][reroute -d], source[{\r\n \"commands\" : [ {\r\n \"move\" :\r\n {\r\n \"index\" : \"atp\", \"shard\" : 0,\r\n \"from_node\" : \"0\", \"to_node\" : \"1\"\r\n }\r\n }\r\n ]\r\n}']}]" }

We deleted the '2' shard directory earlier.

I feel like there should be some general doc to go through: if you are in a red state and see error X, here is what you need to do...

You may need to restart node 0 unfortunately.
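
A plain "move" won't help here either, because it needs a started shard copy to move; when the primary copy is gone you force an empty primary to be allocated instead, accepting that whatever was in that shard is lost. A rough sketch for 2.x, using the node name that shows up in your logs ([Schemer]) and the default port:

POST http://localhost:9200/_cluster/reroute
{
  "commands" : [
    {
      "allocate" : {
        "index" : "atp",
        "shard" : 0,
        "node" : "Schemer",
        "allow_primary" : true
      }
    }
  ]
}

Double-check the node name against _cat/nodes before sending it, since allow_primary is irreversible for the data in that shard.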