Indices not recovering after elasticsearch upgrade (1.0.2 -> 1.4.1)

Hi,

I just updated our test environment from 1.0.2 to 1.4.1 and some
indices failed to recover, which seems to be related to the checksum
verification introduced in 1.3.

[2014-11-28 09:40:48,019][WARN ][cluster.action.shard ] [NODE1]
[index][0] received shard failed for [index][0],
node[CWq_uCPhRKqGEAvtS1jkug], [P], s[INITIALIZING], indexUUID
[yJBShgqGQgi0q5NbMms0Sg], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[index][0] failed to fetch index
version after copying it over]; nested:
CorruptIndexException[[index][0] Preexisting corrupted index
[corrupted_JysmZSaLRXWN_BgqpRSo6Q] caused by:
CorruptIndexException[checksum failed (hardware problem?) :
expected=16ncx91 actual=1xc6e7g
resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1afc89e8)]
org.apache.lucene.index.CorruptIndexException: checksum failed
(hardware problem?) : expected=16ncx91 actual=1xc6e7g
resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1afc89e8)
at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
at org.elasticsearch.index.store.Store.verify(Store.java:365)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
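As an aside, the expected/actual values in the exception do not look like
hex CRC32s; my assumption is that they are Adler32 checksums printed in
radix 36 (the verification class in the trace is the legacy
Adler32VerifyingIndexOutput). A minimal sketch of what I mean, under that
assumption, with the file path as a placeholder:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Adler32;

// Sketch only: compute the Adler32 checksum of one segment file and print it
// the way the exception above seems to (radix-36 encoding is my assumption).
public class Adler32Check {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0])); // fine for a sketch, not for huge files
        Adler32 adler = new Adler32();
        adler.update(bytes, 0, bytes.length);
        // Values like "16ncx91" / "1xc6e7g" in the log look like this form.
        System.out.println(Long.toString(adler.getValue(), Character.MAX_RADIX));
    }
}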

In order to get the indices to recover, I checked them using
org.apache.lucene.index.CheckIndex; the indices seemed OK, as no errors
were reported. Reopening the indices did not solve the issue.
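For reference, this is roughly how I ran the check (a minimal sketch against
the Lucene 4.10.x API that ships with 1.4.1; the shard path is of course
specific to our setup):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class RunCheckIndex {
    public static void main(String[] args) throws Exception {
        // args[0]: shard index directory, e.g. .../nodes/0/indices/index/0/index
        try (FSDirectory dir = FSDirectory.open(new File(args[0]))) {
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out);            // print per-segment details
            CheckIndex.Status status = checker.checkIndex();
            System.out.println(status.clean ? "index is clean" : "index has problems");
        }
    }
}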

After deleting the checksums file as well as the corrupted_XXX marker
file, the indices finally recovered correctly. I suppose that the
verification step is then simply skipped, as there are no checksums to
compare against.
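Concretely, what I removed were the files matching the patterns below in the
shard's index directory. The sketch only lists the candidates; the path
layout and the file-name prefixes are taken from our installation, so treat
them as assumptions:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListMarkerFiles {
    public static void main(String[] args) throws IOException {
        // args[0]: e.g. /var/lib/elasticsearch/<cluster>/nodes/0/indices/index/0/index
        Path shardIndexDir = Paths.get(args[0]);
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(shardIndexDir, "{_checksums-*,corrupted_*}")) {
            for (Path file : files) {
                System.out.println("would delete: " + file);
                // Files.delete(file);  // only after backing up the shard directory
            }
        }
    }
}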

I am currently trying to understand the issue. Could it be that the
checksums file itself was corrupted? Also, while I did not see any direct
consequences of deleting the checksums files, I just want to be sure that
removing them does not cause any issues.

Any thoughts or help is greatly appreciated,
Michel


Anyone?
