Codec footer mismatch - Elasticsearch 1.7.5 won't start after RAID failure

This system suffered a RAID disk failure. After the disk was replaced, errors like the ones below appeared on several indices in the cluster. The array now reports no further errors and appears to have rebuilt successfully.

This seems very similar to https://github.com/elastic/elasticsearch/issues/11520 as well as https://github.com/elastic/elasticsearch/issues/7238.

What, if any, options do I have to restore the system and recover the data?

As a result, Elasticsearch will not even start. This is on a data-only node, but similar errors appear on the master as well. Before this I was also seeing these errors as CorruptIndexExceptions; I'm in this situation because I thought restarting might allow Elasticsearch to recover the data.

2016-10-04 16:16:59,266 [INFO ][node                     ] [eshost-2] version[1.7.5], pid[9864], build[00f95f4/2016-02-02T09:55:30Z]
2016-10-04 16:16:59,266 [INFO ][node                     ] [eshost-2] initializing ...
2016-10-04 16:16:59,348 [INFO ][env                      ] [eshost-2] using [1] data paths, mounts [[/var/lib/elasticsearch (/dev/sda)]], net usable_space [11tb], net total_space [29tb], types [xfs]
2016-10-04 16:17:01,164 [ERROR][gateway.local.state.shards] [eshost-2] failed to read local state (started shards), exiting...
org.elasticsearch.ElasticsearchException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/customer/nodes/0/indices/.datastats-customer-2016.09.26/2/_state/state-3.st")))
        at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:153)
        at org.elasticsearch.gateway.local.state.meta.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:297)

....

Caused by: org.elasticsearch.gateway.local.state.meta.CorruptStateException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/customer/nodes/0/indices/.datastats-customer-2016.09.26/2/_state/state-3.st")))

....

And it continues similarly until it reaches:

2016-10-04 16:17:01,227 [ERROR][bootstrap                ] Exception
org.elasticsearch.common.inject.CreationException: Guice creation errors:

1) Error injecting constructor, org.elasticsearch.ElasticsearchException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/customer/nodes/0/indices/.datastats-customer-2016.09.26/2/_state/state-3.st")))

Any ideas? Is there a guide for recovering from catastrophic failures anywhere?
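
For what it's worth, the only low-level check I've found so far is Lucene's CheckIndex tool, run offline against a shard's `index/` directory using the Lucene jar bundled with 1.7.5. This is just a sketch, not something I've verified on this cluster: the path below is hypothetical, and note that CheckIndex inspects segment files, not the corrupted `_state/state-3.st` file shown in the log above. The commented-out repair step drops corrupted segments outright, i.e. it loses documents.

    // Sketch only: offline CheckIndex run against one shard's Lucene index,
    // assuming ES 1.7.x / Lucene 4.10 APIs. Stop the node and copy the shard
    // directory before trying anything like this.
    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class ShardCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical shard index directory; adjust to your layout.
            File shardDir = new File(
                "/var/lib/elasticsearch/customer/nodes/0/indices/.datastats-customer-2016.09.26/2/index");
            try (Directory dir = FSDirectory.open(shardDir)) {
                CheckIndex checker = new CheckIndex(dir);
                checker.setInfoStream(System.out);          // print per-segment diagnostics
                CheckIndex.Status status = checker.checkIndex();
                System.out.println("clean=" + status.clean);
                // Last resort, and lossy: on this Lucene generation the repair
                // method is fixIndex (newer releases call it exorciseIndex), and
                // it removes any segment it considers corrupted.
                // if (!status.clean) checker.fixIndex(status);
            }
        }
    }
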

Do you have snapshots? What about replicas?

Unfortunately not. I haven't used replicas on clusters like this because I believe they cost a 10-20% reduction in indexing rate. I haven't tried snapshots either, but my indices are multiple terabytes, so I'm not sure that would be practical. I fully understand I'm trading long-term resilience for short-term performance here.

Why do you think that?

While experimenting with cluster settings, I observed roughly a 15% drop in indexing throughput when indexing with replicas versus without.
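
If it helps with that experiment: the replica count can be changed on a live index via the 1.x index settings API, so a with/without comparison doesn't need a reindex. A minimal sketch, assuming a node on localhost:9200 and a placeholder index name:

    // Sketch: toggle index.number_of_replicas via the ES 1.x _settings API
    // and compare indexing throughput at 0 vs 1 replicas. Host, port and
    // index name are placeholders.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ToggleReplicas {
        public static void main(String[] args) throws Exception {
            setReplicas("http://localhost:9200", "test-2016.10.06", 1);
        }

        static void setReplicas(String base, String index, int replicas) throws Exception {
            URL url = new URL(base + "/" + index + "/_settings");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            String body = "{\"index\":{\"number_of_replicas\":" + replicas + "}}";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println(index + " -> replicas=" + replicas
                    + ", HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
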

Coming back to the original problem: it was reproduced simply by removing a disk from the array and replacing it. The underlying filesystem is XFS. Here is the first sign of trouble observed on the master:

2016-10-06 17:39:38,699 [WARN ][index.merge.scheduler    ] [test-host-1] [test-2016.10.06][3] failed to merge
java.io.IOException: Corrupted
        at org.apache.lucene.util.packed.MonotonicBlockPackedReader.<init>(MonotonicBlockPackedReader.java:80)
        at org.apache.lucene.util.packed.MonotonicBlockPackedReader.of(MonotonicBlockPackedReader.java:60)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getAddressInstance(Lucene410DocValuesProducer.java:427)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getVariableBinary(Lucene410DocValuesProducer.java:435)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getBinary(Lucene410DocValuesProducer.java:393)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getSorted(Lucene410DocValuesProducer.java:501)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getSortedSet(Lucene410DocValuesProducer.java:593)
        at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.getSortedSet(PerFieldDocValuesFormat.java:297)
        at org.apache.lucene.index.SegmentReader.getSortedSetDocValues(SegmentReader.java:577)
        at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:207)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4223)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3811)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:409)
        at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:486)

2016-10-06 17:40:22,511 [WARN ][index.shard              ] [test-host-1] [test-2016.10.06][3] failed to refresh after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [test-2016.10.06][3] CurrentState[CLOSED]
        at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:289)
        at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:565)
        at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:594)
        at org.elasticsearch.index.shard.IndexShard.updateBufferSize(IndexShard.java:1079)
        at org.elasticsearch.index.shard.IndexShard.markAsInactive(IndexShard.java:1091)
        at org.elasticsearch.indices.memory.IndexingMemoryController.markShardAsInactive(IndexingMemoryController.java:405)
        at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:297)
        at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:508)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

And on the data node:

2016-10-06 20:39:23,739 [WARN ][index.merge.scheduler    ] [test-host-2] [test-2016.10.06][0] failed to merge
    java.lang.ArrayIndexOutOfBoundsException
            at org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)
            at org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)
            at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$ChunkIterator.decompress(CompressingStoredFieldsReader.java:508)
            at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:378)
            at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:332)
            at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
            at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4223)
            at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3811)
            at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:409)
            at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
            at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:486)

2016-10-06 20:39:58,692 [WARN ][index.shard              ] [test-host-2] [test-2016.10.06][0] failed to refresh after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [test-2016.10.06][0] CurrentState[CLOSED]
        at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:289)
        at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:565)
        at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:594)
        at org.elasticsearch.index.shard.IndexShard.updateBufferSize(IndexShard.java:1079)
        at org.elasticsearch.index.shard.IndexShard.markAsInactive(IndexShard.java:1091)
        at org.elasticsearch.indices.memory.IndexingMemoryController.markShardAsInactive(IndexingMemoryController.java:405)
        at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:297)
        at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:508)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I stopped Elasticsearch and restarted it. It exhibited the same behavior observed in production: it refuses to start due to a codec footer mismatch.
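
One thing I'm considering for the test setup (an idea only, not yet verified on 1.7.5) is creating the test indices with `index.shard.check_on_startup` enabled, so segment checksums are verified when a shard opens rather than the corruption surfacing later during a merge. A minimal sketch against a hypothetical local test node; the setting is static, so it has to be supplied at index creation (or applied to a closed index):

    // Sketch only: create a test index with index.shard.check_on_startup
    // set to "checksum". Host, port and index name are placeholders.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class CreateCheckedIndex {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:9200/test-2016.10.07");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            String body = "{\"settings\":{\"index.shard.check_on_startup\":\"checksum\"}}";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("create index: HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
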