Codec footer mismatch - Elasticsearch 1.7.5 won't start after RAID failure

This system suffered a RAID disk failure. After the disk was replaced, errors like the ones below appeared on several indices in the cluster. The array now reports no further errors and appears to have rebuilt successfully.

This seems very similar to https://github.com/elastic/elasticsearch/issues/11520 as well as https://github.com/elastic/elasticsearch/issues/7238.

What, if any, options do I have to restore the system and recover the data?

As a result, Elasticsearch will not even start. This is on a data-only node, but similar errors appear on the master as well. Before this I was also seeing these errors as CorruptIndexExceptions; I'm in this situation because I thought restarting might allow Elasticsearch to recover the data.

2016-10-04 16:16:59,266 [INFO ][node                     ] [eshost-2] version[1.7.5], pid[9864], build[00f95f4/2016-02-02T09:55:30Z]
2016-10-04 16:16:59,266 [INFO ][node                     ] [eshost-2] initializing ...
2016-10-04 16:16:59,348 [INFO ][env                      ] [eshost-2] using [1] data paths, mounts [[/var/lib/elasticsearch (/dev/sda)]], net usable_space [11tb], net total_space [29tb], types [xfs]
2016-10-04 16:17:01,164 [ERROR][gateway.local.state.shards] [eshost-2] failed to read local state (started shards), exiting...
org.elasticsearch.ElasticsearchException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/customer/nodes/0/indices/.datastats-customer-2016.09.26/2/_state/state-3.st")))
        at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:153)
        at org.elasticsearch.gateway.local.state.meta.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:297)

....

Caused by: org.elasticsearch.gateway.local.state.meta.CorruptStateException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/customer/nodes/0/indices/.datastats-customer-2016.09.26/2/_state/state-3.st")))

....

And it continues similarly until it reaches:

2016-10-04 16:17:01,227 [ERROR][bootstrap                ] Exception
org.elasticsearch.common.inject.CreationException: Guice creation errors:

1) Error injecting constructor, org.elasticsearch.ElasticsearchException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/customer/nodes/0/indices/.datastats-customer-2016.09.26/2/_state/state-3.st")))

Any ideas? Is there a guide for recovering from catastrophic failures anywhere?
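
For what it's worth, the only low-level check I've found so far is Lucene's CheckIndex tool, run offline against a shard's `index/` directory using the Lucene jar bundled with 1.7.5. This is just a sketch, not something I've verified on this cluster: the path below is hypothetical, and note that CheckIndex inspects segment files, not the corrupted `_state/state-3.st` file shown in the log above. The commented-out repair step drops corrupted segments outright, i.e. it loses documents.

    // Sketch only: offline CheckIndex run against one shard's Lucene index,
    // assuming ES 1.7.x / Lucene 4.10 APIs. Stop the node and copy the shard
    // directory before trying anything like this.
    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class ShardCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical shard index directory; adjust to your layout.
            File shardDir = new File(
                "/var/lib/elasticsearch/customer/nodes/0/indices/.datastats-customer-2016.09.26/2/index");
            try (Directory dir = FSDirectory.open(shardDir)) {
                CheckIndex checker = new CheckIndex(dir);
                checker.setInfoStream(System.out);          // print per-segment diagnostics
                CheckIndex.Status status = checker.checkIndex();
                System.out.println("clean=" + status.clean);
                // Last resort, and lossy: on this Lucene generation the repair
                // method is fixIndex (newer releases call it exorciseIndex), and
                // it removes any segment it considers corrupted.
                // if (!status.clean) checker.fixIndex(status);
            }
        }
    }
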

Do you have snapshots? What about replicas?

Unfortunately not. I haven't used replicas on clusters like this because I believe they cost a 10-20% reduction in indexing rate. I haven't tried snapshots either, but my indices are multiple terabytes, so I'm not sure that would be practical. I fully understand I'm trading long-term resilience for short-term performance here.

Why do you think that?

While experimenting with cluster settings, I observed roughly a 15% drop in indexing throughput when indexing with replicas versus without.
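
If it helps with that experiment: the replica count can be changed on a live index via the 1.x index settings API, so a with/without comparison doesn't need a reindex. A minimal sketch, assuming a node on localhost:9200 and a placeholder index name:

    // Sketch: toggle index.number_of_replicas via the ES 1.x _settings API
    // and compare indexing throughput at 0 vs 1 replicas. Host, port and
    // index name are placeholders.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ToggleReplicas {
        public static void main(String[] args) throws Exception {
            setReplicas("http://localhost:9200", "test-2016.10.06", 1);
        }

        static void setReplicas(String base, String index, int replicas) throws Exception {
            URL url = new URL(base + "/" + index + "/_settings");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            String body = "{\"index\":{\"number_of_replicas\":" + replicas + "}}";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println(index + " -> replicas=" + replicas
                    + ", HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
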

Coming back to the original problem: it was reproduced simply by removing a disk from the array and replacing it. The underlying filesystem is XFS. Here is the first sign of trouble observed on the master:

2016-10-06 17:39:38,699 [WARN ][index.merge.scheduler    ] [test-host-1] [test-2016.10.06][3] failed to merge
java.io.IOException: Corrupted
        at org.apache.lucene.util.packed.MonotonicBlockPackedReader.<init>(MonotonicBlockPackedReader.java:80)
        at org.apache.lucene.util.packed.MonotonicBlockPackedReader.of(MonotonicBlockPackedReader.java:60)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getAddressInstance(Lucene410DocValuesProducer.java:427)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getVariableBinary(Lucene410DocValuesProducer.java:435)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getBinary(Lucene410DocValuesProducer.java:393)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getSorted(Lucene410DocValuesProducer.java:501)
        at org.apache.lucene.codecs.lucene410.Lucene410DocValuesProducer.getSortedSet(Lucene410DocValuesProducer.java:593)
        at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.getSortedSet(PerFieldDocValuesFormat.java:297)
        at org.apache.lucene.index.SegmentReader.getSortedSetDocValues(SegmentReader.java:577)
        at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:207)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4223)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3811)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:409)
        at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:486)

2016-10-06 17:40:22,511 [WARN ][index.shard              ] [test-host-1] [test-2016.10.06][3] failed to refresh after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [test-2016.10.06][3] CurrentState[CLOSED]
        at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:289)
        at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:565)
        at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:594)
        at org.elasticsearch.index.shard.IndexShard.updateBufferSize(IndexShard.java:1079)
        at org.elasticsearch.index.shard.IndexShard.markAsInactive(IndexShard.java:1091)
        at org.elasticsearch.indices.memory.IndexingMemoryController.markShardAsInactive(IndexingMemoryController.java:405)
        at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:297)
        at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:508)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

And on the data node:

2016-10-06 20:39:23,739 [WARN ][index.merge.scheduler    ] [test-host-2] [test-2016.10.06][0] failed to merge
    java.lang.ArrayIndexOutOfBoundsException
            at org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)
            at org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)
            at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$ChunkIterator.decompress(CompressingStoredFieldsReader.java:508)
            at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:378)
            at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:332)
            at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
            at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4223)
            at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3811)
            at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:409)
            at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
            at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:486)

2016-10-06 20:39:58,692 [WARN ][index.shard              ] [test-host-2] [test-2016.10.06][0] failed to refresh after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [test-2016.10.06][0] CurrentState[CLOSED]
        at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:289)
        at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:565)
        at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:594)
        at org.elasticsearch.index.shard.IndexShard.updateBufferSize(IndexShard.java:1079)
        at org.elasticsearch.index.shard.IndexShard.markAsInactive(IndexShard.java:1091)
        at org.elasticsearch.indices.memory.IndexingMemoryController.markShardAsInactive(IndexingMemoryController.java:405)
        at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:297)
        at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:508)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I stopped Elasticsearch and restarted it. It exhibited the same behavior observed in production: it refuses to start due to a codec footer mismatch.
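
One thing I'm considering for the test setup (an idea only, not yet verified on 1.7.5) is creating the test indices with `index.shard.check_on_startup` enabled, so segment checksums are verified when a shard opens rather than the corruption surfacing later during a merge. A minimal sketch against a hypothetical local test node; the setting is static, so it has to be supplied at index creation (or applied to a closed index):

    // Sketch only: create a test index with index.shard.check_on_startup
    // set to "checksum". Host, port and index name are placeholders.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class CreateCheckedIndex {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:9200/test-2016.10.07");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            String body = "{\"settings\":{\"index.shard.check_on_startup\":\"checksum\"}}";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("create index: HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
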