Corrupted Elasticsearch index

My Elasticsearch instance (5.6.1) stopped indexing because of the following error:

[2018-06-13T16:39:07,413][ERROR][o.e.i.e.InternalEngine$EngineMergeScheduler] [tracer-node-1] [tracer-default-logs-network-2018.06.13][0] failed to merge org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=e0722284 actual=829b88e5 .....

When I looked into the index folder on the filesystem, I found the following file:

-rw-r--r-- 1 elasticsearch elasticsearch 2.1K Jun 13 16:39 corrupted_hhX5NOZcSAqV8Mir0_Lfjw

Its content:

 ?×l^W^Estore^@^@^@^Bù^O^A^Q^A&failed engine (reason: [merge failed])^A^A^AGchecksum failed (hardware problem?) : expected=e0722284 actual=829b88e5^A<8c>^ABufferedChecksumIndexInput(MMapIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/F_s-K-1YSFmmZOa3Z4ZMig/0/index/_8uo_Lucene50_0.tim"))^@^N"org.apache.lucene.codecs.CodecUtil^A^NCodecUtil.java^KcheckFooter£^C"org.apache.lucene.codecs.CodecUtil^A^NCodecUtil.java^RchecksumEntireFile<8e>^D7org.apache.lucene.codecs.blocktree.BlockTreeTermsReader^A^YBlockTreeTermsReader.java^NcheckIntegrityÐ^BEorg.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader^A^[PerFieldPostingsFormat.java^NcheckIntegrityÜ^BIorg.apache.lucene.codecs.perfield.PerFieldMergeState$FilterFieldsProducer^A^WPerFieldMergeState.java^NcheckIntegrity<8f>^B'org.apache.lucene.codecs.FieldsConsumer^A^SFieldsConsumer.java^Emerge`Eorg.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter^A^[PerFieldPostingsFormat.java^Emerge¤^A%org.apache.lucene.index.SegmentMerger^A^RSegmentMerger.java
mergeTermsØ^A%org.apache.lucene.index.SegmentMerger^A^RSegmentMerger.java^Emergee#org.apache.lucene.index.IndexWriter^A^PIndexWriter.java^KmergeMiddle<84>"#org.apache.lucene.index.IndexWriter^A^PIndexWriter.java^EmergeÛ^^0org.apache.lucene.index.ConcurrentMergeScheduler^A^]ConcurrentMergeScheduler.java^GdoMergeð^DDorg.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler^A*ElasticsearchConcurrentMergeScheduler.java^GdoMergec<org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread^A^]ConcurrentMergeScheduler.java^Crun<95>^E^@^G%org.elasticsearch.index.engine.Engine^A^KEngine.java
failEngine¯^FDorg.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2^A^SInternalEngine.java^EdoRun<90>^LXorg.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable^A^RThreadContext.java^EdoRunþ^D9org.elasticsearch.common.util.concurrent.AbstractRunnable^A^UAbstractRunnable.java^Crun%'java.util.concurrent.ThreadPoolExecutor^A^WThreadPoolExecutor.java     runWorkerö^H.java.util.concurrent.ThreadPoolExecutor$Worker^A^WThreadPoolExecutor.java^Cruné^D^Pjava.lang.Thread^A^KThread.java^Cruné^E^@À(<93>è^@^@^@^@^@^@^@^@K.OÁ


I tried to check and fix the index with:

java -cp lucene-core*.jar -ea:org.apache.lucene… org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/nodes/0/indices/F_s-K-1YSFmmZOa3Z4ZMig/1/index -verbose -exorcise

After restarting Elasticsearch, the cluster was still red.

Then I moved the index out of the Elasticsearch data folder; after a restart, the cluster worked again.

So, how can I fix this index? What could be the reason for this failure? Thanks!


This is an indication of either hardware errors (check dmesg) or that the node may have run out of disk space.
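A quick triage along those lines might look like the sketch below. The data path is an assumed default; point `ES_DATA` at your node's actual data directory.

```shell
# Triage for a CorruptIndexException: kernel I/O errors and disk space.
# ES_DATA is an assumed default; adjust to your node's data path.
ES_DATA=${ES_DATA:-/var/lib/elasticsearch}

# Kernel ring buffer: look for I/O, ATA, or filesystem errors.
dmesg | grep -iE 'i/o error|ata[0-9]|ext4|xfs' | tail -n 20

# Free space on the data path: a full disk can also corrupt segments.
df -h "$ES_DATA"
```

SMART diagnostics (e.g. `smartctl` from smartmontools) on the underlying disk are also worth a look if dmesg shows I/O errors.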

At this point segments are corrupted and data is lost, meaning you can't recover the whole index anymore unless you have a snapshot (which is recommended for production).
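For next time, a minimal sketch of setting up snapshots. The repository name and location below are made up, and the path must be whitelisted via `path.repo` in elasticsearch.yml; this assumes a node on localhost:9200.

```shell
# Assumed: a node on localhost:9200 and a shared filesystem path that
# is whitelisted via path.repo. Names below are examples.
ES=${ES:-http://localhost:9200}

# Register a filesystem snapshot repository.
curl -XPUT "$ES/_snapshot/my_backup" -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/backups/my_backup"}}'

# Take a snapshot of all indices and wait for it to finish.
curl -XPUT "$ES/_snapshot/my_backup/snapshot-1?wait_for_completion=true"
```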

There are a couple of options to try to partially recover this index:

  1. Try to partially recover the corrupted shard:
    1. Close the index.
    2. Set index.shard.check_on_startup: fix for this index.
    3. Open the index. Verification will start at this point and may take a long time.
    4. If it recovers, redo steps 1 to 3 but set index.shard.check_on_startup: false, otherwise it will try to fix the shard every time it opens.
  2. If the shard can't be partially recovered, the only way is to drop it completely so at least the rest of the index can be recovered from the other healthy shards. To do that you could try the allocate_empty_primary command of the Cluster Reroute API.
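The two options above can be sketched as API calls. This assumes a node on localhost:9200; the index name, shard number, and node name are taken from the log lines earlier in this thread, so adjust them for your cluster.

```shell
ES=${ES:-http://localhost:9200}
INDEX=tracer-default-logs-network-2018.06.13

# Option 1: close -> enable fix -> open.
curl -XPOST "$ES/$INDEX/_close"
curl -XPUT "$ES/$INDEX/_settings" -H 'Content-Type: application/json' \
  -d '{"index.shard.check_on_startup": "fix"}'
# Opening triggers the verification, which may take a long time.
curl -XPOST "$ES/$INDEX/_open"

# Once recovered, repeat with the check disabled so the slow fix pass
# does not run on every open.
curl -XPOST "$ES/$INDEX/_close"
curl -XPUT "$ES/$INDEX/_settings" -H 'Content-Type: application/json' \
  -d '{"index.shard.check_on_startup": "false"}'
curl -XPOST "$ES/$INDEX/_open"

# Option 2: give up on the shard and allocate an empty primary in its
# place. THE DATA IN THIS SHARD IS LOST (hence accept_data_loss).
curl -XPOST "$ES/_cluster/reroute" -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_empty_primary": {
      "index": "'"$INDEX"'",
      "shard": 0,
      "node": "tracer-node-1",
      "accept_data_loss": true
    }
  }]
}'
```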

None of these are guaranteed to work, as it is highly dependent on the type of damage.


Thank you, I'll try it out and give feedback.

Note that the corruption marker file corrupted_* will prevent the shard from being allocated as primary. This file is managed by Elasticsearch, which is unaware that you've fixed the index using Lucene's CheckIndex. Removing this failure marker file should allow the shard to be allocated again.
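A sketch for locating those marker files. The indices path follows the data layout shown earlier in this thread; override `ES_INDICES` if your node uses a different path.

```shell
# Assumed default data layout; override ES_INDICES if yours differs.
ES_INDICES=${ES_INDICES:-/var/lib/elasticsearch/nodes/0/indices}

# List corruption markers; each sits in a shard's index/ directory.
find "$ES_INDICES" -name 'corrupted_*' 2>/dev/null

# Only after CheckIndex has fixed the shard, remove the marker, e.g.:
# rm "$ES_INDICES"/F_s-K-1YSFmmZOa3Z4ZMig/0/index/corrupted_*
```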


I finally got the data of the corrupted index back. The initial state of my development Elasticsearch instance was red; when I removed the corrupted_* file and restarted Elasticsearch, it worked.

To sum it up, the solution was: fixing the index with Lucene's CheckIndex tool, removing the corrupted_* file, and restarting Elasticsearch.

Thank you very much for your support! :bouquet:


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.