Corrupted elastic index


(Matthias Wilhelm) #1

My instance of Elasticsearch (5.6.1) stopped indexing because of the following error:

[2018-06-13T16:39:07,413][ERROR][o.e.i.e.InternalEngine$EngineMergeScheduler] [tracer-node-1] [tracer-default-logs-network-2018.06.13][0] failed to merge org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=e0722284 actual=829b88e5 .....

When I looked into the index folder on the filesystem, I found the following file:

-rw-r--r-- 1 elasticsearch elasticsearch 2.1K Jun 13 16:39 corrupted_hhX5NOZcSAqV8Mir0_Lfjw

Its content (a binary corruption marker; only the recoverable strings are shown):

failed engine (reason: [merge failed])
checksum failed (hardware problem?) : expected=e0722284 actual=829b88e5
BufferedChecksumIndexInput(MMapIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/F_s-K-1YSFmmZOa3Z4ZMig/0/index/_8uo_Lucene50_0.tim"))

at org.apache.lucene.codecs.CodecUtil.checkFooter (CodecUtil.java)
at org.apache.lucene.codecs.CodecUtil.checksumEntireFile (CodecUtil.java)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.checkIntegrity (BlockTreeTermsReader.java)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity (PerFieldPostingsFormat.java)
at org.apache.lucene.codecs.perfield.PerFieldMergeState$FilterFieldsProducer.checkIntegrity (PerFieldMergeState.java)
at org.apache.lucene.codecs.FieldsConsumer.merge (FieldsConsumer.java)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge (PerFieldPostingsFormat.java)
at org.apache.lucene.index.SegmentMerger.mergeTerms (SegmentMerger.java)
at org.apache.lucene.index.SegmentMerger.merge (SegmentMerger.java)
at org.apache.lucene.index.IndexWriter.mergeMiddle (IndexWriter.java)
at org.apache.lucene.index.IndexWriter.merge (IndexWriter.java)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge (ConcurrentMergeScheduler.java)
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge (ElasticsearchConcurrentMergeScheduler.java)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run (ConcurrentMergeScheduler.java)
at org.elasticsearch.index.engine.Engine.failEngine (Engine.java)
at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun (InternalEngine.java)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun (ThreadContext.java)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run (AbstractRunnable.java)
at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java)
at java.lang.Thread.run (Thread.java)


I tried to check and fix the index with:

java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/nodes/0/indices/F_s-K-1YSFmmZOa3Z4ZMig/1/index -verbose -exorcise

I restarted Elasticsearch, but the cluster was still red.

Then I moved the index out of the Elasticsearch data folder; after a restart, the cluster worked again.

So, how can I fix this index? What could be the reason for this failure? Thanks!


(Thiago Souza) #2

This is an indication of either hardware errors (check dmesg) or that the node may have run out of disk space.
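For the hardware side, a quick look at the kernel log and the free space on the data path of the affected node is a reasonable start (the grep pattern is just a starting point; adjust it to your kernel's messages):

# Scan the kernel log for I/O or disk errors
dmesg -T | grep -iE 'error|fail|ata|i/o'

# Check that the Elasticsearch data path still has free space
df -h /var/lib/elasticsearch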

At this point the segments are corrupted and data has been lost, meaning you can't recover the whole index anymore unless you have a snapshot (which is recommended for production).
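For reference, a minimal filesystem snapshot setup looks like this (repository name, snapshot name, and location are placeholders; the location must be listed under path.repo in elasticsearch.yml):

# Register a shared filesystem snapshot repository
curl -XPUT 'localhost:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d '
{ "type": "fs", "settings": { "location": "/mnt/backups/elasticsearch" } }'

# Snapshot all indices and wait for completion
curl -XPUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'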

There are a couple of options to try to partially recover this index:

  1. Try to partially recover the corrupted shard (example calls follow this list):
    1. Close the index.
    2. Set index.shard.check_on_startup: fix for this index.
    3. Open the index. At this point the shard will start to be verified, which may take a long time.
    4. If it recovers, redo steps 1 to 3 but set index.shard.check_on_startup: false, otherwise the shard will be checked again every time the index is opened.
  2. If the shard can't be partially recovered, the only way out is to drop it completely, so that the index can at least be recovered with the other healthy shards. For that, you can try the allocate_empty_primary command of the Cluster Reroute API (second example below).
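A sketch of the first option with curl, using the index name from the error log above (substitute your own; index.shard.check_on_startup is a static setting, so the index has to stay closed while you change it):

# 1. Close the index
curl -XPOST 'localhost:9200/tracer-default-logs-network-2018.06.13/_close'

# 2. Tell Elasticsearch to check and fix the shards of this index on startup
curl -XPUT 'localhost:9200/tracer-default-logs-network-2018.06.13/_settings' -H 'Content-Type: application/json' -d '
{ "index.shard.check_on_startup": "fix" }'

# 3. Re-open the index; verification starts now and may take a long time
curl -XPOST 'localhost:9200/tracer-default-logs-network-2018.06.13/_open'

# 4. If it recovers, repeat the close/_settings/open cycle with "false" instead of "fix"

And a sketch of the second option, which explicitly accepts the data loss for that shard (shard number and node name are taken from the log above; point them at the shard that is actually stuck):

curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "tracer-default-logs-network-2018.06.13",
        "shard": 0,
        "node": "tracer-node-1",
        "accept_data_loss": true
      }
    }
  ]
}'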

None of these are guaranteed to work, as success depends heavily on the type of damage.


(Matthias Wilhelm) #3

Thank you, I'll try it out and give feedback.


(Yannick Welsch) #4

Note that the corruption marker file (corrupted_*) will prevent the shard from being allocated as primary. This file is managed by Elasticsearch, which is unaware that you've fixed the index with Lucene's CheckIndex. Removing this failure marker file should allow the shard to be allocated again.
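Concretely, with the shard path and marker file name from the original post (safest with Elasticsearch stopped on that node first):

# Remove the corruption marker so the shard can be allocated as primary again
rm /var/lib/elasticsearch/nodes/0/indices/F_s-K-1YSFmmZOa3Z4ZMig/0/index/corrupted_hhX5NOZcSAqV8Mir0_Lfjw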