Corrupted elastic index


(Matthias Wilhelm) #1

My instance of Elasticsearch (5.6.1) stopped indexing because of the following error:

[2018-06-13T16:39:07,413][ERROR][o.e.i.e.InternalEngine$EngineMergeScheduler] [tracer-node-1] [tracer-default-logs-network-2018.06.13][0] failed to merge org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=e0722284 actual=829b88e5 .....

When I looked into the index folder on the filesystem, I found the following file:

-rw-r--r-- 1 elasticsearch elasticsearch 2.1K Jun 13 16:39 corrupted_hhX5NOZcSAqV8Mir0_Lfjw

Its content (a binary corruption marker; only the recoverable strings are shown):

failed engine (reason: [merge failed])
checksum failed (hardware problem?) : expected=e0722284 actual=829b88e5
BufferedChecksumIndexInput(MMapIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/F_s-K-1YSFmmZOa3Z4ZMig/0/index/_8uo_Lucene50_0.tim"))

at org.apache.lucene.codecs.CodecUtil.checkFooter (CodecUtil.java)
at org.apache.lucene.codecs.CodecUtil.checksumEntireFile (CodecUtil.java)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.checkIntegrity (BlockTreeTermsReader.java)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity (PerFieldPostingsFormat.java)
at org.apache.lucene.codecs.perfield.PerFieldMergeState$FilterFieldsProducer.checkIntegrity (PerFieldMergeState.java)
at org.apache.lucene.codecs.FieldsConsumer.merge (FieldsConsumer.java)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge (PerFieldPostingsFormat.java)
at org.apache.lucene.index.SegmentMerger.mergeTerms (SegmentMerger.java)
at org.apache.lucene.index.SegmentMerger.merge (SegmentMerger.java)
at org.apache.lucene.index.IndexWriter.mergeMiddle (IndexWriter.java)
at org.apache.lucene.index.IndexWriter.merge (IndexWriter.java)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge (ConcurrentMergeScheduler.java)
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge (ElasticsearchConcurrentMergeScheduler.java)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run (ConcurrentMergeScheduler.java)
at org.elasticsearch.index.engine.Engine.failEngine (Engine.java)
at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun (InternalEngine.java)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun (ThreadContext.java)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run (AbstractRunnable.java)
at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java)
at java.lang.Thread.run (Thread.java)


I tried to check and fix the index with:

java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/nodes/0/indices/F_s-K-1YSFmmZOa3Z4ZMig/1/index -verbose -exorcise

I restarted Elasticsearch, but the cluster was still red.

Then I moved the index out of the Elasticsearch data folder; after a restart, the cluster worked again.

So, how can I fix this index? What could be the reason for this failure? Thanks!


(Thiago Souza) #2

This is an indication of either hardware errors (check dmesg) or that the node may have run out of disk space.
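For the hardware side, a quick look at the kernel log and the free space on the data path of the affected node is a reasonable start (the grep pattern is just a starting point; adjust it to your kernel's messages):

# Scan the kernel log for I/O or disk errors
dmesg -T | grep -iE 'error|fail|ata|i/o'

# Check that the Elasticsearch data path still has free space
df -h /var/lib/elasticsearch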

At this point the segments are corrupted and data has been lost, meaning you can't recover the whole index anymore unless you have a snapshot (which is recommended for production).
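For reference, a minimal filesystem snapshot setup looks like this (repository name, snapshot name, and location are placeholders; the location must be listed under path.repo in elasticsearch.yml):

# Register a shared filesystem snapshot repository
curl -XPUT 'localhost:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d '
{ "type": "fs", "settings": { "location": "/mnt/backups/elasticsearch" } }'

# Snapshot all indices and wait for completion
curl -XPUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'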

There are a couple of options to try to partially recover this index:

  1. Try to partially recover the corrupted shard (example calls follow this list):
    1. Close the index.
    2. Set index.shard.check_on_startup: fix for this index.
    3. Open the index. At this point the shard will start to be verified, which may take a long time.
    4. If it recovers, redo steps 1 to 3 but set index.shard.check_on_startup: false, otherwise the shard will be checked again every time the index is opened.
  2. If the shard can't be partially recovered, the only way out is to drop it completely, so that the index can at least be recovered with the other healthy shards. For that, you can try the allocate_empty_primary command of the Cluster Reroute API (second example below).
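A sketch of the first option with curl, using the index name from the error log above (substitute your own; index.shard.check_on_startup is a static setting, so the index has to stay closed while you change it):

# 1. Close the index
curl -XPOST 'localhost:9200/tracer-default-logs-network-2018.06.13/_close'

# 2. Tell Elasticsearch to check and fix the shards of this index on startup
curl -XPUT 'localhost:9200/tracer-default-logs-network-2018.06.13/_settings' -H 'Content-Type: application/json' -d '
{ "index.shard.check_on_startup": "fix" }'

# 3. Re-open the index; verification starts now and may take a long time
curl -XPOST 'localhost:9200/tracer-default-logs-network-2018.06.13/_open'

# 4. If it recovers, repeat the close/_settings/open cycle with "false" instead of "fix"

And a sketch of the second option, which explicitly accepts the data loss for that shard (shard number and node name are taken from the log above; point them at the shard that is actually stuck):

curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "tracer-default-logs-network-2018.06.13",
        "shard": 0,
        "node": "tracer-node-1",
        "accept_data_loss": true
      }
    }
  ]
}'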

None of these are guaranteed to work, as success depends heavily on the type of damage.


(Matthias Wilhelm) #3

Thank you, I'll try it out and give feedback.


(Yannick Welsch) #4

Note that the corruption marker file (corrupted_*) will prevent the shard from being allocated as primary. This file is managed by Elasticsearch, which is unaware that you've fixed the index with Lucene's CheckIndex. Removing this failure marker file should allow the shard to be allocated again.
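Concretely, with the shard path and marker file name from the original post (safest with Elasticsearch stopped on that node first):

# Remove the corruption marker so the shard can be allocated as primary again
rm /var/lib/elasticsearch/nodes/0/indices/F_s-K-1YSFmmZOa3Z4ZMig/0/index/corrupted_hhX5NOZcSAqV8Mir0_Lfjw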