We're in the process of upgrading the firmware on the Debian servers that host our 12-node Elasticsearch cluster. We've been taking down one server at a time, always making sure the ES cluster was back to green before taking down the next. Yesterday something went wrong: when one server came back up and its Elasticsearch node restarted and rejoined the cluster, one of the shards failed to recover. The error was:
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=pfmhnn actual=b58uh9 (resource=name [_cqp.nvd], length , checksum [pfmhnn], writtenBy [6.5.1]) (resource=VerifyingIndexOutput(_cqp.nvd))
I have two questions:
Is it possible to force a shard to recover?
Given that we're not indexing heavily into the failed index, is there a way to force ES to accept one of the shard copies (we have one primary and one replica showing up as UNASSIGNED) and let it recover and become active again? I know this may cause some data loss, but for many of our indices a little data loss is better than the entire index remaining in a red state.
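For reference, this is the kind of command we were hoping would apply: a sketch using the _cluster/reroute API's allocate_stale_primary command, which (as we understand it) forces an existing stale copy to be promoted to primary while explicitly accepting data loss. The index and node names here are placeholders, and if the copy's segment files are themselves corrupt this may still fail:

```shell
# First, ask the cluster why the shard is unassigned:
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'

# Then (placeholders: "our-index", shard 0, "data-node-3") try to force
# the stale copy to become primary; accept_data_loss acknowledges that
# writes newer than this copy will be lost.
curl -XPOST 'http://localhost:9200/_cluster/reroute' \
  -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "our-index",
        "shard": 0,
        "node": "data-node-3",
        "accept_data_loss": true
      }
    }
  ]
}'
```

If even the stale copy is unusable, we're aware there is also an allocate_empty_primary command, but that discards the shard's data entirely, so we'd rather avoid it.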
Could this failure be due to large shards?
I know shards are supposed to stay below roughly 50 GB, and in this case we underestimated the document size and ended up with shards of around 250 GB. The index had worked fine, both for indexing and searching, but I fear the large size may be part of the cause of the failed recovery.
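For completeness, this is how we checked the shard sizes and states ("our-index" is a placeholder for the affected index):

```shell
# List shards of the affected index with their on-disk size and state;
# h= selects the columns, v adds a header row.
curl -XGET 'http://localhost:9200/_cat/shards/our-index?v&h=index,shard,prirep,state,store,node'
```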
With kind regards,