CorruptIndexException after node restart

Hi,

We're in the process of upgrading the firmware on a number of Debian servers hosting our 12-node Elasticsearch cluster. We've been taking down one server at a time, always making sure the cluster was back to green before taking down the next one. Yesterday something went wrong: when one of the servers came back up and its Elasticsearch node restarted and rejoined the cluster, one of the shards failed to recover. The error was:

org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=pfmhnn actual=b58uh9 (resource=name [_cqp.nvd], length [4903027], checksum [pfmhnn], writtenBy [6.5.1]) (resource=VerifyingIndexOutput(_cqp.nvd))

I have two questions:

  1. Is it possible to force a shard to recover?
    Given that we're not indexing heavily to the failed index, is there a way to force ES to accept one of the shards - we have one primary and one replica showing up as UNASSIGNED (a way to list them is sketched after these questions) - and make it active again? I know this may cause some data loss, but for many of our indices a little data loss is better than leaving the entire index in a red state.

  2. Could this failure be due to large shards?
    I know shards are supposed to stay below approximately 50 GB, and in this case we underestimated the document size and ended up with shards of around 250 GB. The index had worked fine, though, both for indexing and searching, but I fear the large size may be part of the cause of the failed recovery.

With kind regards,
Bernt Rostad
Retriever Norge

1 - Yes, with a forced allocation via the cluster reroute API, but it will likely result in data loss. There's a sketch below.
2 - Unlikely.
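
As a rough sketch only - the index name, shard number, and target node are placeholders you'd replace with values from your own cluster, and accept_data_loss is the explicit acknowledgement that anything not in the stale copy is gone:

    curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
    {
      "commands": [
        {
          "allocate_stale_primary": {
            "index": "my-index",
            "shard": 0,
            "node": "node-1",
            "accept_data_loss": true
          }
        }
      ]
    }'

If every on-disk copy turns out to be corrupt, allocate_empty_primary allocates a fresh, empty shard instead (losing that shard's data entirely), and allocate_replica handles the unassigned replica once the primary is back.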

What version are you on?

Thank you for the quick reply!

We're running Elasticsearch 5.4.1 and have 11 TB of data in the cluster (990 million docs).

I will take a look at the force allocate function 🙂
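
In case it helps anyone else: the allocation explain API (available since 5.0) reports why a shard is unassigned. Called with no body it picks the first unassigned shard it finds:

    curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'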

The corruption could have happened at the disk/hardware level. I don't know enough about this to comment properly though, so perhaps one of the other devs will jump in 🙂
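
If you want to rule the hardware in or out, a couple of quick checks on the affected server might be worth running; the device name is just an example, so adjust it to whatever disk holds your Elasticsearch data path:

    # kernel-level I/O errors around the time of the restart
    dmesg | grep -i 'i/o error'

    # SMART health summary for the disk (requires smartmontools)
    smartctl -H /dev/sda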

Yes, that is also a possibility, so I'm following several paths of enquiry to see if I can learn why this happened.

The failed shard is not critical in this case, but it could be in a future situation, so I'd like to learn as much from this incident as possible: both how to recover unassigned shards and how to prevent this from happening again.

Thanks for your help!
