We're in the process of upgrading the firmware on the Debian servers that host our 12-node Elasticsearch cluster. We've been taking down one server at a time, always making sure the ES cluster was back to green before taking down the next. Yesterday something went wrong: when one server came back up and its Elasticsearch node restarted and rejoined the cluster, one of the shards failed to recover. The error was:
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=pfmhnn actual=b58uh9 (resource=name [_cqp.nvd], length , checksum [pfmhnn], writtenBy [6.5.1]) (resource=VerifyingIndexOutput(_cqp.nvd))
I have two questions:
Is it possible to force a shard to recover?
Given that we're not indexing heavily into the failed index, is there a way to force ES to accept one of the shard copies (we have one primary and one replica showing up as UNASSIGNED) and let it recover and become active again? I know this may cause some data loss, but for many of our indices a little data loss is better than the entire index remaining in a red state.
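For reference, this is the kind of command we were hoping would apply: a sketch using the _cluster/reroute API's allocate_stale_primary command, which (as we understand it) forces an existing stale copy to be promoted to primary while explicitly accepting data loss. The index and node names here are placeholders, and if the copy's segment files are themselves corrupt this may still fail:

```shell
# First, ask the cluster why the shard is unassigned:
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'

# Then (placeholders: "our-index", shard 0, "data-node-3") try to force
# the stale copy to become primary; accept_data_loss acknowledges that
# writes newer than this copy will be lost.
curl -XPOST 'http://localhost:9200/_cluster/reroute' \
  -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "our-index",
        "shard": 0,
        "node": "data-node-3",
        "accept_data_loss": true
      }
    }
  ]
}'
```

If even the stale copy is unusable, we're aware there is also an allocate_empty_primary command, but that discards the shard's data entirely, so we'd rather avoid it.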
Could this failure be due to large shards?
I know shards are supposed to stay below roughly 50 GB, and in this case we underestimated the document size and ended up with shards of around 250 GB. The index had worked fine, both for indexing and searching, but I fear the large size may be part of the cause of the failed recovery.
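For completeness, this is how we checked the shard sizes and states ("our-index" is a placeholder for the affected index):

```shell
# List shards of the affected index with their on-disk size and state;
# h= selects the columns, v adds a header row.
curl -XGET 'http://localhost:9200/_cat/shards/our-index?v&h=index,shard,prirep,state,store,node'
```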
With kind regards,