One of the primary shards of the index is corrupt

Hi, I am getting this error. Is there any way to restore this shard?

GET /_cluster/allocation/explain?pretty
{
  "index" : "es-document-000004",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-03-31T20:10:08.007Z",
    "failed_allocation_attempts" : 4,
    "details" : "failed shard on node [ULQc2PtCRgmAZCeqhYYzNg]: shard failure, reason [corrupt file (source: [start])], failure java.nio.file.FileSystemException: /data/indices/nAINL_GfRy2W91poWjOPBQ/2/index/_jx.fdx: Too many open files\n\tat sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)\n\tat sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)\n\tat sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)\n\tat sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)\n\tat java.nio.channels.FileChannel.open(FileChannel.java:298)\n\tat java.nio.channels.FileChannel.open(FileChannel.java:357)\n\tat org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78)\n\tat org.elasticsearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:124)\n\tat org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101)\n\tat org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101)\n\tat org.apache.lucene.codecs.lucene90.compressing.FieldsIndexReader.<init>(FieldsIndexReader.java:68)\n\tat org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.<init>(Lucene90CompressingStoredFieldsReader.java:166)\n\tat org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat.fieldsReader(Lucene90CompressingStoredFieldsFormat.java:133)\n\tat org.apache.lucene.codecs.lucene90.Lucene90StoredFieldsFormat.fieldsReader(Lucene90StoredFieldsFormat.java:136)\n\tat org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:138)\n\tat org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:91)\n\tat org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179)\n\tat org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:221)\n\tat org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:535)\n\tat 
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138)\n\tat org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:597)\n\tat org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)\n\tat org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:91)\n\tat org.elasticsearch.index.engine.InternalEngine.createReaderManager(InternalEngine.java:617)\n\tat org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:245)\n\tat org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:191)\n\tat org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:14)\n\tat org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1930)\n\tat org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1894)\n\tat org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:461)\n\tat org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:88)\n\tat org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:462)\n\tat org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:86)\n\tat org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2239)\n\tat org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.lang.Thread.run(Thread.java:833)\n\tSuppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (4f805c91). 
possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path=\"/data/indices/nAINL_GfRy2W91poWjOPBQ/2/index/_jx.fdm\")))\n\t\tat org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:500)\n\t\tat org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.<init>(Lucene90CompressingStoredFieldsReader.java:208)\n\t\t... 28 more\n",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "Elasticsearch can't allocate this shard because all the copies of its data in the cluster are stale or corrupt. Elasticsearch will allocate this shard when a node containing a good copy of its data joins the cluster. If no such node is available, restore this index from a recent snapshot.",
  "node_allocation_decisions" : [
    {
      "node_id" : "ULQc2PtCRgmAZCeqhYYzNg",
      "node_name" : "data2",
      "transport_address" : "x.x.x.x:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "RGteemkTTDCvVgTehBo0JA",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [corrupt file (source: [start])]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [corrupt file (source: [start])])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "checksum passed (4f805c91). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path=\"/data/indices/nAINL_GfRy2W91poWjOPBQ/2/index/_jx.fdm\")))"
            }
          }
        }
      }
    },
    {
      "node_id" : "wbTIeBzwR1ODyoeTs9nQ2w",
      "node_name" : "data1",
      "transport_address" : "x.x.x.x:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "SrLEjFdsCKR3ucQbvK64jQkg"
      }
    }
  ]
}

Hi,

You appear to have two options:

  1. Delete the index and restore it from a snapshot
  2. If a replica copy exists, delete the corrupt primary and let the replica be promoted

However, based on the error output, if a node held an in-sync replica copy, the second option would kick in automatically and correct the issue on its own. In your case the only other copy of shard 2 (on data1) is reported with "in_sync" : false, so Elasticsearch will not promote it.
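As an absolute last resort, if you have no snapshot, there is also the _cluster/reroute API with an allocate_stale_primary command, which force-allocates that stale copy on data1 as the new primary and permanently discards whatever writes it is missing. A sketch, assuming a node reachable on localhost:9200 (adjust host/auth to your setup):

```shell
# Force-allocate the stale copy on data1 as the new primary.
# "accept_data_loss": true is mandatory -- any documents missing
# from the stale copy are permanently lost.
curl -s -X POST 'localhost:9200/_cluster/reroute?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "es-document-000004",
        "shard": 2,
        "node": "data1",
        "accept_data_loss": true
      }
    }
  ]
}'
```

Only do this once you are certain no better copy of shard 2 exists anywhere.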

That leaves restoring from a snapshot, if you have one.
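If a snapshot repository is registered, the restore is roughly as follows; the repository name my_repo and snapshot name snap_1 are placeholders for your own, and the node is assumed reachable on localhost:9200:

```shell
# List snapshots in the repository to find one containing the index
curl -s 'localhost:9200/_snapshot/my_repo/_all?pretty'

# Delete the broken index first -- a restore cannot
# overwrite an existing open index of the same name
curl -s -X DELETE 'localhost:9200/es-document-000004'

# Restore only this index from the chosen snapshot
curl -s -X POST 'localhost:9200/_snapshot/my_repo/snap_1/_restore?wait_for_completion=true' \
  -H 'Content-Type: application/json' -d '
{
  "indices": "es-document-000004"
}'
```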

You could attempt a restart of the Elasticsearch node itself in the hope that a "have you tried turning it off and on again" resolves it, but I wouldn't bet any money on it. That said, the root failure in your stack trace is "Too many open files", and the suppressed exception even reports that the checksum passed ("possibly transient resource issue"), so this may be a file descriptor limit problem rather than true on-disk corruption.
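Before restarting, it's worth checking whether the file descriptor limit really is the problem, since that is what the trace reports. Assuming a node reachable on localhost:9200:

```shell
# Limits each Elasticsearch node actually got at startup vs. current usage
curl -s 'localhost:9200/_nodes/stats/process?filter_path=**.max_file_descriptors,**.open_file_descriptors&pretty'

# Shell-level limit on the host; Elasticsearch recommends at least 65535
ulimit -n
```

If max_file_descriptors is low (e.g. 1024 or 4096), raise the limit in the systemd unit or /etc/security/limits.conf before restarting the node.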
