Failed to recover from translog CurrentState[CLOSED]

The cluster has unassigned shards, possibly due to shard corruption. The Elasticsearch process on the node was flaky due to OOM errors and was frequently restarting. Disk space on the node was sufficient, and I have yet to verify whether there was a hardware failure. I'd like to understand what other events could cause the errors below. The first error shows the shard as unassigned, which looks like the primary died before the replica could recover. Is the "unexpected failure" related to a networking failure, and could it cause translog corruption?
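
For reference, a minimal sketch of how the unassigned shards can be listed and a specific shard's allocation explained, assuming the cluster is reachable on localhost:9200 without authentication (index and shard number taken from the log below, so adjust for your case):

    import json
    import requests

    BASE = "http://localhost:9200"  # assumption: local, unauthenticated cluster

    # List all shard copies with their state and, for unassigned copies, the
    # reason (e.g. PRIMARY_FAILED, ALLOCATION_FAILED).
    resp = requests.get(
        BASE + "/_cat/shards",
        params={"h": "index,shard,prirep,state,unassigned.reason", "format": "json"},
    )
    for row in resp.json():
        if row["state"] == "UNASSIGNED":
            print(row)

    # Ask the cluster to explain the allocation of one specific shard copy.
    explain = requests.post(
        BASE + "/_cluster/allocation/explain",
        json={"index": "some_index_2019.07.09", "shard": 3, "primary": True},
    )
    print(json.dumps(explain.json(), indent=2))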

[2019-07-10T04:28:00,129][ERROR][o.e.c.a.s.ShardStateAction] [NgOTVvC] [some_index_2019.07.09][3] unexpected failure while failing shard [shard id [[some_index_2019.07.09][3]], allocation id [rGR2c936RruHuOwdXwKKmQ], primary term [10], message [failed to perform indices:data/write/bulk[s] on replica [some_index_2019.07.09][3], node[YWO9RoPHSXyxPcthgcXILQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=rGR2c936RruHuOwdXwKKmQ], unassigned_info[[reason=PRIMARY_FAILED], at[2019-07-10T04:10:11.258Z], delayed=false, details[primary failed while replica initializing], allocation_status[no_attempt]]], failure [NodeDisconnectedException[[YWO9RoP][172.xx.xx.xx:9300][indices:data/write/bulk[s][r]] disconnected]], markAsStale [true]]    
[2019-07-10T08:43:58,732][o.e.i.e.Engine           ] [c67LMTy] [some_index_2019.07.09][3] failed engine [failed to recover from translog]
    org.elasticsearch.index.engine.EngineException: failed to recover from translog
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:406) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:377) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:98) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1297) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:420) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1567) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$5(IndexShard.java:2020) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:635) [elasticsearch-6.3.1.jar:6.3.1]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
            at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
    Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[CLOSED] operation only allowed when recovering, origin [LOCAL_TRANSLOG_RECOVERY]
            at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:1444) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:674) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1236) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:1265) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:404) ~[elasticsearch-6.3.1.jar:6.3.1]
            ... 13 more

Can someone please take a look? I'm more than happy to provide more details.

This doesn't indicate corruption. It indicates that a primary shard was recovering some missing operations from its local translog and, while that recovery was in progress, the allocation of that primary shard was cancelled. This can happen if the node holding the shard left the cluster and then rejoined while the primary was recovering, which could be caused by a network issue.
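
If a shard stays unassigned because its allocation failed too many times (index.allocation.max_retries defaults to 5), you can ask the master to retry once the underlying issue is resolved. A minimal sketch, again assuming the cluster is reachable on localhost:9200 without authentication:

    import requests

    BASE = "http://localhost:9200"  # assumption: local, unauthenticated cluster

    # retry_failed=true makes the master retry shards whose allocation previously
    # failed more than index.allocation.max_retries (default 5) times.
    resp = requests.post(BASE + "/_cluster/reroute", params={"retry_failed": "true"})
    resp.raise_for_status()
    print(resp.json().get("acknowledged"))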
