Failed to recover from translog CurrentState[CLOSED]

The cluster has unassigned shards, possibly due to shard corruption. The Elasticsearch process on the node was flaky due to OOM errors and was frequently restarting. Disk space on the node was sufficient, and I have yet to verify whether there was a hardware failure. I'd like to understand what other events could cause the errors below. The first error shows the shard as unassigned, which looks like the primary died before the replica could recover. Is the "unexpected failure" related to a networking failure, and could it cause translog corruption?
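
For reference, a minimal sketch of how the unassigned shards can be listed and a specific shard's allocation explained, assuming the cluster is reachable on localhost:9200 without authentication (index and shard number taken from the log below, so adjust for your case):

    import json
    import requests

    BASE = "http://localhost:9200"  # assumption: local, unauthenticated cluster

    # List all shard copies with their state and, for unassigned copies, the
    # reason (e.g. PRIMARY_FAILED, ALLOCATION_FAILED).
    resp = requests.get(
        BASE + "/_cat/shards",
        params={"h": "index,shard,prirep,state,unassigned.reason", "format": "json"},
    )
    for row in resp.json():
        if row["state"] == "UNASSIGNED":
            print(row)

    # Ask the cluster to explain the allocation of one specific shard copy.
    explain = requests.post(
        BASE + "/_cluster/allocation/explain",
        json={"index": "some_index_2019.07.09", "shard": 3, "primary": True},
    )
    print(json.dumps(explain.json(), indent=2))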

[2019-07-10T04:28:00,129][ERROR][o.e.c.a.s.ShardStateAction] [NgOTVvC] [some_index_2019.07.09][3] unexpected failure while failing shard [shard id [[some_index_2019.07.09][3]], allocation id [rGR2c936RruHuOwdXwKKmQ], primary term [10], message [failed to perform indices:data/write/bulk[s] on replica [some_index_2019.07.09][3], node[YWO9RoPHSXyxPcthgcXILQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=rGR2c936RruHuOwdXwKKmQ], unassigned_info[[reason=PRIMARY_FAILED], at[2019-07-10T04:10:11.258Z], delayed=false, details[primary failed while replica initializing], allocation_status[no_attempt]]], failure [NodeDisconnectedException[[YWO9RoP][172.xx.xx.xx:9300][indices:data/write/bulk[s][r]] disconnected]], markAsStale [true]]    
[2019-07-10T08:43:58,732][o.e.i.e.Engine           ] [c67LMTy] [some_index_2019.07.09][3] failed engine [failed to recover from translog]
    org.elasticsearch.index.engine.EngineException: failed to recover from translog
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:406) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:377) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:98) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1297) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:420) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1567) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$5(IndexShard.java:2020) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:635) [elasticsearch-6.3.1.jar:6.3.1]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
            at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
    Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[CLOSED] operation only allowed when recovering, origin [LOCAL_TRANSLOG_RECOVERY]
            at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:1444) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:674) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1236) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:1265) ~[elasticsearch-6.3.1.jar:6.3.1]
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:404) ~[elasticsearch-6.3.1.jar:6.3.1]
            ... 13 more

Can someone please take a look? I'm more than happy to provide more details.

This doesn't indicate corruption. It indicates that a primary shard was recovering some missing operations from its local translog and, while that recovery was in progress, the allocation of that primary shard was cancelled. This can happen if the node holding the shard left the cluster and then rejoined while the primary was recovering, which could be caused by a network issue.
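
If a shard stays unassigned because its allocation failed too many times (index.allocation.max_retries defaults to 5), you can ask the master to retry once the underlying issue is resolved. A minimal sketch, again assuming the cluster is reachable on localhost:9200 without authentication:

    import requests

    BASE = "http://localhost:9200"  # assumption: local, unauthenticated cluster

    # retry_failed=true makes the master retry shards whose allocation previously
    # failed more than index.allocation.max_retries (default 5) times.
    resp = requests.post(BASE + "/_cluster/reroute", params={"retry_failed": "true"})
    resp.raise_for_status()
    print(resp.json().get("acknowledged"))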
