I performed a rolling upgrade of my ES cluster from 5.6.0 to 6.0.0. The upgrade itself went smoothly and all of my nodes came back on the new version.
Post upgrade, I noticed that a single replica shard of one index was unassigned with an ERROR, so I used the reroute API with retry_failed
to try to allocate this shard (the exact call is shown after the log below). The allocation failed again and the node logged the following error:
[2017-11-20T16:42:31,679][WARN ][o.e.i.c.IndicesClusterStateService] [esnode2] [[http-2017.11.20][1]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [http-2017.11.20][1]: Recovery failed from {esnode1}{sLciS6igSYCduQZ65pa8YQ}{mecNh3TBToyO7VDKgNEetQ}{10.44.0.46}{10.44.0.46:9201} into {esnode2}{N55B5iQBQGy6B7YqpudtRw}{uv8NvPcOT0WVPwnsdzwe5w}{10.44.0.47}{10.44.0.47:9201}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:75) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:617) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.0.0.jar:6.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.transport.RemoteTransportException: [esnode1][10.44.0.46:9201][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[2] phase2 failed
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:194) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1540) ~[elasticsearch-6.0.0.jar:6.0.0]
... 5 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [esnode2][10.44.0.47:9201][internal:index/shard/recovery/translog_ops]
Caused by: org.elasticsearch.index.translog.TranslogException: Failed to write operation [Index{id='AV_Y6q4h5gnndJiKtcQn', type='bro', seqNo=-2, primaryTerm=0}]
at org.elasticsearch.index.translog.Translog.add(Translog.java:520) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:708) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:727) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:696) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1214) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:395) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:442) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:433) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1540) ~[elasticsearch-6.0.0.jar:6.0.0]
... 5 more
Caused by: java.lang.IllegalArgumentException: sequence number must be assigned
at org.elasticsearch.index.seqno.SequenceNumbers.min(SequenceNumbers.java:90) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.translog.TranslogWriter.add(TranslogWriter.java:202) ~[elasticsearch-6.0.0.jar:6.0.0]
... some more traceback
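For reference, the retry mentioned above was the standard cluster reroute with retry_failed, which resets the failed-allocation counter and asks the master to attempt allocation again. A minimal sketch of the call using Python's requests (the host and port are taken from the log above; any node should do):

```python
import requests

# Node address taken from the log above; the request is forwarded to the master.
ES = "http://10.44.0.46:9201"

# Retry allocation of shards that were blocked after too many failed attempts.
resp = requests.post(f"{ES}/_cluster/reroute", params={"retry_failed": "true"})
resp.raise_for_status()
print(resp.json().get("acknowledged"))
```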
Apparently there is some problem with the sequence number. As this was only a replica shard, I removed and re-added replicas to work around the problem (a rough sketch of what I ran is below). However, does anyone have any idea why this error came up?
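The workaround was just the usual replica bounce via the index settings API, roughly like this (the index name is taken from the log above; the original replica count of 1 is an assumption):

```python
import requests

ES = "http://10.44.0.46:9201"      # any node works; host taken from the log above
INDEX = "http-2017.11.20"          # index name from the log above

# Drop replicas so the failed replica copy is removed entirely.
r = requests.put(f"{ES}/{INDEX}/_settings",
                 json={"index": {"number_of_replicas": 0}})
r.raise_for_status()

# Re-add replicas so a fresh copy is recovered from the primary.
# The original replica count of 1 is an assumption.
r = requests.put(f"{ES}/{INDEX}/_settings",
                 json={"index": {"number_of_replicas": 1}})
r.raise_for_status()
```

After this the new replica allocated and recovered without the error, but I would still like to understand the root cause.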