Hi,
Since upgrading from 6.2 to 6.4, Elasticsearch has been failing frequently, and shard allocation after a restart is very slow. The log shows the following:
[2018-09-13T17:26:23,546][WARN ][o.e.i.c.IndicesClusterStateService] [ELK1-preprod] [[clientzone_signaturepos-2018.08.24][2]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [clientzone_signaturepos-2018.08.24][2]: Recovery failed from {ELK2-preprod}{X9uh1liyTGK2LWyMGv3BOA}{Ic15dOerSW6yGmygNTE-zw}{10.56.20.95}{10.56.20.95:9300}{ml.machine_memory=33568194560, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {ELK1-preprod}{ykLXYsriTPyQkkOQVKK41A}{GG-Vbx6lS1GHblvUE-QrNQ}{10.56.20.94}{10.56.20.94:9300}{ml.machine_memory=33568194560, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.4.0.jar:6.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.elasticsearch.transport.RemoteTransportException: [ELK2-preprod][10.56.20.95:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:191) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[elasticsearch-6.4.0.jar:6.4.0]
... 5 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [ELK1-preprod][10.56.20.94:9300][internal:index/shard/recovery/prepare_translog]
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: failed to create engine
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:199) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:160) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.4.0.jar:6.4.0]
... 5 more
[2018-09-13T17:26:34,014][WARN ][o.e.g.DanglingIndicesState] [ELK1-preprod] [[clientzone_integrationconnector-2018.07.16/ubGfO0sGSCqV8LZxmkxMVQ]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
[2018-09-13T17:26:34,014][WARN ][o.e.g.DanglingIndicesState] [ELK1-preprod] [[error-partnerzone_partner-2018.07.16/fzlLkW57T6iZfTUzAhDgtw]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
It has been 7 hours since the restart, active_shards_percent_as_number is still only 59%, and cluster health is red.
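For reference, the percentage above is the active_shards_percent_as_number field of the cluster health API (GET /_cluster/health). Here is a minimal sketch of how that value can be polled. The base URL is an assumption, authentication is left out even though security is enabled on this cluster, and the helper names are only illustrative.

```python
import json
from urllib.request import urlopen


def summarize_health(health):
    """Build a one-line summary from a parsed _cluster/health response."""
    return "status={} active={:.1f}% initializing={} unassigned={}".format(
        health["status"],
        health["active_shards_percent_as_number"],
        health["initializing_shards"],
        health["unassigned_shards"],
    )


def fetch_health(base_url="http://localhost:9200"):
    """Fetch cluster health as a dict.

    Assumption: the node is reachable at base_url without credentials.
    With security enabled, an Authorization header would have to be added.
    """
    with urlopen(base_url + "/_cluster/health") as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(summarize_health(fetch_health()))
```

Running it every few minutes shows whether the percentage is still moving or has stalled.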
Can someone give me advice on how to resolve this problem?