ES upgrade shards issue - ES 5.6.10 to ES 6.3

Team,

We have a two-node ES cluster that was previously upgraded to 5.6.10, and we are now having shard issues while upgrading from ES 5.6.10 to ES 6.3. We have tried both approaches, a rolling upgrade and a full cluster restart, and neither succeeded.
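
For reference, the stuck shards and the reason they cannot be allocated can be listed with the cat shards and cluster allocation explain APIs (a diagnostic sketch; my-201708 is one of the affected indices, taken from the log below):

GET _cat/shards/my-201708?v
GET _cluster/allocation/explain

The replica shards repeatedly fail to recover with the following error: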

[2018-06-20T06:06:00,590][WARN ][o.e.c.r.a.AllocationService] [host01.xxx] failing shard [failed shard, shard [my-201708][2], node[s-PbDSILSmKRYlTO4_FOvw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=aqQb_CrCS-WmyRcrSFtF6w], unassigned_info[[reason=ALLOCATION_FAILED], at[2018-06-20T06:05:59.714Z], failed_attempts[4], delayed=false, details[failed shard on node [s-PbDSILSmKRYlTO4_FOvw]: failed recovery, failure RecoveryFailedException[[my-201708][2]: Recovery failed from {host01.xxx}{wsV326J5QxGdk31A7XY38w}{dhXjygn0Sg-tcKICCSRkiw}{host01.xxx}{150.0.1.242:9300}{ml.machine_memory=32898998272, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {host02.xxx}{s-PbDSILSmKRYlTO4_FOvw}{VAJjb_PvTbe03nBZE_HBKA}{host02.xxx}{150.0.1.149:9300}{ml.machine_memory=32898998272, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[host01.xxx][150.0.1.242:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[host02.xxx][150.0.1.149:9300][internal:index/shard/recovery/prepare_translog]]; nested: IllegalStateException[commit doesn't contain history uuid]; ], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[my-201708][2]: Recovery failed from {host01.xxx}{wsV326J5QxGdk31A7XY38w}{dhXjygn0Sg-tcKICCSRkiw}{host01.xxx}{150.0.1.242:9300}{ml.machine_memory=32898998272, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {host02.xxx}{s-PbDSILSmKRYlTO4_FOvw}{VAJjb_PvTbe03nBZE_HBKA}{host02.xxx}{150.0.1.149:9300}{ml.machine_memory=32898998272, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[host01.xxx][150.0.1.242:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[host02.xxx][150.0.1.149:9300][internal:index/shard/recovery/prepare_translog]]; nested: IllegalStateException[commit doesn't contain history uuid]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [my-201708][2]: Recovery failed from {host01.xxx}{wsV326J5QxGdk31A7XY38w}{dhXjygn0Sg-tcKICCSRkiw}{host01.xxx}{150.0.1.242:9300}{ml.machine_memory=32898998272, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {host02.xxx}{s-PbDSILSmKRYlTO4_FOvw}{VAJjb_PvTbe03nBZE_HBKA}{host02.xxx}{150.0.1.149:9300}{ml.machine_memory=32898998272, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.transport.RemoteTransportException: [host01.xxx][150.0.1.242:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:191) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	... 5 more

We would like to upgrade to ES 6 to resolve CVEs in Lucene; please advise.

Thanks,
Suresh Vytla

@sureshvytla

Sadly, this is a known issue. The fix will be included in 6.3.1. A workaround, in this case, is to rebuild the replicas of the offending index (i.e. my-201708). This can be done by setting the number of replicas to 0, then restoring it to the original value.

Step 1:

PUT /my-201708/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}

Step 2:

PUT /my-201708/_settings
{
    "index" : {
        "number_of_replicas" : 1 // the original value
    }
}
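
After step 2, you can verify that the rebuilt replicas have been assigned again, for example (a quick verification sketch; both are standard APIs):

GET _cat/shards/my-201708?v
GET _cluster/health/my-201708?level=shards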

Hope this helps.

@nhat - Thank you so much for the detailed explanation. We would like to go with the 6.3.1 release; is there any ETA for ES 6.3.1?

@sureshvytla We are working on the new release, but the release date is not yet known.

For future readers, @bleskes has a cleaner workaround for this issue. It consists of these two steps, written out as console requests after the list:

  • Step 1: Force flush the offending index: POST /my-201708/_flush?force=true
  • Step 2: Retry the failed allocations: POST /_cluster/reroute?retry_failed
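
In console form (a minimal sketch of the same two requests; retry_failed=true is just the explicit spelling of the flag above):

POST /my-201708/_flush?force=true
POST /_cluster/reroute?retry_failed=true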
