ES upgrade shards issue - ES 5.6.10 to ES 6.3

Team,

We have a two-node ES cluster that was previously upgraded to 5.6.10, and we are now having shard issues while upgrading from ES 5.6.10 to ES 6.3. We have tried both approaches, a rolling upgrade and a full cluster restart, and neither succeeded.
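
For reference, the stuck shards and the reason they cannot be allocated can be listed with the cat shards and cluster allocation explain APIs (a diagnostic sketch; my-201708 is one of the affected indices, taken from the log below):

GET _cat/shards/my-201708?v
GET _cluster/allocation/explain

The replica shards repeatedly fail to recover with the following error: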

[2018-06-20T06:06:00,590][WARN ][o.e.c.r.a.AllocationService] [host01.xxx] failing shard [failed shard, shard [my-201708][2], node[s-PbDSILSmKRYlTO4_FOvw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=aqQb_CrCS-WmyRcrSFtF6w], unassigned_info[[reason=ALLOCATION_FAILED], at[2018-06-20T06:05:59.714Z], failed_attempts[4], delayed=false, details[failed shard on node [s-PbDSILSmKRYlTO4_FOvw]: failed recovery, failure RecoveryFailedException[[my-201708][2]: Recovery failed from {host01.xxx}{wsV326J5QxGdk31A7XY38w}{dhXjygn0Sg-tcKICCSRkiw}{host01.xxx}{150.0.1.242:9300}{ml.machine_memory=32898998272, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {host02.xxx}{s-PbDSILSmKRYlTO4_FOvw}{VAJjb_PvTbe03nBZE_HBKA}{host02.xxx}{150.0.1.149:9300}{ml.machine_memory=32898998272, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[host01.xxx][150.0.1.242:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[host02.xxx][150.0.1.149:9300][internal:index/shard/recovery/prepare_translog]]; nested: IllegalStateException[commit doesn't contain history uuid]; ], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[my-201708][2]: Recovery failed from {host01.xxx}{wsV326J5QxGdk31A7XY38w}{dhXjygn0Sg-tcKICCSRkiw}{host01.xxx}{150.0.1.242:9300}{ml.machine_memory=32898998272, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {host02.xxx}{s-PbDSILSmKRYlTO4_FOvw}{VAJjb_PvTbe03nBZE_HBKA}{host02.xxx}{150.0.1.149:9300}{ml.machine_memory=32898998272, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[host01.xxx][150.0.1.242:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[host02.xxx][150.0.1.149:9300][internal:index/shard/recovery/prepare_translog]]; nested: IllegalStateException[commit doesn't contain history uuid]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [my-201708][2]: Recovery failed from {host01.xxx}{wsV326J5QxGdk31A7XY38w}{dhXjygn0Sg-tcKICCSRkiw}{host01.xxx}{150.0.1.242:9300}{ml.machine_memory=32898998272, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {host02.xxx}{s-PbDSILSmKRYlTO4_FOvw}{VAJjb_PvTbe03nBZE_HBKA}{host02.xxx}{150.0.1.149:9300}{ml.machine_memory=32898998272, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.transport.RemoteTransportException: [host01.xxx][150.0.1.242:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:191) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	... 5 more

We would like to upgrade to ES 6 to resolve CVEs in Lucene; please advise.

Thanks,
Suresh Vytla

@sureshvytla

Sadly, this is a known issue. The fix will be included in 6.3.1. A workaround, in this case, is to rebuild the replicas of the offending index (i.e. my-201708). This can be done by setting the number of replicas to 0, then restoring it to the original value.

Step 1:

PUT /my-201708/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}

Step 2:

PUT /my-201708/_settings
{
    "index" : {
        "number_of_replicas" : 1 // the original value
    }
}
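
After step 2, you can verify that the rebuilt replicas have been assigned again, for example (a quick verification sketch; both are standard APIs):

GET _cat/shards/my-201708?v
GET _cluster/health/my-201708?level=shards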

Hope this helps.

@nhat - Thank you so much for the detailed explanation. We would like to go with the 6.3.1 release; is there any ETA for ES 6.3.1?

@sureshvytla We are working on the new release, but the release date is not yet known.

For future readers, @bleskes has a cleaner workaround for this issue. It consists of these two steps, written out as console requests after the list:

  • Step 1: Force flush the offending index: POST /my-201708/_flush?force=true
  • Step 2: Retry the failed allocations: POST /_cluster/reroute?retry_failed
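
In console form (a minimal sketch of the same two requests; retry_failed=true is just the explicit spelling of the flag above):

POST /my-201708/_flush?force=true
POST /_cluster/reroute?retry_failed=true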
