Try to recover [test-20181128][2] from primary shard with sync id but number of docs differ: 59432 (10.1.1.189, primary) vs 60034(10.1.1.190)


(Jiankunking) #1

We use three ES data nodes with the JVM settings -Xmx30g -Xms30g. Each of the three ES servers has 128 GB of physical memory and 32 CPU cores.

The ES version is 5.4.1.

The following exception was found in the log today:
Caused by: java.lang.IllegalStateException: try to recover [test-20181128][2] from primary shard with sync id but number of docs differ: 59432 (10.1.1.189, primary) vs 60034(10.1.1.190)
    at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:226) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1528) ~[elasticsearch-5.4.1.jar:5.4.1]
    ... 5 more

I do not understand why the primary shard has fewer documents than its replica copy. How can that happen?
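For reference, a minimal sketch of how the per-copy document counts can be checked with the _cat/shards API (Python over the plain HTTP interface; the node address and index name are just this cluster's values):

    # List each copy of the index's shards with its doc count, so the
    # primary/replica mismatch is visible per shard copy.
    import requests

    resp = requests.get(
        "http://10.1.1.189:9200/_cat/shards/test-20181128",
        params={"v": "true", "h": "index,shard,prirep,docs,node"},
    )
    print(resp.text)  # one line per shard copy: index, shard, p/r, doc count, node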


(David Turner) #2

These symptoms could be explained by any of three known issues, all fixed in 6.3.0. In the meantime, you can recover this index by rebuilding its replicas: set number_of_replicas to 0, wait for the replicas to be deleted, and then set it back to its original value to create them again (see the sketch below).
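A minimal sketch of that procedure, using the index settings and cluster health APIs over HTTP from Python; the node address, index name, and ORIGINAL_REPLICAS value are assumptions to adjust for your cluster:

    # Sketch of the replica rebuild: drop replicas, wait for green,
    # then restore the previous replica count.
    import requests

    ES = "http://10.1.1.189:9200"   # any node's HTTP endpoint (assumed)
    INDEX = "test-20181128"
    ORIGINAL_REPLICAS = 1           # whatever the index used before (assumed)

    # 1. Remove the replicas so the out-of-sync copies are deleted.
    requests.put(f"{ES}/{INDEX}/_settings",
                 json={"index": {"number_of_replicas": 0}}).raise_for_status()

    # 2. Wait until the index reports green with no replicas left.
    requests.get(f"{ES}/_cluster/health/{INDEX}",
                 params={"wait_for_status": "green", "timeout": "5m"}).raise_for_status()

    # 3. Restore the replica count; new replicas are copied from the primaries.
    requests.put(f"{ES}/{INDEX}/_settings",
                 json={"index": {"number_of_replicas": ORIGINAL_REPLICAS}}).raise_for_status()

The same three calls can of course be made with curl or from Kibana Dev Tools; the point is that deleting and re-adding the replicas forces them to be rebuilt from the primaries, discarding the inconsistent copies.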