Hi,
I'm running an Elasticsearch 6.2.4 cluster on 8 VMs (hosted on 2 physical servers).
Each VM has 28 GB of memory, of which 15 GB are dedicated to ES.
My cluster contains 80 billion documents (store size = 16.11 TB, fielddata size = 21.4 GB), split across 600 daily indexes.
Each index has 2 shards and 1 replica.
Every day, my system archives an old index by moving it from an SSD storage instance to an HDD storage instance.
4 VMs store data on SSD, and the other 4 VMs store data on HDD.
The HDD ones are used to store older indexes.
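For readers who want to picture the daily move, here is a minimal sketch. It assumes the archive step uses shard allocation filtering on the `storage` node attribute that shows up in the log further down; the post does not say how the move is actually triggered. The helper name `archive_request` is hypothetical, and the snippet only builds the request without contacting a cluster.

```python
import json


def archive_request(index_name, target_storage="HDD"):
    """Build the method, path and body of an index-settings update that
    requires the shards of `index_name` to be allocated on nodes whose
    `storage` attribute equals `target_storage` (hypothetical helper)."""
    # Assumption: nodes are tagged with a custom `storage` attribute (SSD/HDD).
    body = {"index.routing.allocation.require.storage": target_storage}
    return "PUT", f"/{index_name}/_settings", json.dumps(body)


method, path, body = archive_request("index-2018-02-08")
print(method, path, body)
```

Once such a setting is applied, Elasticsearch relocates the shards through peer recovery, which is the code path visible in the stack trace.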
Today I encountered this error:
[2018-08-08T12:04:35,814][WARN ][o.e.i.c.IndicesClusterStateService] [vm-8] [[index-2018-02-08][1]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [index-2018-02-08][1]: Recovery failed from {vm-6}{wPINhZZaTZmGwanoUW5rmA}{vV2Ob95NR9qtZspT_d2BsQ}{IP}{IP:PORT}{physical_server=physical-s2, storage=SSD} into {vm-8}{Bbz4NRC4SQ2h5GyzK5NQYw}{0GuO0WsRR0eSlu_TxE-A_w}{IP}{IP:PORT}{storage=HDD, physical_server=physical-s2}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:288) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:81) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:635) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [vm-6][IP:PORT][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:175) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.2.4.jar:6.2.4]
...
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [104] files with total size of [8.8gb]
at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:419) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:173) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[elasticsearch-6.2.4.jar:6.2.4]
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [vm-8][IP:PORT][internal:index/shard/recovery/file_chunk]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [11260978100/10.4gb], which is larger than the limit of [11243782144/10.4gb]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:230) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1502) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1382) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
...
It seems that ES tries to send 104 files (8.8gb) to the HDD instance, but the transfer trips the [parent] circuit breaker on vm-8, whose limit is 10.4gb.
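To put the two byte counts from the exception in perspective, here is a small arithmetic sketch. The 70% fraction is an assumption (the 6.x default for `indices.breaker.total.limit`, if it has not been overridden on this cluster); the byte values are copied from the log above.

```python
# Figures copied from the CircuitBreakingException message in the log.
would_be_bytes = 11_260_978_100
limit_bytes = 11_243_782_144

# Assumption: the parent breaker limit is the 6.x default of 70% of the JVM heap.
ASSUMED_PARENT_FRACTION = 0.70

GIB = 1024 ** 3
MIB = 1024 ** 2

# Heap size implied by the limit, and how far the request went over the limit.
implied_heap_gib = limit_bytes / ASSUMED_PARENT_FRACTION / GIB
overshoot_mib = (would_be_bytes - limit_bytes) / MIB

print(f"implied heap: {implied_heap_gib:.2f} GiB")
print(f"overshoot:    {overshoot_mib:.1f} MiB")
```

Under that assumption the limit matches the 15 GB heap, and the request that tripped the breaker went over it by only a few MiB. That would suggest the 10.4gb figure is the memory already accounted for on vm-8 (fielddata, requests, in-flight data) rather than the size of the file batch itself.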
Do you know if I can limit the number of files sent in a batch, or if I can increase the limit?
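As a sketch of what "increasing the limit" or slowing the transfer could look like, the snippet below builds a cluster settings update. `indices.breaker.total.limit` and `indices.recovery.max_bytes_per_sec` are existing dynamic Elasticsearch settings; the values shown are illustrative assumptions, not recommendations, and the helper name `breaker_and_recovery_settings` is hypothetical. Nothing is sent to a cluster.

```python
import json


def breaker_and_recovery_settings(total_limit="75%", max_bytes_per_sec="20mb"):
    """Build the method, path and body of a transient cluster-settings update
    (hypothetical helper; the default values are illustrative only)."""
    body = {
        "transient": {
            # Parent circuit breaker limit, as a share of the JVM heap.
            "indices.breaker.total.limit": total_limit,
            # Per-node throttle on recovery traffic.
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
        }
    }
    return "PUT", "/_cluster/settings", json.dumps(body, indent=2)


settings_method, settings_path, settings_body = breaker_and_recovery_settings()
print(settings_method, settings_path)
print(settings_body)
```

Note that the recovery throttle limits bandwidth, not the number of files per recovery, and raising the parent limit only buys headroom if the heap is not already under pressure.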
Has anyone encountered this problem?
Thanks for your help,
Greetings,
Guillaume.