Shards Taking a Long Time to Move Between Nodes - Cloud [7.1.1]

I'm not sure how you found a blank string; typically the setting won't be there at all, which indicates it's set to the default value. The default is fs, which today means hybridfs, which in turn means a combination of niofs and mmapfs that's appropriate for search. I have a suspicion that adjusting this blend towards mmapfs during recoveries might help here.
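If you want to experiment with that, the store type can be set explicitly; index.store.type is a static setting, so it has to be set at index creation time (or on a closed index, or node-wide in elasticsearch.yml). A minimal sketch, with an illustrative index name:

# hypothetical test index; switches it from the default (fs/hybridfs) to mmapfs
curl -X PUT "localhost:9200/recovery-test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.store.type": "mmapfs"
  }
}'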

The stack traces you've shared indicate that the node is now spending more time doing TLS-related activities. They also suggest that it's sending chunks in parallel. Can you share another output from the transport logger like yesterday? Does it look like there's more concurrency here?
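For reference, the transport tracer can be turned up dynamically, and narrowed to the recovery actions so the log stays readable; something along these lines should do it:

# enable the transport tracer and restrict it to the recovery file-chunk actions
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE",
    "transport.tracer.include": "internal:index/shard/recovery/*"
  }
}'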

Can you find the recovery thread in the hot threads output? Even if it's not in the top N threads for CPU usage, it'll still be illuminating to see where it's spending its time after moving to mmapfs.
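If it's easier, you can ask the hot threads API for more threads than the default top 3 and include idle ones, e.g.:

# dump (almost) all threads on warm-2, not just the busiest few
curl -X GET "localhost:9200/_nodes/warm-2/hot_threads?threads=9999&ignore_idle_threads=false"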

So, sticking with mmapfs, I have seen this in the logs:

This is from the sending node.

[2019-06-26T14:01:58,099][TRACE][o.e.t.T.tracer           ] [warm-2] [1528347][internal:index/shard/recovery/file_chunk] received request
[2019-06-26T14:01:58,100][TRACE][o.e.t.T.tracer           ] [warm-2] [1528347][internal:index/shard/recovery/file_chunk] sent response
[2019-06-26T14:01:58,102][TRACE][o.e.t.T.tracer           ] [warm-2] [996686][internal:index/shard/recovery/file_chunk] sent to [{warm-0}{abc}{xyz-g}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=10, ml.max_open_jobs=20, fault_domain=0, xpack.installed=true, box_type=warm}] (timeout: [15m])
[2019-06-26T14:01:58,105][TRACE][o.e.t.T.tracer           ] [warm-2] [996686][internal:index/shard/recovery/file_chunk] received response from [{warm-0}{abc}{xyz-g}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=10, ml.max_open_jobs=20, fault_domain=0, xpack.installed=true, box_type=warm}]
[2019-06-26T14:01:58,108][TRACE][o.e.t.T.tracer           ] [warm-2] [1528349][internal:index/shard/recovery/file_chunk] received request
[2019-06-26T14:01:58,109][TRACE][o.e.t.T.tracer           ] [warm-2] [1528349][internal:index/shard/recovery/file_chunk] sent response
[2019-06-26T14:01:58,111][TRACE][o.e.t.T.tracer           ] [warm-2] [996687][internal:index/shard/recovery/file_chunk] sent to [{warm-0}{abc}{xyz-g}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=10, ml.max_open_jobs=20, fault_domain=0, xpack.installed=true, box_type=warm}] (timeout: [15m])
[2019-06-26T14:01:58,114][TRACE][o.e.t.T.tracer           ] [warm-2] [996687][internal:index/shard/recovery/file_chunk] received response from [{warm-0}{abc}{xyz-g}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=10, ml.max_open_jobs=20, fault_domain=0, xpack.installed=true, box_type=warm}]

And in the hot_threads output I have seen this:

 0.2% (852micros out of 500ms) cpu usage by thread 'elasticsearch[warm-2][generic][T#8]'
 10/10 snapshots sharing following 37 elements
   app//org.apache.lucene.store.DataInput.readBytes(DataInput.java:87)
   app//org.elasticsearch.common.lucene.store.InputStreamIndexInput.read(InputStreamIndexInput.java:73)
   app//org.elasticsearch.indices.recovery.RecoverySourceHandler.sendFiles(RecoverySourceHandler.java:697)
   app//org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:414)
   app//org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:180)
   app//org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:106)
   app//org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:54)
   app//org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:112)
   app//org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:109)
   org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251)
   app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   app//org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192)
   org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:300)
   org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$$Lambda$3228/0x0000000840b0a840.accept(Unknown Source)
   app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)
   org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:375)
   org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:183)
   org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:130)
   org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile$$Lambda$3230/0x0000000840b0b040.accept(Unknown Source)
   app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)
   org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:246)
   org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator$$Lambda$3232/0x0000000840b0b840.accept(Unknown Source)
   org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:306)
   org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator$$Lambda$3233/0x0000000840b0bc40.run(Unknown Source)
   org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:317)
   org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:244)
   org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:196)
   org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:139)
   org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:121)
   org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:307)
   app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
   app//org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1077)
   app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
   app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   java.base@11.0.3/java.lang.Thread.run(Thread.java:834)

The other threads look like this:

4.9% (24.6ms out of 500ms) cpu usage by thread 'elasticsearch[warm-2][transport_worker][T#13]'
 8/10 snapshots sharing following 2 elements
   io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
   java.base@11.0.3/java.lang.Thread.run(Thread.java:834)

0.5% (2.3ms out of 500ms) cpu usage by thread 'elasticsearch[warm-2][transport_worker][T#12]'
 2/10 snapshots sharing following 41 elements
   java.base@11.0.3/java.security.MessageDigestSpi.engineUpdate(MessageDigestSpi.java:115)
   java.base@11.0.3/java.security.MessageDigest$Delegate.engineUpdate(MessageDigest.java:617)
   java.base@11.0.3/java.security.MessageDigest.update(MessageDigest.java:373)
   java.base@11.0.3/com.sun.crypto.provider.HmacCore.engineUpdate(HmacCore.java:201)
   java.base@11.0.3/javax.crypto.Mac.update(Mac.java:554)
   java.base@11.0.3/sun.security.ssl.Authenticator$MacImpl.compute(Authenticator.java:512)
   java.base@11.0.3/sun.security.ssl.Authenticator$TLS10Mac.compute(Authenticator.java:581)
   java.base@11.0.3/sun.security.ssl.Authenticator$MAC.compute(Authenticator.java:444)
   java.base@11.0.3/sun.security.ssl.SSLCipher.addMac(SSLCipher.java:2089)
   java.base@11.0.3/sun.security.ssl.SSLCipher$T11BlockWriteCipherGenerator$BlockWriteCipher.encrypt(SSLCipher.java:1449)
   java.base@11.0.3/sun.security.ssl.OutputRecord.t10Encrypt(OutputRecord.java:366)
   java.base@11.0.3/sun.security.ssl.OutputRecord.encrypt(OutputRecord.java:274)
   java.base@11.0.3/sun.security.ssl.SSLEngineOutputRecord.encode(SSLEngineOutputRecord.java:274)
   java.base@11.0.3/sun.security.ssl.SSLEngineOutputRecord.encode(SSLEngineOutputRecord.java:175)
   java.base@11.0.3/sun.security.ssl.SSLEngineImpl.encode(SSLEngineImpl.java:274)
   java.base@11.0.3/sun.security.ssl.SSLEngineImpl.writeRecord(SSLEngineImpl.java:233)
   java.base@11.0.3/sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:136)
   java.base@11.0.3/sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:116)
   java.base@11.0.3/javax.net.ssl.SSLEngine.wrap(SSLEngine.java:519)
   io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:1023)
   io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:829)
   io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:797)
   io.netty.handler.ssl.SslHandler.flush(SslHandler.java:778)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
   io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
   io.netty.handler.logging.LoggingHandler.flush(LoggingHandler.java:265)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
   io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
   io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
   io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:38)
   io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1152)
   io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1075)
   io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
   io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
   io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
   io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
   java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
 4/10 snapshots sharing following 8 elements
   java.base@11.0.3/sun.nio.ch.EPoll.wait(Native Method)
   java.base@11.0.3/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
   java.base@11.0.3/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
   java.base@11.0.3/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
   io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
   io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
   io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
   java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
 2/10 snapshots sharing following 20 elements
   io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:797)
   io.netty.handler.ssl.SslHandler.flush(SslHandler.java:778)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
   io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
   io.netty.handler.logging.LoggingHandler.flush(LoggingHandler.java:265)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
   io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
   io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
   io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
   io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:38)
   io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1152)
   io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1075)
   io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
   io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
   io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
   io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
   java.base@11.0.3/java.lang.Thread.run(Thread.java:834)

...

0.3% (1.3ms out of 500ms) cpu usage by thread 'elasticsearch[warm-2][generic][T#4]'
 10/10 snapshots sharing following 2 elements
   java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   java.base@11.0.3/java.lang.Thread.run(Thread.java:834)

0.1% (694.4micros out of 500ms) cpu usage by thread 'elasticsearch[warm-2][generic][T#9]'
 8/10 snapshots sharing following 2 elements
   java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   java.base@11.0.3/java.lang.Thread.run(Thread.java:834)

0.0% (27.6micros out of 500ms) cpu usage by thread 'ticker-schedule-trigger-engine'
10/10 snapshots sharing following 2 elements
java.base@11.0.3/java.lang.Thread.sleep(Native Method)
org.elasticsearch.xpack.watcher.trigger.schedule.engine.TickerScheduleTriggerEngine$Ticker.run(TickerScheduleTriggerEngine.java:193)

0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[keepAlive/7.1.1]'
10/10 snapshots sharing following 8 elements
    java.base@11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
    java.base@11.0.3/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
    java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
    java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
    java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
    java.base@11.0.3/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
    app//org.elasticsearch.bootstrap.Bootstrap$1.run(Bootstrap.java:80)
    java.base@11.0.3/java.lang.Thread.run(Thread.java:834)

0.0% (0s out of 500ms) cpu usage by thread 'Connection evictor'
10/10 snapshots sharing following 3 elements
    java.base@11.0.3/java.lang.Thread.sleep(Native Method)
    org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
    java.base@11.0.3/java.lang.Thread.run(Thread.java:834)

0.0% (0s out of 500ms) cpu usage by thread 'process reaper'
10/10 snapshots sharing following 5 elements
    java.base@11.0.3/java.lang.ProcessHandleImpl.waitForProcessExit0(Native Method)
    java.base@11.0.3/java.lang.ProcessHandleImpl$1.run(ProcessHandleImpl.java:138)
    java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    java.base@11.0.3/java.lang.Thread.run(Thread.java:834)

OK, the recovery source thread is using less CPU, but it's still stuck in IO in 10 out of 10 samples:

 0.2% (852micros out of 500ms) cpu usage by thread 'elasticsearch[warm-2][generic][T#8]'
 10/10 snapshots sharing following 37 elements
   app//org.apache.lucene.store.DataInput.readBytes(DataInput.java:87)
   app//org.elasticsearch.common.lucene.store.InputStreamIndexInput.read(InputStreamIndexInput.java:73)
   app//org.elasticsearch.indices.recovery.RecoverySourceHandler.sendFiles(RecoverySourceHandler.java:697)
   app//org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:414)
   app//org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:180)

It looks to me like the problem lies outside Elasticsearch. Can you do any investigation into IO issues on these nodes? How to dig deeper depends on your OS and environment and is outside my area of expertise, but, for instance, how fast can you pull data off the disk using something like dd?

So I ran this on one of the nodes (I assume this is what you meant?):

Drive is mounted: /datadisks/disk1

sudo sync; sudo dd if=/datadisks/disk1/tempfile of=/dev/zero bs=250M count=200; sudo sync

200+0 records in
200+0 records out
52428800000 bytes (52 GB, 49 GiB) copied, 348.574 s, 150 MB/s

This looks to me like it can read from the disk at 150MB/s.

OS: Ubuntu 18.04.2 LTS
DISK: Azure P60 - 16000 IOPS - 500MB/s Throughput

Network-attached storage tends to have higher latency, and Elasticsearch's read sizes are much smaller than 250MB. How quickly can you stream 50GB of data from the disk with bs=16k instead?

I somewhat suspect that what you have suggested is the issue.

sudo sync; sudo dd if=/datadisks/disk1/tempfile of=/dev/zero bs=16k count=3276800; sudo sync
3200000+0 records in
3200000+0 records out
52428800000 bytes (52 GB, 49 GiB) copied, 1296.23 s, 40.4 MB/s

Is there any way to really fix this?

Aha, OK, finally we see something else running at 40MB/s :slight_smile: Does it get correspondingly faster with larger read sizes?

You can't change the fact that niofs (and therefore hybridfs) reads data in 16kB chunks as far as I can see, but I would have expected mmapfs to offer better sequential read performance. Maybe there is some configuration in your system that is limiting the readahead that mmap() would normally perform. I can imagine the default configuration for a network device could well be different from that of the usual locally-attached storage. This kind of tuning is well out of my area of expertise, but a colleague suggested using a tool called fio for deeper investigation of IO issues (rather than dd). In particular, fio can test the performance of sequential reads of mmap()ped files and maybe help you find the right system config for this.
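To make that concrete, something along these lines might be a starting point; the device name and paths below are placeholders for your setup, not values I've verified:

# check the current readahead (in 512-byte sectors) on the data disk, then try increasing it
sudo blockdev --getra /dev/sdc
sudo blockdev --setra 8192 /dev/sdc

# fio sequential read through mmap(), roughly mimicking how mmapfs reads files
fio --name=mmap-seq-read --ioengine=mmap --rw=read --bs=16k --size=10g --directory=/datadisks/disk1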

I'll say again that changing to mmapfs is not a good long-term solution, but if you can get better recovery performance with mmapfs then we might be able to port that performance improvement into the default store type too.

Doing further digging into this.

Previously we were using a single 8TB SSD rated at 16000 IOPS and a throughput of 500MB/s.

Even though this seems more than enough, it would not achieve these numbers, as it couldn't read from the disk fast enough; I'm not sure why.

Reading online, it seems best to attach as many disks as possible to cloud VMs and configure them as RAID0 for performance.

The VM we used allowed us to attach 16 disk drives to it, so we opted to use 16 512GB HDDs, each with 500 IOPS and a throughput of 60MB/s. This has allowed us to achieve a transfer speed of ~102MB/s between nodes of the same configuration.
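One common way to build such a stripe on Linux is mdadm; a rough sketch, with illustrative device names rather than our exact ones:

# stripe the 16 data disks into one RAID0 array and mount it as the Elasticsearch data path
sudo mdadm --create /dev/md0 --level=0 --raid-devices=16 /dev/sd[c-r]
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /datadisks/disk1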

I think the limiting factor now is that the VM itself is limited to ~128MB/s, because when looking at the drives with iostat -dx they don't appear to be at their limit.
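For reference, the sampling was just something like this, watching rkB/s and %util per disk:

# extended per-device stats every 5 seconds
iostat -dx 5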

