I'm not sure how you found a blank string; typically the setting won't be there at all, indicating it's set to the default value, which is fs. Today fs means hybridfs, which in turn means a combination of niofs and mmapfs that's appropriate for search. I have a suspicion that adjusting this blend towards mmapfs during recoveries might help here.
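If you want to experiment with that, index.store.type is a static index setting, so on an existing index you have to close it, change the setting, and reopen it. A minimal sketch (the index name and host are placeholders):

    # close the index: static settings like index.store.type can only be changed while it's closed
    curl -X POST 'localhost:9200/my-index/_close'
    # switch this index to mmapfs (my-index is a placeholder)
    curl -X PUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' \
      -d '{"index.store.type": "mmapfs"}'
    # reopen the index so it recovers with the new store type
    curl -X POST 'localhost:9200/my-index/_open'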
The stack traces you've shared indicate that the node is now spending more of its time on TLS-related work, and they suggest that it's sending chunks in parallel. Can you share another sample of output from the transport logger, like yesterday's? Does it look like there's more concurrency here?
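In case you need it again, the transport tracing logger can be toggled dynamically; something like this should work on 7.x (set it back to null when you're done, it's very verbose):

    # enable TRACE output from the transport logger via a transient cluster setting
    curl -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
      -d '{"transient": {"logger.org.elasticsearch.transport.TransportLogger": "TRACE"}}'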
Can you find the recovery thread in the hot threads output? Even if it's not in the top N threads by CPU usage, it'll still be illuminating to see where it's spending its time after the move to mmapfs.
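If it helps, you can ask the hot threads API for every thread rather than just the busiest few, e.g. (node name as in your cluster):

    # sample all threads on warm-2, not just the top few by CPU usage
    curl 'localhost:9200/_nodes/warm-2/hot_threads?threads=9999&interval=500ms'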
   4.9% (24.6ms out of 500ms) cpu usage by thread 'elasticsearch[warm-2][transport_worker][T#13]'
     8/10 snapshots sharing following 2 elements
       io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
       java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
   0.5% (2.3ms out of 500ms) cpu usage by thread 'elasticsearch[warm-2][transport_worker][T#12]'
     2/10 snapshots sharing following 41 elements
       java.base@11.0.3/java.security.MessageDigestSpi.engineUpdate(MessageDigestSpi.java:115)
       java.base@11.0.3/java.security.MessageDigest$Delegate.engineUpdate(MessageDigest.java:617)
       java.base@11.0.3/java.security.MessageDigest.update(MessageDigest.java:373)
       java.base@11.0.3/com.sun.crypto.provider.HmacCore.engineUpdate(HmacCore.java:201)
       java.base@11.0.3/javax.crypto.Mac.update(Mac.java:554)
       java.base@11.0.3/sun.security.ssl.Authenticator$MacImpl.compute(Authenticator.java:512)
       java.base@11.0.3/sun.security.ssl.Authenticator$TLS10Mac.compute(Authenticator.java:581)
       java.base@11.0.3/sun.security.ssl.Authenticator$MAC.compute(Authenticator.java:444)
       java.base@11.0.3/sun.security.ssl.SSLCipher.addMac(SSLCipher.java:2089)
       java.base@11.0.3/sun.security.ssl.SSLCipher$T11BlockWriteCipherGenerator$BlockWriteCipher.encrypt(SSLCipher.java:1449)
       java.base@11.0.3/sun.security.ssl.OutputRecord.t10Encrypt(OutputRecord.java:366)
       java.base@11.0.3/sun.security.ssl.OutputRecord.encrypt(OutputRecord.java:274)
       java.base@11.0.3/sun.security.ssl.SSLEngineOutputRecord.encode(SSLEngineOutputRecord.java:274)
       java.base@11.0.3/sun.security.ssl.SSLEngineOutputRecord.encode(SSLEngineOutputRecord.java:175)
       java.base@11.0.3/sun.security.ssl.SSLEngineImpl.encode(SSLEngineImpl.java:274)
       java.base@11.0.3/sun.security.ssl.SSLEngineImpl.writeRecord(SSLEngineImpl.java:233)
       java.base@11.0.3/sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:136)
       java.base@11.0.3/sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:116)
       java.base@11.0.3/javax.net.ssl.SSLEngine.wrap(SSLEngine.java:519)
       io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:1023)
       io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:829)
       io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:797)
       io.netty.handler.ssl.SslHandler.flush(SslHandler.java:778)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
       io.netty.handler.logging.LoggingHandler.flush(LoggingHandler.java:265)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
       io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:38)
       io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1152)
       io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1075)
       io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
       io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
       io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
       io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
       java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
     4/10 snapshots sharing following 8 elements
       java.base@11.0.3/sun.nio.ch.EPoll.wait(Native Method)
       java.base@11.0.3/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
       java.base@11.0.3/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
       java.base@11.0.3/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
       io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
       io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
       io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
       java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
     2/10 snapshots sharing following 20 elements
       io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:797)
       io.netty.handler.ssl.SslHandler.flush(SslHandler.java:778)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
       io.netty.handler.logging.LoggingHandler.flush(LoggingHandler.java:265)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
       io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:38)
       io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1152)
       io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1075)
       io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
       io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
       io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
       io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
       java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
   0.3% (1.3ms out of 500ms) cpu usage by thread 'elasticsearch[warm-2][generic][T#4]'
     10/10 snapshots sharing following 2 elements
       java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
   0.1% (694.4micros out of 500ms) cpu usage by thread 'elasticsearch[warm-2][generic][T#9]'
     8/10 snapshots sharing following 2 elements
       java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
   0.0% (27.6micros out of 500ms) cpu usage by thread 'ticker-schedule-trigger-engine'
     10/10 snapshots sharing following 2 elements
       java.base@11.0.3/java.lang.Thread.sleep(Native Method)
       org.elasticsearch.xpack.watcher.trigger.schedule.engine.TickerScheduleTriggerEngine$Ticker.run(TickerScheduleTriggerEngine.java:193)
   0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[keepAlive/7.1.1]'
     10/10 snapshots sharing following 8 elements
       java.base@11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@11.0.3/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
       java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
       java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
       java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
       java.base@11.0.3/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
       app//org.elasticsearch.bootstrap.Bootstrap$1.run(Bootstrap.java:80)
       java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
   0.0% (0s out of 500ms) cpu usage by thread 'Connection evictor'
     10/10 snapshots sharing following 3 elements
       java.base@11.0.3/java.lang.Thread.sleep(Native Method)
       org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
       java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
   0.0% (0s out of 500ms) cpu usage by thread 'process reaper'
     10/10 snapshots sharing following 5 elements
       java.base@11.0.3/java.lang.ProcessHandleImpl.waitForProcessExit0(Native Method)
       java.base@11.0.3/java.lang.ProcessHandleImpl$1.run(ProcessHandleImpl.java:138)
       java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
Ok, the recovery source thread is using less CPU, but it's still stuck on IO in 10 snapshots out of 10:
   0.2% (852micros out of 500ms) cpu usage by thread 'elasticsearch[warm-2][generic][T#8]'
     10/10 snapshots sharing following 37 elements
       app//org.apache.lucene.store.DataInput.readBytes(DataInput.java:87)
       app//org.elasticsearch.common.lucene.store.InputStreamIndexInput.read(InputStreamIndexInput.java:73)
       app//org.elasticsearch.indices.recovery.RecoverySourceHandler.sendFiles(RecoverySourceHandler.java:697)
       app//org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:414)
       app//org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:180)
It looks to me like the problem lies outside Elasticsearch. Can you do any investigation into IO issues on these nodes? How to dig deeper depends on your OS and environment and is outside my area of expertise, but, for instance, how fast can you pull data off the disk using something like dd?
Network-attached storage tends to have higher latency, and Elasticsearch's read sizes are much smaller than 250MB. How quickly can you stream 50GB of data from the disk with bs=16k instead?
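To be concrete, I mean something like the following; the file path is just a placeholder, and iflag=direct bypasses the page cache so you measure the disk rather than memory:

    # stream 50GB in 16kB reads, roughly matching Elasticsearch's niofs read size
    # (the input path is a placeholder; point it at a large file on the data disk)
    dd if=/var/lib/elasticsearch/somefile of=/dev/null bs=16k count=3276800 iflag=direct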
Aha, ok, finally we see something else: it's running at 40MB/s. Does it get correspondingly faster with larger read sizes?
You can't change the fact that niofs (and therefore hybridfs) reads data in 16kB chunks as far as I can see, but I would have expected mmapfs to offer better sequential read performance. Maybe there is some configuration in your system that is limiting the readahead that mmap() would normally perform. I can imagine the default configuration for a network device could well be different from that of the usual locally-attached storage. This kind of tuning is well out of my area of expertise, but a colleague suggested using a tool called fio for deeper investigation of IO issues (rather than dd). In particular, fio can test the performance of sequential reads of mmap()ped files and maybe help you find the right system config for this.
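For instance, something along these lines might be a reasonable starting point; the device name, readahead value, and paths here are assumptions rather than recommendations:

    # check and raise the readahead on the device backing the data path
    # (value is in 512-byte sectors, so 4096 sectors = 2MB of readahead)
    blockdev --getra /dev/md0
    blockdev --setra 4096 /dev/md0

    # fio's mmap engine does sequential reads through mmap(), which is a
    # closer match to mmapfs than dd; directory is a placeholder
    fio --name=mmap-seqread --ioengine=mmap --rw=read --bs=16k --size=50g \
        --directory=/var/lib/elasticsearch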
I'll say again that changing to mmapfs is not a good long-term solution, but if you can get better recovery performance with mmapfs then we might be able to port that performance improvement into the default store type too.
Previously we were using a single 8TB SSD rated at 16,000 IOPS and 500MB/s of throughput.
Even though that seems more than enough, we never achieved those numbers: it couldn't read from the disk fast enough, and we're not sure why.
Reading online, it seems best to attach as many disks as possible to cloud VMs and configure them as RAID0 for performance.
The VM we used allowed us to attach 16 disk drives, so we opted for sixteen 512GB HDDs, each with 500 IOPS and 60MB/s of throughput. This has allowed us to achieve a transfer speed of ~102MB/s between nodes of the same configuration.
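For anyone following along, the array setup was along these lines; the device names and filesystem here are illustrative, not necessarily the exact ones we used:

    # stripe 16 data disks (sdb through sdq, illustrative names) into one RAID0 array
    mdadm --create /dev/md0 --level=0 --raid-devices=16 /dev/sd[b-q]
    # put a filesystem on the array and mount it as the Elasticsearch data path
    mkfs.xfs /dev/md0
    mount /dev/md0 /var/lib/elasticsearch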
I think the limiting factor now is that the VM itself is capped at ~128MB/s, because when looking at the drives with iostat -dx they don't appear to be at their limits.
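This is roughly what we were watching:

    # extended per-device stats every 5 seconds; rkB/s and %util show how
    # close each disk in the array is to its limit
    iostat -dx 5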