Span compression potentially breaks Redis instrumentation?

I was eager to try out the new "Span compression" feature of the Java APM agent version 1.30.0. The good news: it works and it is awesome! The bad news: it looks like it leads to lockups and timeouts in our application when it is under load :frowning:

Task failed after acquiring lock 'xxxxxx' with exception: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 3 second(s)

(stack trace is included at the end of this post)

Disabling span compression with -Delastic.apm.span_compression_enabled=false makes the timeouts go away. Perhaps the span compression settings could be fine-tuned so that the feature does not have to be disabled completely while the application still keeps working?
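
If I read the 1.30.0 documentation correctly, the knobs for that would look something like the following (a sketch only; the property names are my assumption from the docs, and they would have to be set before the agent is attached to have any effect):

    // Sketch only: option names are my reading of the 1.30.0 docs and may be wrong.
    // These properties must be set before the agent is attached.
    public final class SpanCompressionTuning {
        public static void apply() {
            // keep the feature enabled ...
            System.setProperty("elastic.apm.span_compression_enabled", "true");
            // ... but only compress very short exact-match spans (assumed option name)
            System.setProperty("elastic.apm.span_compression_exact_match_max_duration", "5ms");
            // ... and very short same-kind spans (assumed option name)
            System.setProperty("elastic.apm.span_compression_same_kind_max_duration", "1ms");
        }
    }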

Our application implements a custom RedisConnection proxy (see stack trace) which could interfere with the agent.

I haven't had the opportunity to enable debug logs for the agent yet, and I first have to check whether I can share them publicly. I tried to generate a flame graph while the application hangs, but it is generally unresponsive and is eventually killed by Kubernetes, which makes debugging even more difficult.

Our application uses Redis heavily; perhaps the span compression algorithm has problems with a larger number of spans? Or maybe there's a deadlock in the agent (or in our application) that is only triggered while span compression is active?

Kibana version: 7.17.0

Elasticsearch version: 7.17.0

APM Server version: 7.17.0

APM Agent language and version: apm-agent-java 1.30.0

Browser version: irrelevant (latest Chrome)

Original install method (e.g. download page, yum, deb, from source, etc.) and version: agent added via Gradle and attached with the attach API.
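
For context, the programmatic attach looks roughly like this (a simplified sketch, not our exact bootstrap code; I'm assuming the ElasticApmAttacher.attach(Map) overload from the apm-agent-attach artifact here, and the values are placeholders):

    import co.elastic.apm.attach.ElasticApmAttacher;

    import java.util.HashMap;
    import java.util.Map;

    // Simplified sketch of our startup code; service name and options are placeholders.
    public final class AgentBootstrap {
        public static void main(String[] args) {
            Map<String, String> config = new HashMap<>();
            config.put("service_name", "my-service");        // placeholder
            config.put("span_compression_enabled", "true");  // 1.30.0 default
            ElasticApmAttacher.attach(config);                // self-attach at startup
            // ... the Spring application is started afterwards
        }
    }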

Fresh install or upgraded from other version? upgraded

Is there anything special in your setup? Custom Redis proxy

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

Steps to reproduce:

  1. Start application with agent 1.30.0
  2. Apply (medium) load to the application
  3. Application becomes unresponsive and unusable. Application logs are spammed with "Cannot connect to Redis", "Redis command timeout after 3 seconds" and similar messages.

Errors in browser console (if relevant): not relevant

Provide logs and/or server output (if relevant):

Full stack trace:

org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 3 second(s)
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:70)
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:41)
	at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
	at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:42)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:272)
	at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.convertLettuceAccessException(LettuceStringCommands.java:800)
	at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.get(LettuceStringCommands.java:70)
	at org.springframework.data.redis.connection.DefaultedRedisConnection.get(DefaultedRedisConnection.java:267)
	at jdk.internal.reflect.GeneratedMethodAccessor822.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.company.product.redis.RedisConnectionLoggingProxy.invoke(RedisConnectionLoggingProxy.java:85)
	at com.sun.proxy.$Proxy291.get(Unknown Source)
	at org.springframework.data.redis.core.DefaultValueOperations$1.inRedis(DefaultValueOperations.java:57)
	at org.springframework.data.redis.core.AbstractOperations$ValueDeserializingRedisCallback.doInRedis(AbstractOperations.java:60)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:222)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:189)
	at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:96)
	at org.springframework.data.redis.core.DefaultValueOperations.get(DefaultValueOperations.java:53)
	at com.company.product.services.search.service.cache.impl.ElasticSearchConfigurationListCacheImpl.getLastSyncTime(ElasticSearchConfigurationListCacheImpl.java:81)
	at com.company.product.services.search.service.impl.EntityChangeCacheServiceImpl.synchronize(EntityChangeCacheServiceImpl.java:177)
	at com.company.product.services.search.service.impl.EntityChangeCacheServiceImpl.lambda$runUpdate$2(EntityChangeCacheServiceImpl.java:140)
	at com.company.product.core.cache.DistributedRedisLocker.lambda$executeTaskInTransaction$6(DistributedRedisLocker.java:130)
	at com.company.product.transaction.impl.CustomTransactionServiceImpl.handleExecute(CustomTransactionServiceImpl.java:134)
	at com.company.product.transaction.impl.CustomTransactionServiceImpl.lambda$executeInTransaction$1(CustomTransactionServiceImpl.java:97)
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
	at com.company.product.transaction.impl.CustomTransactionServiceImpl.executeInTransaction(CustomTransactionServiceImpl.java:97)
	at com.company.product.core.cache.DistributedRedisLocker.executeTaskInTransaction(DistributedRedisLocker.java:124)
	at com.company.product.core.cache.DistributedRedisLocker.lambda$tryLockWithTransaction$4(DistributedRedisLocker.java:114)
	at com.company.product.core.cache.SimpleRedisLocker.executeTaskWithExceptionLogging(SimpleRedisLocker.java:246)
	at com.company.product.core.cache.SimpleRedisLocker.tryExecuteLocked(SimpleRedisLocker.java:189)
	at com.company.product.core.cache.SimpleRedisLocker.tryExecuteLockedWithoutWait(SimpleRedisLocker.java:82)
	at com.company.product.core.cache.DistributedRedisLocker.lambda$tryLockWithTransaction$5(DistributedRedisLocker.java:112)
	at com.company.product.core.repository.CustomRepositoryUtil.executeWithSaveAllowed(CustomRepositoryUtil.java:100)
	at com.company.product.core.cache.DistributedRedisLocker.tryLockWithTransaction(DistributedRedisLocker.java:111)
	at com.company.product.services.search.service.impl.EntityChangeCacheServiceImpl.lambda$runUpdate$3(EntityChangeCacheServiceImpl.java:138)
	at com.company.product.util.core.logging.MDCUtil.lambda$wrap$0(MDCUtil.java:25)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1728)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177)
Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 3 second(s)
	at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:53)
	at io.lettuce.core.internal.Futures.awaitOrCancel(Futures.java:246)
	at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:130)
	at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
	at com.sun.proxy.$Proxy289.get(Unknown Source)
	at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.get(LettuceStringCommands.java:68)
	... 37 common frames omitted

Please let me know if you need additional information or how I can debug this issue on our test system. As for agent debug logs, I first need to check if I can share them.

This graph shows all logged Redis exceptions. The spikes appear shortly after applying load to the system. With lower load it takes longer for the symptoms to show, but they will show eventually.

Could you get a full "all threads" stack dump while this is happening and see whether any threads are in co.elastic.apm.agent* classes, and exactly what those calls are doing? The span compression is lock free and is implemented to skip compression rather than retry, specifically to ensure application threads proceed with minimal delay, so if it is the cause there must be a bug in it. 3 seconds is huge and I can't see anywhere that span compression would make that happen, so we need to know what else is happening in the other threads.
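
To illustrate what I mean by lock free and skipping rather than retrying, conceptually it is a single attempt along these lines (a simplified sketch with stand-in types, not the agent's actual code):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Simplified sketch with stand-in types -- not the agent's actual code.
    // The design intent: make one lock-free attempt to fold a finished span
    // into its previous sibling, and skip compression entirely (rather than
    // wait or retry) if the sibling cannot be claimed.
    class CompressionSketch {
        private final AtomicBoolean claimed = new AtomicBoolean(false);

        boolean tryCompressInto(CompressionSketch previousSibling) {
            // Single compare-and-set attempt; no locks, no spinning.
            if (!previousSibling.claimed.compareAndSet(false, true)) {
                return false; // skip: report this span uncompressed
            }
            try {
                // in the real agent, timings and counts would be merged here
                return true;
            } finally {
                previousSibling.claimed.set(false);
            }
        }
    }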

Hi Jack,

thanks for the hint. I managed to get a stack trace and found something that could be relevant:

"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=552777.47ms elapsed=1091.18s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
   java.lang.Thread.State: RUNNABLE
        at co.elastic.apm.agent.impl.transaction.Span.tryToCompressComposite(Span.java:419)
        at co.elastic.apm.agent.impl.transaction.Span.tryToCompressRegular(Span.java:388)
        at co.elastic.apm.agent.impl.transaction.Span.tryToCompress(Span.java:350)
        at co.elastic.apm.agent.impl.transaction.Span.afterEnd(Span.java:325)
        at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:476)
        at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:458)
        at co.elastic.apm.agent.lettuce.Lettuce5StopSpanInstrumentation$OnComplete$AdviceClass.beforeComplete(Lettuce5StopSpanInstrumentation.java:78)
        at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@11.0.7/DirectMethodHandle$Holder)
        at java.lang.invoke.LambdaForm$MH/0x00000008009d9040.linkToTargetMethod(java.base@11.0.7/LambdaForm$MH)
        at io.lettuce.core.protocol.AsyncCommand.complete(AsyncCommand.java:110)
        at io.lettuce.core.protocol.CommandWrapper.complete(CommandWrapper.java:63)
        at io.lettuce.core.cluster.ClusterCommand.complete(ClusterCommand.java:65)
        at io.lettuce.core.protocol.CommandWrapper.complete(CommandWrapper.java:63)
        at io.lettuce.core.protocol.CommandHandler.complete(CommandHandler.java:742)
        at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:677)
        at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:594)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)

This thread seems to hog one CPU core; its CPU time increases by roughly 1000 ms per second:

$ for i in {1..100}; do jcmd backbone Thread.print | grep 'lettuce-epoll' | grep '#71'; sleep 1s; done
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=693676.34ms elapsed=1233.93s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=694999.72ms elapsed=1235.28s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=696301.22ms elapsed=1236.61s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=697612.38ms elapsed=1237.94s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=698895.99ms elapsed=1239.27s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=700217.29ms elapsed=1240.64s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=701508.52ms elapsed=1241.95s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=702801.55ms elapsed=1243.28s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=704110.61ms elapsed=1244.61s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=705429.77ms elapsed=1245.95s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=706726.65ms elapsed=1247.27s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=708040.76ms elapsed=1248.61s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=709338.02ms elapsed=1249.93s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=710636.44ms elapsed=1251.25s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=711951.10ms elapsed=1252.59s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=713267.90ms elapsed=1253.94s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=714576.60ms elapsed=1255.29s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=715855.25ms elapsed=1256.62s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=717133.53ms elapsed=1257.93s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=718414.54ms elapsed=1259.26s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
# ...
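
(For example, between the first and the last sample above, the CPU time grows by 718414.54 - 693676.34 ≈ 24738 ms while the elapsed time grows by 1259.26 - 1233.93 ≈ 25.3 s, i.e. roughly 977 ms of CPU per wall-clock second.)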

For comparison, the other lettuce threads have used far less CPU time (mostly in the low hundreds of milliseconds):

$ jcmd backbone Thread.print | grep 'lettuce-epoll'
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=4958.22ms elapsed=1382.32s tid=0x00007f9150038800 nid=0xcc runnable  [0x00007f91658fb000]
"lettuce-epollEventLoop-4-2" #71 daemon prio=5 os_prio=0 cpu=839027.41ms elapsed=1382.28s tid=0x00007f91500b5800 nid=0xcd runnable  [0x00007f9164de9000]
"lettuce-epollEventLoop-4-3" #72 daemon prio=5 os_prio=0 cpu=216.20ms elapsed=1382.28s tid=0x00007f91500c5800 nid=0xce runnable  [0x00007f9164ce9000]
"lettuce-epollEventLoop-4-4" #79 daemon prio=5 os_prio=0 cpu=139.96ms elapsed=1381.93s tid=0x00007f9140156800 nid=0xd4 runnable  [0x00007f91646e3000]
"lettuce-epollEventLoop-4-5" #80 daemon prio=5 os_prio=0 cpu=106.72ms elapsed=1381.92s tid=0x00007f9140162800 nid=0xd5 runnable  [0x00007f91645e2000]
"lettuce-epollEventLoop-4-6" #81 daemon prio=5 os_prio=0 cpu=233.81ms elapsed=1381.92s tid=0x00007f9140163800 nid=0xd6 runnable  [0x00007f91644e1000]
"lettuce-epollEventLoop-4-7" #83 daemon prio=5 os_prio=0 cpu=230.59ms elapsed=1381.74s tid=0x00007f9150307800 nid=0xd8 runnable  [0x00007f91647e4000]
"lettuce-epollEventLoop-4-8" #84 daemon prio=5 os_prio=0 cpu=3947.48ms elapsed=1381.58s tid=0x00007f9150150800 nid=0xd9 runnable  [0x00007f9144dfb000]
"lettuce-epollEventLoop-11-1" #250 daemon prio=5 os_prio=0 cpu=7.63ms elapsed=1313.50s tid=0x00007f8f28067000 nid=0x20b runnable  [0x00007f8f5b5f4000]
"lettuce-epollEventLoop-11-2" #252 daemon prio=5 os_prio=0 cpu=184.21ms elapsed=1313.38s tid=0x00007f8f282e0800 nid=0x20d runnable  [0x00007f8f5a3f1000]
"lettuce-epollEventLoop-12-1" #255 daemon prio=5 os_prio=0 cpu=8.53ms elapsed=1313.34s tid=0x00007f8f2833b000 nid=0x210 runnable  [0x00007f8f590ed000]
"lettuce-epollEventLoop-12-2" #257 daemon prio=5 os_prio=0 cpu=12.11ms elapsed=1313.33s tid=0x00007f8f2838e000 nid=0x212 runnable  [0x00007f8f58eeb000]
"lettuce-epollEventLoop-13-1" #260 daemon prio=5 os_prio=0 cpu=8.67ms elapsed=1313.31s tid=0x00007f8f282f6000 nid=0x215 runnable  [0x00007f8f58be8000]
"lettuce-epollEventLoop-13-2" #262 daemon prio=5 os_prio=0 cpu=158.78ms elapsed=1313.30s tid=0x00007f8f2827b000 nid=0x217 runnable  [0x00007f90570f1000]

Thanks, that's definitely a bug. It looks to be in object recycling, which will take time to track down. If you can provide agent logs at log level TRACE that would be helpful - but those would be BIG logs

Even DEBUG level would help

I'll try to get hold of DEBUG logs for the agent, but it might take a few days.

@Jack_Shirazi do you still need the full thread dump (in case the bug is somewhere else)? I have created a (cleaned-up) version of it. I would need to upload it somewhere (or send it by email).

Thanks!

Could the first few lines of the method Span#tryToCompressComposite be problematic?

    private boolean tryToCompressComposite(Span sibling) {
        String compressionStrategy;
        do {
            compressionStrategy = composite.getCompressionStrategy();
        } while (compressionStrategy == null);

What if compressionStrategy is never set and remains null, or is only set very late? The method would busy-spin in this loop and block further execution.
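
To make the concern concrete, here is a tiny stand-alone illustration of the pattern (stand-in types only, not the agent's classes). If the strategy stays null, it spins forever and pins one core, much like what we see on the lettuce event-loop thread above:

    // Stand-alone illustration with stand-in types -- not the agent's classes.
    // If getCompressionStrategy() keeps returning null (e.g. because the
    // composite has been recycled and blanked), the do/while loop below never
    // exits and pins the calling thread at 100% CPU.
    final class BusySpinIllustration {
        interface Composite {
            String getCompressionStrategy();
        }

        static void spinUntilStrategySet(Composite composite) {
            String strategy;
            do {
                strategy = composite.getCompressionStrategy();
            } while (strategy == null); // never exits if the value stays null
        }

        public static void main(String[] args) {
            // A recycled/blank composite that always reports null:
            spinUntilStrategySet(() -> null); // spins forever, burning one core
        }
    }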

I don't need the full thread dump. The code there is the issue: it's only null when the object has been recycled, hence the need to identify why it has been recycled prematurely.

@Jack_Shirazi I finally managed to get a DEBUG log which exhibits the problem. It was not easy: without DEBUG logging, the issue can usually be reproduced within 1 or 2 minutes, but with DEBUG logging it didn't show. Eventually, after 12 or so hours, I had a service instance which locked up again.

From the thread dump, it looks like 2 threads are stuck in Span#tryToCompressComposite.

I trimmed the log file down to 700k lines (1 minute before the first lockup + 1 minute after). Gzipped, that's 8 MB (22 MB for the untrimmed version). How can I get the log file to you for analysis? Or is there anything I can analyze myself in the log file, i.e. any messages to look out for?

Thanks

Primarily the recycling, which should be reported by BookkeeperObjectPool

The DEBUG log files (the current .log.json plus the history .log.json.1) do not contain any lines with "Bookkeeper" or "ObjectPool" :frowning:

Perhaps I need to go for TRACE? But those logs are massive.

I'll try to generate the logs and reproduce the bug. Thanks for the pointers!

DEBUG should be fine. Search for "pool", which should also find "pooled". I'm creating a workaround which I hope you can test, and I think I have a handle on where the recycling is premature, so hopefully I'll have a fix later

If you are able to test the snapshot build, that would be great (specifically elastic-apm-agent-1.30.1-SNAPSHOT.jar).
This is just a workaround rather than a full fix.

I have deployed the workaround SNAPSHOT of the agent in a test cluster now and will leave it running overnight. So far, no lockups were observed (uptime: 30 minutes). Looking at several thread dumps, tryToCompressComposite does not show up anymore.

Thanks so far, I will get back with an update tomorrow morning (UTC/CEST).

@Jack_Shirazi The application has been running for almost 20h now with the agent snapshot version, alternating between medium and high load – I didn't observe any lockups! :slight_smile:
Thank you very much.

Thanks, and just to confirm: the compression of spans in the UI still shows up as before?

Hi @Jack_Shirazi, yes it does! I checked some transactions in Kibana and can see the compressed spans:

(two screenshots showing compressed spans in Kibana)

Lovely, thanks! Would you be willing to test the proposed fix? Artifacts are here: Jenkins. It specifically doesn't include the workaround, so that we can quickly see whether it fails.

Sure. I should be able to do a test deployment next week and will report back with the results

@Jack_Shirazi unfortunately, with the new snapshot (the one without the initial workaround), the application locks up again in the tryToCompressComposite method:

$ for i in {1..10}; do jcmd application Thread.print | grep -B2 CompressComposite; sleep 1s; done | grep lettuce
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=500784.69ms elapsed=678.15s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=502063.59ms elapsed=679.44s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=503342.88ms elapsed=680.73s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=504620.96ms elapsed=682.02s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=505881.41ms elapsed=683.29s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=507156.95ms elapsed=684.57s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=508470.72ms elapsed=685.89s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=509742.32ms elapsed=687.18s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=510999.74ms elapsed=688.44s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]
"lettuce-epollEventLoop-4-1" #70 daemon prio=5 os_prio=0 cpu=512252.02ms elapsed=689.70s tid=0x00007fa3b42e9000 nid=0xd0 runnable  [0x00007fa3c21f9000]