I was eager to try out the new "Span compression" feature of the Java APM agent version 1.30.0. The good news: it works and it is awesome! The bad news: it appears to cause lockups and timeouts in our application when it is under load:
Task failed after acquiring lock 'xxxxxx' with exception: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 3 second(s)
(stack trace is included at the end of this post)
Disabling span compression with -Delastic.apm.span_compression_enabled=false
makes the timeouts go away. Perhaps the span compression settings could be fine-tuned so that the feature stays enabled but the application still works?
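If tuning is a viable alternative to disabling, I could experiment with lowering the compression thresholds instead of turning the feature off completely. If I read the agent documentation correctly, that would look something like the following (option names taken from the span compression settings; the values are just examples, please correct me if I got them wrong):

-Delastic.apm.span_compression_exact_match_max_duration=5ms
-Delastic.apm.span_compression_same_kind_max_duration=0ms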
Our application implements a custom RedisConnection proxy (see stack trace) which could interfere with the agent.
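For context, the proxy is conceptually similar to the minimal sketch below. The class and method names are taken from the stack trace, but everything else is simplified and only meant to illustrate how the wrapping works (the real implementation also does timing and logging):

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import org.springframework.data.redis.connection.RedisConnection;

// Simplified illustration of our RedisConnectionLoggingProxy: a JDK dynamic
// proxy that delegates every RedisConnection call to the real connection.
public class RedisConnectionLoggingProxy implements InvocationHandler {

    private final RedisConnection target;

    private RedisConnectionLoggingProxy(RedisConnection target) {
        this.target = target;
    }

    public static RedisConnection wrap(RedisConnection target) {
        return (RedisConnection) Proxy.newProxyInstance(
                RedisConnection.class.getClassLoader(),
                new Class<?>[] { RedisConnection.class },
                new RedisConnectionLoggingProxy(target));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        // log the command name and duration here (omitted)
        try {
            return method.invoke(target, args); // delegate to the wrapped connection
        } catch (InvocationTargetException e) {
            throw e.getCause(); // unwrap so callers see the original Redis exception
        }
    }
}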
I haven't had the opportunity to enable debug logs for the agents yet, and I first have to check whether I can share them publicly. I tried to generate a flame graph while the application hangs, but it is generally unresponsive and eventually gets killed by Kubernetes, which makes debugging even more difficult.
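Once I get the go-ahead to share logs, I plan to raise the agent log level. As far as I understand the agent configuration, that should be possible with a system property along these lines (option name as documented for the agent; please correct me if this is not the recommended way):

-Delastic.apm.log_level=debug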
Our application uses Redis heavily, so perhaps the span compression algorithm has problems with a large number of spans? Or maybe there is a deadlock in the agent (or in our application that is only triggered while span compression is active)?
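To gather evidence for the deadlock theory, I can try to capture a thread dump from the affected pod the next time the application hangs, before Kubernetes kills it. Something like the following, using standard JDK tooling (the pod name and PID are placeholders):

kubectl exec <pod-name> -- jstack -l <java-pid> > threaddump.txt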
Kibana version: 7.17.0
Elasticsearch version: 7.17.0
APM Server version: 7.17.0
APM Agent language and version: apm-agent-java 1.30.0
Browser version: irrelevant (latest Chrome)
Original install method (e.g. download page, yum, deb, from source, etc.) and version: agent installed via Gradle and attached with the attach API (a minimal attach sketch follows after the setup details below).
Fresh install or upgraded from other version? upgraded
Is there anything special in your setup? Custom Redis proxy
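Regarding the install method mentioned above: the agent is attached programmatically at startup, roughly like the following (using the apm-agent-attach dependency; simplified, the actual configuration is provided via system properties):

import co.elastic.apm.attach.ElasticApmAttacher;

public class Application {
    public static void main(String[] args) {
        // Attach the Elastic APM agent to the current JVM before the
        // application starts; configuration is picked up from system
        // properties / elasticapm.properties.
        ElasticApmAttacher.attach();
        // ... bootstrap the rest of the application
    }
}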
Description of the problem including expected versus actual behavior. Please include screenshots (if relevant): see the description above.
Steps to reproduce:
- Start application with agent 1.30.0
- Apply (medium) load to the application
- Application becomes unresponsive and unusable. Application logs are spammed with "Cannot connect to Redis", "Redis command timeout after 3 seconds" and similar messages.
Errors in browser console (if relevant): not relevant
Provide logs and/or server output (if relevant):
Stack trace:
org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 3 second(s)
at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:70)
at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:41)
at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:42)
at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:272)
at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.convertLettuceAccessException(LettuceStringCommands.java:800)
at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.get(LettuceStringCommands.java:70)
at org.springframework.data.redis.connection.DefaultedRedisConnection.get(DefaultedRedisConnection.java:267)
at jdk.internal.reflect.GeneratedMethodAccessor822.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at com.company.product.redis.RedisConnectionLoggingProxy.invoke(RedisConnectionLoggingProxy.java:85)
at com.sun.proxy.$Proxy291.get(Unknown Source)
at org.springframework.data.redis.core.DefaultValueOperations$1.inRedis(DefaultValueOperations.java:57)
at org.springframework.data.redis.core.AbstractOperations$ValueDeserializingRedisCallback.doInRedis(AbstractOperations.java:60)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:222)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:189)
at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:96)
at org.springframework.data.redis.core.DefaultValueOperations.get(DefaultValueOperations.java:53)
at com.company.product.services.search.service.cache.impl.ElasticSearchConfigurationListCacheImpl.getLastSyncTime(ElasticSearchConfigurationListCacheImpl.java:81)
at com.company.product.services.search.service.impl.EntityChangeCacheServiceImpl.synchronize(EntityChangeCacheServiceImpl.java:177)
at com.company.product.services.search.service.impl.EntityChangeCacheServiceImpl.lambda$runUpdate$2(EntityChangeCacheServiceImpl.java:140)
at com.company.product.core.cache.DistributedRedisLocker.lambda$executeTaskInTransaction$6(DistributedRedisLocker.java:130)
at com.company.product.transaction.impl.CustomTransactionServiceImpl.handleExecute(CustomTransactionServiceImpl.java:134)
at com.company.product.transaction.impl.CustomTransactionServiceImpl.lambda$executeInTransaction$1(CustomTransactionServiceImpl.java:97)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
at com.company.product.transaction.impl.CustomTransactionServiceImpl.executeInTransaction(CustomTransactionServiceImpl.java:97)
at com.company.product.core.cache.DistributedRedisLocker.executeTaskInTransaction(DistributedRedisLocker.java:124)
at com.company.product.core.cache.DistributedRedisLocker.lambda$tryLockWithTransaction$4(DistributedRedisLocker.java:114)
at com.company.product.core.cache.SimpleRedisLocker.executeTaskWithExceptionLogging(SimpleRedisLocker.java:246)
at com.company.product.core.cache.SimpleRedisLocker.tryExecuteLocked(SimpleRedisLocker.java:189)
at com.company.product.core.cache.SimpleRedisLocker.tryExecuteLockedWithoutWait(SimpleRedisLocker.java:82)
at com.company.product.core.cache.DistributedRedisLocker.lambda$tryLockWithTransaction$5(DistributedRedisLocker.java:112)
at com.company.product.core.repository.CustomRepositoryUtil.executeWithSaveAllowed(CustomRepositoryUtil.java:100)
at com.company.product.core.cache.DistributedRedisLocker.tryLockWithTransaction(DistributedRedisLocker.java:111)
at com.company.product.services.search.service.impl.EntityChangeCacheServiceImpl.lambda$runUpdate$3(EntityChangeCacheServiceImpl.java:138)
at com.company.product.util.core.logging.MDCUtil.lambda$wrap$0(MDCUtil.java:25)
at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1728)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177)
Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 3 second(s)
at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:53)
at io.lettuce.core.internal.Futures.awaitOrCancel(Futures.java:246)
at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:130)
at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
at com.sun.proxy.$Proxy289.get(Unknown Source)
at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.get(LettuceStringCommands.java:68)
... 37 common frames omitted
Please let me know if you need additional information or how I can debug this issue on our test system. As for agent debug logs, I first need to check if I can share them.
This graph shows all logged Redis exceptions over time. The spikes appear shortly after applying load to the system; with lower load it takes longer for the symptoms to appear, but they do appear eventually.