Failed to execute watch - Timeout waiting for task

I am noticing random watch timeouts on my cluster, which produce the following traceback:

[2021-07-31T05:08:11,951][DEBUG][o.e.x.w.e.ExecutionService] failed to execute watch [<INSERT RANDOM WATCHER HERE>]
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
	at org.elasticsearch.common.util.concurrent.FutureUtils.get( ~[elasticsearch-7.10.1.jar:7.10.1]
	at ~[elasticsearch-7.10.1.jar:7.10.1]
	at ~[elasticsearch-7.10.1.jar:7.10.1]
	at org.elasticsearch.xpack.watcher.execution.ExecutionService.updateWatchStatus( [x-pack-watcher-7.10.1.jar:7.10.1]
	at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute( [x-pack-watcher-7.10.1.jar:7.10.1]
	at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$executeAsync$5( [x-pack-watcher-7.10.1.jar:7.10.1]
	at org.elasticsearch.xpack.watcher.execution.ExecutionService$ [x-pack-watcher-7.10.1.jar:7.10.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ [elasticsearch-7.10.1.jar:7.10.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker( [?:?]
	at java.util.concurrent.ThreadPoolExecutor$ [?:?]
	at [?:?]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get( ~[elasticsearch-7.10.1.jar:7.10.1]
	at org.elasticsearch.common.util.concurrent.BaseFuture.get( ~[elasticsearch-7.10.1.jar:7.10.1]
	at org.elasticsearch.common.util.concurrent.FutureUtils.get( ~[elasticsearch-7.10.1.jar:7.10.1]
	... 10 more

I think the key part is that it's failing on:

org.elasticsearch.xpack.watcher.execution.ExecutionService.updateWatchStatus( [x-pack-watcher-7.10.1.jar:7.10.1]

Which in the source code equates to:


I assume it is trying to update the watch's document in the .watches index and the write is taking a long time, but I am not 100% sure. Has anyone seen this error before and, if so, found a way to resolve it? The watches are running on cool nodes; the cluster is pretty big, but load doesn't seem to get too high on those nodes.
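One way to sanity-check that assumption (just standard cluster APIs, not something from the traceback) would be to look at whether writes are queueing on those nodes and whether Watcher itself is backing up, e.g.:

```
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected
GET _watcher/stats?metric=queued_watches
```

If the write thread pool shows a deep queue or rejections on the cool nodes while watches are queued, that would point at slow index writes for the .watches status update rather than Watcher itself.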

Just leaving this here in case it helps someone out in the future: this seems most likely due to the nodes being too busy to execute the watches in a timely manner. Throughout the day there is a fair amount of rollover occurring from the warm -> cool nodes. After adjusting some of the allocation and recovery settings to be more conservative (specifically cluster.routing.allocation.cluster_concurrent_rebalance, cluster.routing.allocation.node_concurrent_recoveries, cluster.routing.allocation.node_initial_primaries_recoveries, and indices.recovery.max_bytes_per_sec), the failures occur far less frequently, if at all.
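For reference, the change was applied via the cluster settings API along these lines (the exact values below are illustrative, not the ones from my cluster; tune them for your own hardware and shard sizes):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 1,
    "cluster.routing.allocation.node_concurrent_recoveries": 1,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 2,
    "indices.recovery.max_bytes_per_sec": "20mb"
  }
}
```

The idea is simply to throttle how many shard relocations/recoveries run at once (and how much bandwidth they consume), so the rollover traffic stops starving the watch executions on the cool nodes.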

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.