Stuck "Cancelled Tasks" In ElasticSearch 8.6.2 causing Cluster failure

Hi - We are running Elasticsearch 8.6.2 on Azure AKS and have noticed some serious issues lately.

As of last week (the week of June 30, 2023), we started noticing huge unresolved search transport queues in our cluster. The queues grow indefinitely until the thread pools eventually start rejecting work.

The issue happens sporadically and only seems to occur in prod under high volume. We have a ~48-data-node cluster with 106 shards, targeting ~50GB per shard.

We did not make any major infra changes when this issue started occurring. I checked the hot_threads output, as was suggested elsewhere, and it was not very informative on the nodes with queues. I did notice, however, that the _tasks API shows a huge backlog of tasks that all look like this:

{
    "completed": false,
    "task": {
        "node": "zSJ2QgYtQs20KBXjGaW8Cw",
        "id": 86156415,
        "type": "transport",
        "action": "indices:data/read/search",
        "description": "SEARCH QUERY WAS HERE",
        "start_time_in_millis": 1688403074632,
        "running_time_in_nanos": 772230329010,
        "cancellable": true,
        "cancelled": true,
        "headers": {
            "trace.id": "f09e49b1d06cd3f71ef30869971229d7"
        }
    }
}

As you can see, the task is marked as 'cancelled', but it still persists for over 900 seconds. You would think Elasticsearch would notice this and eventually have a circuit breaker that times it out?

There are hundreds and hundreds of these tasks that pile up before the cluster starts falling over.

I also tried taking the query from the task and running it myself, and it returns promptly. So I don't think the issue is specific to the query.

We think there are transient/sporadic connection issues occurring that cause connections between nodes to be timed out, leaving the task stuck waiting on a response indefinitely.

Restarting affected data nodes temporarily fixes the issue, but it does not seem to be a sustainable solution, as we have been having to do this throughout the week to prevent prod outages and it has taken a considerable amount of time to just keep the queues at bay.

How can we configure Elasticsearch so that if a task is marked 'cancelled', it actually goes away, and these ghost/orphan tasks don't just pile up and kill our cluster?

It's not a config thing; this should just happen, and if it doesn't then that's a bug. To investigate further, could you grab GET _tasks?detailed=true and GET _nodes/hot_threads?threads=9999 every minute or so?

Hmm, actually, having said that it's not a config issue: Linux has a silly default that means it can take 900+ seconds between a connection dropping and userspace being notified about the drop. This page of the manual has more details. You'd see messages about dropped connections in the logs if this were the cause.
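
If it does turn out to be this, the setting involved is net.ipv4.tcp_retries2, whose default of 15 retransmissions works out to that 900+ second window. Purely as a sketch, assuming you can set sysctls on the AKS hosts, the usual mitigation looks like:

# check the current value (the stock Linux default is 15)
sysctl net.ipv4.tcp_retries2
# the Elasticsearch docs suggest 5, which gives a retransmission timeout of roughly six seconds
sysctl -w net.ipv4.tcp_retries2=5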

You'd need to be a bit more precise about what you mean by "kill our cluster" though. The other recent thread on this topic has the zombie tasks consuming a lot of CPU, but that wouldn't happen if they were just waiting for a dead connection to time out. They'd potentially hold on to a lot of heap, causing GC pressure, but wouldn't themselves consume any CPU.

Sure thing David, let me fill in the gaps. To be more precise, what happens is that we see unresolved queues form on data nodes, shown below.

Over time, these queues rapidly grow and fill up with tasks, eventually causing searches and writes to fail with es_rejected_execution_exception. So that's what I mean by 'kill our cluster'; apologies for the poor choice of wording. In the example here, the queue rapidly increases, and the only way we have found to resolve it so far has been restarting the data node:

In our metrics we see that CPU/memory/GC all look fine on the nodes with queues, but the thread pool size increases until it gets capped, presumably by the VM limitation. Once that happens, we see spikes in rejected threads, and searches and writes against the indices on the affected nodes report failed shards with es_rejected_execution_exception in the logs.

With regard to the TCP retransmission link you gave, I believe this may apply to us? For node-to-node communication we use an Azure Internal Load Balancer. It hasn't caused any issues before (3+ years of running in prod), but while troubleshooting this I noticed that the idle timeout on the load balancer facilitating node-to-node communication is 4 minutes. Should we look into enabling transport.ping_schedule as described in that post?

A small queue actually just started forming on one of the data nodes. I've posted the relevant output from GET _tasks?detailed=true on that node, along with its hot_threads, below:

hot_threads:

::: {api-es-data-z3-2}{CtydY-1CTPy1x3awIq_iPg}{Pfrk46euSoq7AKgCe72eoQ}{api-es-data-z3-2}{10.95.43.222}{10.95.43.222:9300}{d}{zone=centralus-3, xpack.installed=true, k8s_node_name=aks-deligibilv2x-34581791-vmss000012}
   Hot threads at 2023-07-03T18:49:38.315Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

Task list:

{
    "nodes": {
        "CtydY-1CTPy1x3awIq_iPg": {
            "name": "api-es-data-z3-2",
            "transport_address": "10.95.43.222:9300",
            "host": "10.95.43.222",
            "ip": "10.95.43.222:9300",
            "roles": [
                "data"
            ],
            "attributes": {
                "zone": "centralus-3",
                "xpack.installed": "true",
                "k8s_node_name": "aks-deligibilv2x-34581791-vmss000012"
            },
            "tasks": {
                "CtydY-1CTPy1x3awIq_iPg:10819493": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10819493,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][5]]",
                    "start_time_in_millis": 1688403070853,
                    "running_time_in_nanos": 7049719215980,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "7sSP-uvqRA-Ik67M9CL3OA:91790841",
                    "headers": {
                        "trace.id": "2b49d0802f9482573f5794a468812125"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10827808": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10827808,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403088631,
                    "running_time_in_nanos": 7031941971591,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "rnAUEDT3Raayg5ynPb8j3Q:95663610",
                    "headers": {
                        "trace.id": "89a3862fa5f172479710686e691edf81"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10833324": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10833324,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403192935,
                    "running_time_in_nanos": 6927637796032,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "7sSP-uvqRA-Ik67M9CL3OA:91817035",
                    "headers": {
                        "trace.id": "c36ee7005b1f0f39d0f29a9853b462ad"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10847788": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10847788,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403835514,
                    "running_time_in_nanos": 6285058475997,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "CCUYE8gJRf2abmLTD_Szow:86756752",
                    "headers": {
                        "trace.id": "42e5ea9163c15d20c02b516139eb70e7"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10827817": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10827817,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_individual-apple-entitlement_7_att][0]]",
                    "start_time_in_millis": 1688403088660,
                    "running_time_in_nanos": 7031912240343,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "zSJ2QgYtQs20KBXjGaW8Cw:86159696",
                    "headers": {
                        "trace.id": "4f97eff83779a446bb9208ccfd6472ef"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10819754": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10819754,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][5]]",
                    "start_time_in_millis": 1688403071188,
                    "running_time_in_nanos": 7049384466290,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "mvMscRQDSgyeMrAF7QbFMw:101523708",
                    "headers": {
                        "trace.id": "2b49d0802f9482573f5794a468812125"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10991144": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10991144,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688409564326,
                    "running_time_in_nanos": 556246364018,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "mvMscRQDSgyeMrAF7QbFMw:103017306",
                    "headers": {
                        "trace.id": "a66da37ebd441e4c1759f7d9564ad019"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10830261": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10830261,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_4][0]]",
                    "start_time_in_millis": 1688403114302,
                    "running_time_in_nanos": 7006270948463,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "7sSP-uvqRA-Ik67M9CL3OA:91799606",
                    "headers": {
                        "trace.id": "fd792f587943d35979f833dbbca95bdd"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10828214": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10828214,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_orage-selection_7][0]]",
                    "start_time_in_millis": 1688403091145,
                    "running_time_in_nanos": 7029427692772,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "sDHLyXapRd2ftA2Yfgj0Vg:84269158",
                    "headers": {
                        "trace.id": "78871c70671ab8f15a86840c91a00e70"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10829361": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10829361,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_4][0]]",
                    "start_time_in_millis": 1688403099704,
                    "running_time_in_nanos": 7020868127848,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "8hTxZ7GUQHi9JXZLbPw0aw:99055191",
                    "headers": {
                        "trace.id": "258ecd481f4279244977332659feaa5e"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10832306": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10832306,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_individual-apple-entitlement_7_att][0]]",
                    "start_time_in_millis": 1688403165139,
                    "running_time_in_nanos": 6955433710215,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "sDHLyXapRd2ftA2Yfgj0Vg:84286164",
                    "headers": {
                        "trace.id": "493ba7436438ecbe5f41d767ea57f674"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10832818": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10832818,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403177664,
                    "running_time_in_nanos": 6942908850450,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "zSJ2QgYtQs20KBXjGaW8Cw:86180663",
                    "headers": {
                        "trace.id": "c905ac4d219d8e6563ab247ae7b17303"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10837554": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10837554,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403368666,
                    "running_time_in_nanos": 6751906696618,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "jvd4XtsUSDKUIF5Nd6kqgQ:94616807",
                    "headers": {
                        "trace.id": "32fb019494ccb52229705fddce27a01a"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10970557": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10970557,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_4][0]]",
                    "start_time_in_millis": 1688408774790,
                    "running_time_in_nanos": 1345782589192,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "8hTxZ7GUQHi9JXZLbPw0aw:100366074",
                    "headers": {
                        "trace.id": "455f67f4cdd28368d53cc8d4510493d2"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10835000": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10835000,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403259922,
                    "running_time_in_nanos": 6860650842310,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "sDHLyXapRd2ftA2Yfgj0Vg:84308698",
                    "headers": {
                        "trace.id": "31b3b1b83e080e3084483b9137279f6c"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10846904": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10846904,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403796655,
                    "running_time_in_nanos": 6323917112776,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "aegXmdtXRsaOQCVbQoc_1Q:92041275",
                    "headers": {
                        "trace.id": "9236549b2974b036b0c3bbd7885d640c"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10826683": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10826683,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][5]]",
                    "start_time_in_millis": 1688403084582,
                    "running_time_in_nanos": 7035990919218,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "jvd4XtsUSDKUIF5Nd6kqgQ:94554206",
                    "headers": {
                        "trace.id": "784492f9f353c680f6a5956cb66fd96b"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10834491": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10834491,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][5]]",
                    "start_time_in_millis": 1688403237221,
                    "running_time_in_nanos": 6883351042586,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "8hTxZ7GUQHi9JXZLbPw0aw:99087068",
                    "headers": {
                        "trace.id": "4931f7a12c7ce0140164c9744df519e0"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10843136": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10843136,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403617868,
                    "running_time_in_nanos": 6502704957404,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "sDHLyXapRd2ftA2Yfgj0Vg:84399086",
                    "headers": {
                        "trace.id": "905f6c9d881cde48d66c9caae08f8c94"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10859405": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10859405,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688404370749,
                    "running_time_in_nanos": 5749823972321,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "CCUYE8gJRf2abmLTD_Szow:86896339",
                    "headers": {
                        "trace.id": "132149b539a539debba54fd55edf6f28"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10826889": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10826889,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_individual-apple-entitlement_7_att][0]]",
                    "start_time_in_millis": 1688403085388,
                    "running_time_in_nanos": 7035184437544,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "sDHLyXapRd2ftA2Yfgj0Vg:84268093",
                    "headers": {
                        "trace.id": "caac88decb84f22170a68ce6f88a02c5"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10843272": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10843272,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403623995,
                    "running_time_in_nanos": 6496577249913,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "sDHLyXapRd2ftA2Yfgj0Vg:84400927",
                    "headers": {
                        "trace.id": "90b0cfb8d60b3330ac991d55661c626b"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10819595": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10819595,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403070967,
                    "running_time_in_nanos": 7049605678138,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "8hTxZ7GUQHi9JXZLbPw0aw:99048909",
                    "headers": {
                        "trace.id": "2ff8d05a0e4646a6ca6d8cbcfb26501f"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10834570": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10834570,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403240062,
                    "running_time_in_nanos": 6880510033381,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "rnAUEDT3Raayg5ynPb8j3Q:95701166",
                    "headers": {
                        "trace.id": "b5cb2e74b84862adf0d283ff15196306"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10842378": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10842378,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403589622,
                    "running_time_in_nanos": 6530950945314,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "8hTxZ7GUQHi9JXZLbPw0aw:99172931",
                    "headers": {
                        "trace.id": "e95961b353e0209dced23faf0a902f78"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10822549": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10822549,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][5]]",
                    "start_time_in_millis": 1688403075923,
                    "running_time_in_nanos": 7044649695201,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "jvd4XtsUSDKUIF5Nd6kqgQ:94552295",
                    "headers": {
                        "trace.id": "6fb7ec10bcc54be73290570b50b873a1"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10857239": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10857239,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688404276667,
                    "running_time_in_nanos": 5843905573697,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "jvd4XtsUSDKUIF5Nd6kqgQ:94836598",
                    "headers": {
                        "trace.id": "5deecf9a6b7bbb54006f485e25bac0c6"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10838418": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10838418,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403404074,
                    "running_time_in_nanos": 6716498472954,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "zSJ2QgYtQs20KBXjGaW8Cw:86226663",
                    "headers": {
                        "trace.id": "a5d608d25c643e6dce301ad5cc186186"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10827928": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10827928,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403089035,
                    "running_time_in_nanos": 7031537442761,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "CCUYE8gJRf2abmLTD_Szow:86562777",
                    "headers": {
                        "trace.id": "734f2a95c8ef96fa98c9ebbabb6db8e7"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10835685": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10835685,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403294373,
                    "running_time_in_nanos": 6826199089150,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "jvd4XtsUSDKUIF5Nd6kqgQ:94600163",
                    "headers": {
                        "trace.id": "e5aed99d433e38833e34d102e2b35928"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10830560": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10830560,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403120864,
                    "running_time_in_nanos": 6999708284197,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "CCUYE8gJRf2abmLTD_Szow:86569795",
                    "headers": {
                        "trace.id": "f2feee7c58805a210efb3e26d61d7779"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10833632": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10833632,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403203253,
                    "running_time_in_nanos": 6917319041867,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "aegXmdtXRsaOQCVbQoc_1Q:91902610",
                    "headers": {
                        "trace.id": "3c7666c165783f4b823efca782c2ab89"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10819939": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10819939,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403071564,
                    "running_time_in_nanos": 7049008575932,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "rnAUEDT3Raayg5ynPb8j3Q:95660027",
                    "headers": {
                        "trace.id": "fdc64484018887d5c2fcc7789d17c249"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10833133": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10833133,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][5]]",
                    "start_time_in_millis": 1688403186789,
                    "running_time_in_nanos": 6933783266309,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "aegXmdtXRsaOQCVbQoc_1Q:91898971",
                    "headers": {
                        "trace.id": "4c4441a743764c754cc5e84c47860e23"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10829550": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10829550,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403101497,
                    "running_time_in_nanos": 7019075885629,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "aegXmdtXRsaOQCVbQoc_1Q:91879563",
                    "headers": {
                        "trace.id": "9cc6fae6444c9ae57a85cf0ef32d224f"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10826089": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10826089,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][5]]",
                    "start_time_in_millis": 1688403082588,
                    "running_time_in_nanos": 7037984370017,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "mvMscRQDSgyeMrAF7QbFMw:101525768",
                    "headers": {
                        "trace.id": "c9519a69b60bd1859422ab576bf73787"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10832361": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10832361,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403167173,
                    "running_time_in_nanos": 6953399590682,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "zSJ2QgYtQs20KBXjGaW8Cw:86178638",
                    "headers": {
                        "trace.id": "f52dac7dc3cc973da7ed97989264b34c"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10847600": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10847600,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403828605,
                    "running_time_in_nanos": 6291967342695,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "rnAUEDT3Raayg5ynPb8j3Q:95843156",
                    "headers": {
                        "trace.id": "072a39b78243469f596996d777f5fa54"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:11000944": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 11000944,
                    "type": "transport",
                    "action": "cluster:monitor/tasks/lists[n]",
                    "description": "",
                    "start_time_in_millis": 1688410120572,
                    "running_time_in_nanos": 214913,
                    "cancellable": true,
                    "cancelled": false,
                    "parent_task_id": "mvMscRQDSgyeMrAF7QbFMw:103144293",
                    "headers": {}
                },
                "CtydY-1CTPy1x3awIq_iPg:10843005": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10843005,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][11]]",
                    "start_time_in_millis": 1688403610469,
                    "running_time_in_nanos": 6510103078894,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "7sSP-uvqRA-Ik67M9CL3OA:91902527",
                    "headers": {
                        "trace.id": "8bc73c733bb72a09c770e1ad4ff46747"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10826236": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10826236,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_20][5]]",
                    "start_time_in_millis": 1688403083034,
                    "running_time_in_nanos": 7037538839070,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "aegXmdtXRsaOQCVbQoc_1Q:91875327",
                    "headers": {
                        "trace.id": "4d0e1d5313dc62c9018d6306f46cabaa"
                    }
                },
                "CtydY-1CTPy1x3awIq_iPg:10829688": {
                    "node": "CtydY-1CTPy1x3awIq_iPg",
                    "id": 10829688,
                    "type": "transport",
                    "action": "indices:data/read/search[phase/query]",
                    "description": "shardId[[datastore_strawberry_strawberry_4][0]]",
                    "start_time_in_millis": 1688403103065,
                    "running_time_in_nanos": 7017507465605,
                    "cancellable": true,
                    "cancelled": true,
                    "parent_task_id": "CCUYE8gJRf2abmLTD_Szow:86566027",
                    "headers": {
                        "trace.id": "4d5d31994bd341a2b11d3275b219c9a9"
                    }
                },
               ....
                }
            }
        }
    }
}

Let me know if this helps or if I can provide more info.

Side note - do you think upgrading to 8.8 would address this, per this PR from the changelog:

Not at all, but thanks for the extra precision. This is not consistent with lost responses, because the search threadpool queue consists entirely of tasks waiting for a free thread to run them: a task waiting for a response does not sit in this queue.

The hot threads output you shared did not include the ?threads=9999 parameter, so unfortunately it doesn't contain any useful information.

I do not think that specific PR has any relationship with this situation. It only pertains to tasks which have a transport-layer timeout (internal to Elasticsearch) and none of the tasks involved in a search have that kind of timeout.

Apologies for the screwup - here is the hot_threads output from a node with a queue as of a few minutes ago, this time with ?threads=9999 set properly; hopefully it contains more helpful info:

::: {api-es-data-z3-2}{CtydY-1CTPy1x3awIq_iPg}{gCxMPRKlSBazEarOoqS4UA}{api-es-data-z3-2}{10.95.43.214}{10.95.43.214:9300}{d}{zone=centralus-3, xpack.installed=true, k8s_node_name=aks-deligibilv2x-34581791-vmss000012}
   Hot threads at 2023-07-03T19:29:15.352Z, interval=500ms, busiestThreads=9999, ignoreIdleThreads=true:
   
    2.0% [cpu=2.0%, other=0.0%] (10ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#12]'
     3/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    2.0% [cpu=2.0%, other=0.0%] (9.8ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#9]'
     2/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    1.9% [cpu=1.9%, other=0.0%] (9.4ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#19]'
     6/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    1.7% [cpu=1.7%, other=0.0%] (8.6ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#40]'
     9/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    1.7% [cpu=1.7%, other=0.0%] (8.5ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#6]'
     8/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    1.5% [cpu=1.5%, other=0.0%] (7.3ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#27]'
     3/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    1.4% [cpu=1.4%, other=0.0%] (7ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#25]'
     2/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    1.4% [cpu=1.4%, other=0.0%] (6.8ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#30]'
     5/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    1.1% [cpu=1.1%, other=0.0%] (5.3ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#3]'
     2/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    0.8% [cpu=0.8%, other=0.0%] (4ms out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][search][T#15]'
     10/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    0.0% [cpu=0.0%, other=0.0%] (30.5micros out of 500ms) cpu usage by thread 'ticker-schedule-trigger-engine'
     10/10 snapshots sharing following 3 elements
       java.base@19.0.2/java.lang.Thread.sleep0(Native Method)
       java.base@19.0.2/java.lang.Thread.sleep(Thread.java:465)
       org.elasticsearch.xpack.watcher.trigger.schedule.engine.TickerScheduleTriggerEngine$Ticker.run(TickerScheduleTriggerEngine.java:195)
   
    0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'Connection evictor'
     10/10 snapshots sharing following 4 elements
       java.base@19.0.2/java.lang.Thread.sleep0(Native Method)
       java.base@19.0.2/java.lang.Thread.sleep(Thread.java:465)
       org.apache.httpcomponents.httpclient@4.5.13/org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'x-pack-ml-controller-stderr-pump'
     10/10 snapshots sharing following 17 elements
       java.base@19.0.2/java.io.FileInputStream.readBytes(Native Method)
       java.base@19.0.2/java.io.FileInputStream.read(FileInputStream.java:293)
       java.base@19.0.2/java.io.BufferedInputStream.read1(BufferedInputStream.java:308)
       java.base@19.0.2/java.io.BufferedInputStream.implRead(BufferedInputStream.java:382)
       java.base@19.0.2/java.io.BufferedInputStream.read(BufferedInputStream.java:367)
       java.base@19.0.2/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:333)
       java.base@19.0.2/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:376)
       java.base@19.0.2/sun.nio.cs.StreamDecoder.lockedRead(StreamDecoder.java:219)
       java.base@19.0.2/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:173)
       java.base@19.0.2/java.io.InputStreamReader.read(InputStreamReader.java:189)
       java.base@19.0.2/java.io.BufferedReader.fill(BufferedReader.java:161)
       java.base@19.0.2/java.io.BufferedReader.implReadLine(BufferedReader.java:371)
       java.base@19.0.2/java.io.BufferedReader.readLine(BufferedReader.java:348)
       java.base@19.0.2/java.io.BufferedReader.readLine(BufferedReader.java:437)
       app/org.elasticsearch.server@8.6.2/org.elasticsearch.bootstrap.Spawner.lambda$startPumpThread$3(Spawner.java:108)
       app/org.elasticsearch.server@8.6.2/org.elasticsearch.bootstrap.Spawner$$Lambda$361/0x000000080118eb90.run(Unknown Source)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'x-pack-ml-controller-stdout-pump'
     10/10 snapshots sharing following 17 elements
       java.base@19.0.2/java.io.FileInputStream.readBytes(Native Method)
       java.base@19.0.2/java.io.FileInputStream.read(FileInputStream.java:293)
       java.base@19.0.2/java.io.BufferedInputStream.read1(BufferedInputStream.java:308)
       java.base@19.0.2/java.io.BufferedInputStream.implRead(BufferedInputStream.java:382)
       java.base@19.0.2/java.io.BufferedInputStream.read(BufferedInputStream.java:367)
       java.base@19.0.2/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:333)
       java.base@19.0.2/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:376)
       java.base@19.0.2/sun.nio.cs.StreamDecoder.lockedRead(StreamDecoder.java:219)
       java.base@19.0.2/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:173)
       java.base@19.0.2/java.io.InputStreamReader.read(InputStreamReader.java:189)
       java.base@19.0.2/java.io.BufferedReader.fill(BufferedReader.java:161)
       java.base@19.0.2/java.io.BufferedReader.implReadLine(BufferedReader.java:371)
       java.base@19.0.2/java.io.BufferedReader.readLine(BufferedReader.java:348)
       java.base@19.0.2/java.io.BufferedReader.readLine(BufferedReader.java:437)
       app/org.elasticsearch.server@8.6.2/org.elasticsearch.bootstrap.Spawner.lambda$startPumpThread$3(Spawner.java:108)
       app/org.elasticsearch.server@8.6.2/org.elasticsearch.bootstrap.Spawner$$Lambda$361/0x000000080118eb90.run(Unknown Source)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'process reaper (pid 122)'
     10/10 snapshots sharing following 6 elements
       java.base@19.0.2/java.lang.ProcessHandleImpl.waitForProcessExit0(Native Method)
       java.base@19.0.2/java.lang.ProcessHandleImpl$1.run(ProcessHandleImpl.java:163)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
       java.base@19.0.2/jdk.internal.misc.InnocuousThread.run(InnocuousThread.java:186)
   
    0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'elasticsearch[api-es-data-z3-2][write][T#18]'
     3/10 snapshots sharing following 2 elements
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'ml-cpp-log-tail-thread'
     10/10 snapshots sharing following 6 elements
       java.base@19.0.2/java.io.FileInputStream.readBytes(Native Method)
       java.base@19.0.2/java.io.FileInputStream.read(FileInputStream.java:269)
       org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler.tailStream(CppLogMessageHandler.java:107)
       org.elasticsearch.xpack.ml.process.NativeController.lambda$tailLogsInThread$0(NativeController.java:97)
       org.elasticsearch.xpack.ml.process.NativeController$$Lambda$4218/0x0000000801b137a8.run(Unknown Source)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'elasticsearch[keepAlive/8.6.2]'
     10/10 snapshots sharing following 8 elements
       java.base@19.0.2/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@19.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
       java.base@19.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
       java.base@19.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1047)
       java.base@19.0.2/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
       app/org.elasticsearch.server@8.6.2/org.elasticsearch.bootstrap.Elasticsearch.lambda$new$2(Elasticsearch.java:428)
       app/org.elasticsearch.server@8.6.2/org.elasticsearch.bootstrap.Elasticsearch$$Lambda$5890/0x0000000801f29720.run(Unknown Source)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
   
    0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'Thread-1'
     10/10 snapshots sharing following 8 elements
       java.base@19.0.2/java.io.FileInputStream.readBytes(Native Method)
       java.base@19.0.2/java.io.FileInputStream.read(FileInputStream.java:293)
       java.base@19.0.2/java.io.BufferedInputStream.fill(BufferedInputStream.java:255)
       java.base@19.0.2/java.io.BufferedInputStream.implRead(BufferedInputStream.java:289)
       java.base@19.0.2/java.io.BufferedInputStream.read(BufferedInputStream.java:276)
       app/org.elasticsearch.server@8.6.2/org.elasticsearch.bootstrap.Elasticsearch.lambda$startCliMonitorThread$0(Elasticsearch.java:346)
       app/org.elasticsearch.server@8.6.2/org.elasticsearch.bootstrap.Elasticsearch$$Lambda$6891/0x0000000802116600.run(Unknown Source)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)

You mention the transport-layer timeouts - we are using an internally built Elasticsearch plugin (built with the esplugin Gradle feature). On every search, the plugin calls a small REST API running in the same Kubernetes cluster as the Elasticsearch cluster. That REST API then queries a smaller index within the same cluster to assist with security/filtering of the data from our larger indices.

Would this fit that classification for the cancelled tasks if it timed out? I.e., the plugin waiting indefinitely for a response from the REST API?

Hmm it doesn't, although this time that's the fault of the stack-collapsing behaviour of this API rather than anything you can change. But that's still pretty strange. Could you grab a few full thread dumps of the queueing node with jstack instead?

Are you only interacting with REST APIs, no transport ones? A custom plugin is kind of an all-bets-are-off situation, but I can't think of a way for a plugin to inject transport-layer timeouts into the processing of a search request.

Hmm it doesn't, although this time that's the fault of the stack-collapsing behaviour of this API rather than anything you can change. But that's still pretty strange. Could you grab a few full thread dumps of the queueing node with jstack instead?

Unfortunately one of our team members just bounced the node to resolve the queues, but once another queue forms, which I'm sure it will, I will run jstack on the queueing node and ping you with the output.

Are you only interacting with REST APIs, no transport ones? A custom plugin is kind of an all-bets-are-off situation, but I can't think of a way for a plugin to inject transport-layer timeouts into the processing of a search request.

Yep, totally understood that a custom plugin is outside the bounds of your support. Not entirely sure about the transport API question, but I don't think so? The flow goes:

Search to "Core" Data node ----> Plugin invoked via search ----> Azure Load Balancer -------> call Rest API behind the LB ----> Rest API queries smaller ES cluster via k8s service discovery -------> Return result to plugin -----> Plugin concludes the original search on Core data node

I do think it's worth homing in on the Azure load balancer. We use the same load balancer implementation both for the REST API invoked by the plugin and for Elasticsearch node-to-node communication in each Elasticsearch cluster (separate ports and load balancers for each).

I noticed that the documentation you linked explicitly mentions that network devices which terminate idle connections can cause issues, and I'm seeing that the load balancer's idle timeout is 4 minutes with a maximum of 30, and it can't be disabled.

Do you think it's worth looking into the transport.ping_schedule config described here to try to fight this? Maybe increase the idle timeout to 30 minutes on the Azure side and set a ping schedule that is less than the idle timeout? We haven't adjusted these settings since going to prod in 2021, but the circumstances led me here.
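
For concreteness, the change I'm weighing would be roughly the following in elasticsearch.yml on each node (value illustrative; nothing has been applied anywhere yet):

# application-level keepalive pings on node-to-node transport connections,
# kept well under the load balancer's 4-minute idle timeout
transport.ping_schedule: 1m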

At the end of the day, though, our primary concern is just clearing up these cancelled tasks. If there's anything we can do on our end to terminate these things and make them go away, we can automate it and prevent the buildup. But it doesn't seem there's much we can do from the user side once cancelled is set to true by the task management API, which has already happened for all of these.

I'd strongly recommend not putting a load balancer between ES nodes at all. It can cause a lot of trouble and has no meaningful benefits. transport.ping_schedule is kind of a workaround to get clusters out of a hole but it's not a great long-term solution IMO.

I hear you but I can't recommend any action without understanding the problem first.

Oh just to add that I can think of ways that a subtle bug in a plugin could cause the effect you describe, and have seen other plugins with such bugs in the past. I would expect such a bug to be reasonably obvious from the stack dump, if collected at the right time.

I'd strongly recommend not putting a load balancer between ES nodes at all. It can cause a lot of trouble and has no meaningful benefits. transport.ping_schedule is kind of a workaround to get clusters out of a hole but it's not a great long-term solution IMO.

Got it, thanks for the guidance on that point. We will stay away from this config then.

Oh just to add that I can think of ways that a subtle bug in a plugin could cause the effect you describe, and have seen other plugins with such bugs in the past. I would expect such a bug to be reasonably obvious from the stack dump, if collected at the right time.

Now this is encouraging to hear. A data point you should probably be aware of: we were able to significantly improve the issue over the weekend by adding maxIdleTime settings, as well as custom error handling, to the REST API invoked by our plugin so that it always returns an empty response instead of a 500 status code when there is an exception. The issue was happening all over the place before; once we did that, it became isolated to the cluster that has by far the most search traffic.

That brings me to the plugin implementation itself. I am wondering whether the httpClient configured inside it is not properly pooling connections, timing them out, or otherwise handling them correctly.

Are there any ways you know of to enable logging for a custom plugin? I've tried searching around and haven't found much. Our plugin is written in Java 17 and compiled via the Gradle esplugin - can I add some options to the JVM flags to have it write logs on its data node whenever it encounters an exception or whenever a log4j statement is invoked?
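
The closest thing I've found so far is the dynamic logger.* cluster setting; if that also works for plugin packages, I'm guessing it would look something like this (package name is hypothetical, and I haven't tried it yet):

PUT _cluster/settings
{
    "persistent": {
        "logger.com.mycompany.securityplugin": "DEBUG"
    }
}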

Feel free to disregard the code snippet below if this is too far in the weeds for your domain, but this is the call that we think is timing out inside the plugin and causing the queues:

public HttpAuthorizer(PluginConfig pluginConfig) {
        this.pluginConfig = pluginConfig;
        this.jsonSerde = new JsonSerde();

        PoolingHttpClientConnectionManager poolingConnManager
                = new PoolingHttpClientConnectionManager();
        poolingConnManager.setMaxTotal(10);

        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectTimeout(Math.toIntExact(pluginConfig.getConnectTimeoutMs()))
                .build();

        this.httpClient = HttpClients.custom().setConnectionManager(poolingConnManager)
                .setDefaultRequestConfig(requestConfig)
                .build();
    }

    @Override
    public Boolean authorize(RecordAuthorizationRequest request) {
        String json = jsonSerde.requestToJson(request);

        return AccessController.doPrivileged((PrivilegedAction<Boolean>) () -> {
            Boolean authorized = false; // Defaults to no access

            HttpPost httpRequest = new HttpPost(URI.create(pluginConfig.getUri()));
            httpRequest.setHeader("Content-Type", "application/json");
            try {
                httpRequest.setEntity(new StringEntity(json));

                CloseableHttpResponse response;
                try {
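                    // Synchronous, blocking call; with no socketTimeout configured, the
                    // calling (search) thread can wait here indefinitely if the remote
                    // service or the connection stalls.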
                    response = httpClient.execute(httpRequest);
                } catch (Exception e) {
                    throw new SecurityPluginException("Unexpected exception retrieving permission data from data authorization service", e);
                }
                if (response.getStatusLine().getStatusCode() == 200) {
                    authorized = jsonSerde.parseResponse(EntityUtils.toString(response.getEntity(), "UTF-8"));
                } else {
                    throw new SecurityPluginException(
                            "Unexpected http response status "
                                    + response.getStatusLine().getStatusCode()
                                    + " returned from authorization service. Response body "
                                    + response.getEntity().toString()
                    );
                }
            }
            catch (IOException e) {
                throw new SecurityPluginException("Exception forming Elasticsearch Authorization request", e);
            }
            return authorized;
        });
    }

Perhaps this implementation could cause the issue. One thing that immediately came to mind when I reviewed it is that there is no connectionRequestTimeout or socketTimeout configured on the HTTP client. Our plugin had to switch HTTP libraries for this call when we upgraded to the Elasticsearch 8 baseline, which makes me suspect it, and I'd like to log/track the connection stats and exceptions if possible.
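
For context, this is roughly the direction I'm thinking of taking the client - a sketch only, assuming the Apache HttpClient 4.5.x API that the snippet above appears to use, with illustrative timeout values: explicit socket and connection-request timeouts plus eviction of idle/expired pooled connections.

PoolingHttpClientConnectionManager poolingConnManager = new PoolingHttpClientConnectionManager();
poolingConnManager.setMaxTotal(10);
// Re-validate pooled connections that have sat idle before handing them out again
poolingConnManager.setValidateAfterInactivity(5_000);

RequestConfig requestConfig = RequestConfig.custom()
        .setConnectTimeout(Math.toIntExact(pluginConfig.getConnectTimeoutMs()))
        .setConnectionRequestTimeout(2_000) // max wait to lease a connection from the pool
        .setSocketTimeout(5_000)            // max wait for response data; unset means wait forever
        .build();

this.httpClient = HttpClients.custom()
        .setConnectionManager(poolingConnManager)
        .setDefaultRequestConfig(requestConfig)
        .evictIdleConnections(30, TimeUnit.SECONDS) // close pooled connections idle longer than this
        .evictExpiredConnections()
        .build();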

Is this happening on a thread owned by Elasticsearch (e.g. a search or indexing thread)? If so, that's exactly the bug I was thinking of. You must never block any of Elasticsearch's threads on a remote request like this. Either fork this work onto a thread that you own (and don't mind blocking) or perform your HTTP request with an async API.

This happens on a search thread, I believe, yes. Every search sent by our clients invokes the plugin with their security-related info to determine whether they have access to each document; that info is then passed through the httpClient I shared above to our REST API.

Either fork this work onto a thread that you own (and don't mind blocking) or perform your HTTP request with an async API.

I'm a bit unfamiliar with this topic but I will do some research. Is forking the work something that could be done within the same custom plugin without causing any issues, as long as I use a thread pool that isn't earmarked for searching and indexing?
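
From some initial reading, the async flavour of the same library would look roughly like the sketch below - assuming the httpasyncclient 4.x API, untested, and I'm not yet sure how the callback would be wired back into the plugin given that our script ultimately has to return a boolean:

CloseableHttpAsyncClient asyncClient = HttpAsyncClients.createDefault();
asyncClient.start();

HttpPost httpRequest = new HttpPost(URI.create(pluginConfig.getUri()));
httpRequest.setHeader("Content-Type", "application/json");
httpRequest.setEntity(new StringEntity(json, StandardCharsets.UTF_8));

// The calling thread is not blocked; the callback fires when the response
// arrives, the request fails, or it is cancelled.
asyncClient.execute(httpRequest, new FutureCallback<HttpResponse>() {
    @Override
    public void completed(HttpResponse response) {
        // parse the response and hand the result back to our own code
    }

    @Override
    public void failed(Exception ex) {
        // treat as "not authorized" and/or surface the error
    }

    @Override
    public void cancelled() {
        // request was cancelled before completing
    }
});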

It looks like the plugin is classified as a FilterScript plugin, and the HttpClient is used from what I believe is called a LeafFactory. Hopefully this can help fill in the missing details:


public class SecurityPlugin extends Plugin implements ScriptPlugin, ActionPlugin {

    private static final Logger log = LogManager.getLogger(SecurityPlugin.class);

    public List<RestHandler> getRestHandlers(final Settings settings,
                                             final RestController restController,
                                             final ClusterSettings clusterSettings,
                                             final IndexScopedSettings indexScopedSettings,
                                             final SettingsFilter settingsFilter,
                                             final IndexNameExpressionResolver indexNameExpressionResolver,
                                             final Supplier<DiscoveryNodes> nodesInCluster) {
        return Collections.singletonList(new CacheRefreshRestAction());
    }

    @Override
    public ScriptEngine getScriptEngine(
            Settings settings,
            Collection<ScriptContext<?>> contexts
    ) {
        return new SecurityScriptEngine();
    }

    private static class SecurityScriptEngine implements ScriptEngine {

        private AuthorizationService authorizationService;

        SecurityScriptEngine() {
            authorizationService = AuthorizationServiceFactory.getInstance();
        }

        @Override
        public String getType() {
            return "hcp_security";
        }

        @Override
        public <T> T compile(
                String scriptName,
                String scriptSource,
                ScriptContext<T> context,
                Map<String, String> params
        ) {
            log.info("compile");
            if (!context.equals(FilterScript.CONTEXT)) {
                throw new IllegalArgumentException(getType()
                        + " scripts cannot be used for context ["
                        + context.name + "]");
            }
            // we use the script "source" as the script identifier
            if ("apply_permissions".equals(scriptSource)) {
                FilterScript.Factory factory = new SecurityPermissionsFactory(authorizationService);
                return context.factoryClazz.cast(factory);
            }
            throw new IllegalArgumentException("Unknown script name "
                    + scriptSource);
        }

        @Override
        public void close() {
            // optionally close resources
            log.info("close");
        }

        @Override
        public Set<ScriptContext<?>> getSupportedContexts() {
            Set<ScriptContext<?>> s = new HashSet<>();
            s.add(FilterScript.CONTEXT);
            return s;
        }

        private static class SecurityPermissionsFactory implements FilterScript.Factory,
                ScriptFactory {

            private AuthorizationService authorizationService;

            SecurityPermissionsFactory(AuthorizationService authorizationService) {
                this.authorizationService = authorizationService;
            }

            @Override
            public boolean isResultDeterministic() {
                // need to verify how this works
                return true;
            }

            @Override
            public FilterScript.LeafFactory newFactory(
                    Map<String, Object> params,
                    SearchLookup lookup
            ) {
                log.debug("newFactory");
                return new SecurityPermissionsLeafFactory(params, lookup, authorizationService);
            }
        }

        private static class SecurityPermissionsLeafFactory implements FilterScript.LeafFactory {
            private final Map<String, Object> params;
            private final SearchLookup lookup;
            private final String consumerName;
            private final List<String> roles;
            private final AuthorizationService authorizationService;

            @SuppressWarnings("unchecked")
            private SecurityPermissionsLeafFactory(
                    Map<String, Object> params, SearchLookup lookup, AuthorizationService authorizationService) {
                if (!params.containsKey("consumer_name")) {
                    throw new IllegalArgumentException(
                            "Missing parameter [consumerName]");
                }
                this.params = params;
                this.lookup = lookup;
                consumerName = params.get("consumer_name").toString();
                roles = params.get("roles") != null ? (List<String>) params.get("roles") : null;
                this.authorizationService = authorizationService;
            }

            @Override
            public FilterScript newInstance(DocReader docReader) throws IOException {

                log.debug("newInstance");

                /*
                 * Build a FilterScript whose execute() runs the document-level
                 * authorization check for each matching document.
                 */
                return new FilterScript(params, lookup, docReader) {

                    @Override
                    public boolean execute() {

                        log.debug("execute");

                        Map<String, ScriptDocValues<?>> docMap = getDoc();
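                        // The authorization check below is where the plugin's (potentially
                        // blocking) permission lookup happens, and it runs on the search
                        // thread for each candidate document.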
                        return authorizationService.isAuthorized(
                                consumerName,
                                roles,
                                getString(docMap, "security.securitySourceSystemCode"),
                                getStringList(docMap, "security.securityPermission.securityPermissionValue"),
                                getStringList(docMap, "security.securityPermissionAny.securityPermissionAnyValue")
                        );
                    }

                    private String getString(Map<String, ScriptDocValues<?>> docMap, String key) {
                        if (docMap.containsKey(key)) {
                            return ((ScriptDocValues.Strings) docMap.get(key)).getValue();
                        } else {
                            return null;
                        }
                    }

                    @SuppressWarnings("unchecked")
                    private List<String> getStringList(Map<String, ScriptDocValues<?>> docMap, String key) {
                        if (docMap.containsKey(key)) {
                            return ((List<String>) docMap.get(key));
                        } else {
                            return null;
                        }
                    }
                };
            }
        }
    }
}

You must not block any of Elasticsearch's threads, including (but not limited to) the search and indexing ones. The best approach is to go async (threads are not cheap and blocking them on network calls is very wasteful) but if you must block then do it on your own thread.

Got it, thank you so much for your time on this. Although it's not going to be an easy lift, it helps a ton to know that you have seen similar situations with custom plugins, so we know to home in on this in our troubleshooting. We had been grasping at straws looking for the root cause these past weeks.

We fixed this issue by adding aggressive timeouts and stale connection management to the aforementioned HTTP clients in the custom plugin. David's diagnosis - the plugin blocking Elasticsearch's search threads - does indeed appear to have been the root cause of the queue backlog.

Thanks again for all your help on this @DavidTurner. Marking this resolved.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.