How to Diagnose Search Queue Growth

Hello, I'm trying to diagnose an issue where our search queue fills up at seemingly random times.

The behavior we observe in our monitoring is that the search queue grows on one node of our cluster (just one), and once the search thread pool is used up we of course start getting timeouts. There seems to be one query that is blocking the whole thing. The only way for us to resolve the problem at the moment is to restart the node.

You can see the relevant behavior in the charts below: first the queue size, then the pending cluster tasks (to show that no other operations, e.g. index operations, are blocking or queuing up), and finally the active threads for the search thread pool. The spike at 11 o'clock is the restart of the node.

The log files on all nodes show no entries in the hour before and after the issue began, up to the node restart, other than garbage collection events of around 200-600 ms. Only one of these occurred on the affected node, around 20 minutes before the event.

My questions:

  • How can I debug this, given that nothing is logged anywhere about a failing or timed-out query?
  • What are possible reasons for this? We don't have dynamic queries or anything similar.
  • Can I set a query timeout, or clear / reset active searches when this happens, to avoid a node restart?
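
To make the first question more concrete, here is a minimal sketch of the kind of check I have in mind. It assumes the cluster is Elasticsearch and that the version in use exposes the task management API (`GET /_tasks`) with `running_time_in_nanos` per task; the address `http://localhost:9200`, the 30-second threshold, and the function names are all hypothetical. It lists search tasks that have been running unusually long, which might point at the blocking query.

```python
import json
import urllib.request

# Assumption: an Elasticsearch node reachable at this hypothetical address.
ES_URL = "http://localhost:9200"


def fetch_search_tasks(base_url=ES_URL):
    """Fetch currently running search tasks from the task management API."""
    url = base_url + "/_tasks?actions=*search*&detailed=true"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def long_running_tasks(tasks_response, min_seconds=30.0):
    """Return (node, task_id, seconds, description) for tasks older than min_seconds."""
    found = []
    for node_id, node in tasks_response.get("nodes", {}).items():
        for task_id, task in node.get("tasks", {}).items():
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if seconds >= min_seconds:
                found.append((node.get("name", node_id), task_id, seconds,
                              task.get("description", "")))
    # Longest-running first, since those are the likeliest blockers.
    return sorted(found, key=lambda t: t[2], reverse=True)
```

Calling `long_running_tasks(fetch_search_tasks())` while the queue is growing should show which searches are stuck and on which node. If the tasks are reported as cancellable, `POST /_tasks/<task_id>/_cancel` might be an alternative to restarting the node, but I have not verified that on our version.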