Search queue rejections climb while search queue stays empty?

Hello.

Posting again after a long hiatus because I recently came across a behavior that I didn't think was possible, and I'd like the elastic.co community to validate my thoughts.

ES version 2.0 prior to upgrade to 2.4.4 (yes, I know...)
OS: CentOS 6.9 (yes, I know...)
70 node cluster in AWS, r3.2xl (yes, I know...)

The situation is that I collect a variety of metrics via DataDog. One of them shows search active threads, which maxed out at 7 under load, as expected. The search queue size never grew, never even spiking above 0, while the reject count hit 800. All of this happened at the same time. It was a momentary spike, and in our environment we can accept the rejections since clients can (and will) retry. But if my understanding is right, the rejections should not have occurred.

So how could it happen that the search queue never grows, never accepts any queries, when it's sized at 1000 entries? Going straight from maxed-out active threads to search rejections, when we should have queued 1000 queries first, means I'm either missing something, something is broken, or my mental model is wrong. Which is it?
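As a cross-check outside of DataDog, here's a minimal sketch of polling the search thread pool counters straight from the nodes stats API (assuming an unauthenticated cluster reachable on localhost:9200; the URL and interval are placeholders, not our real setup):

```python
# Poll search thread pool stats directly from the nodes stats API so the
# numbers can be compared against what DataDog reports.
import time
import requests

STATS_URL = "http://localhost:9200/_nodes/stats/thread_pool"

def poll_search_pool(interval_s=1.0, iterations=30):
    for _ in range(iterations):
        nodes = requests.get(STATS_URL, timeout=5).json()["nodes"]
        for node in nodes.values():
            search = node["thread_pool"]["search"]
            print("{0}: active={1} queue={2} rejected={3}".format(
                node["name"], search["active"], search["queue"], search["rejected"]))
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_search_pool()
```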

Cheers!

How about this scenario: let's say I have 4 cores on my machines, which translates into 7 threads in the search thread pool. My search time is pretty short, around 10ms. I get hit by 180 search requests at the same time, and each request hits 10 shards on my node; in other words, my node gets 1800 shard requests at the same time. The first 7 shard requests each get a thread, the next 1000 go into the queue, and the remaining 793 get rejected and increase the reject count. 10ms later the first 7 shard requests are fulfilled and we start chewing our way through the queue. Assuming 10ms per shard request, 7 requests at a time, roughly 1.5 seconds later we're done with the queue and the queue size is back to 0.

So now the question is: how often does DataDog collect stats from your cluster? Every 5 seconds? 10 seconds? Is it possible that it just missed the window when the queue was full? That's one possible explanation; another is that some stats collection process is broken.
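To make the arithmetic explicit, a quick sketch of those numbers (all values are the assumed ones from the scenario above, not measurements from the cluster):

```python
# Back-of-the-envelope numbers for the scenario: 4 cores, 7 search threads,
# queue capacity 1000, 10ms per shard request, 1800 shard requests at once.
shard_requests = 180 * 10   # 180 searches fanning out to 10 shards each
threads = 7                 # search pool for 4 cores: int((4 * 3) / 2) + 1
queue_capacity = 1000
per_request_ms = 10

running = min(shard_requests, threads)                  # 7 start immediately
queued = min(shard_requests - running, queue_capacity)  # 1000 sit in the queue
rejected = shard_requests - running - queued            # 793 get rejected

# Draining the queue: 7 shard requests finish roughly every 10ms.
drain_ms = queued / threads * per_request_ms            # ~1430 ms

print(rejected, round(drain_ms))                        # 793 1429
```

So the whole burst, rejections and full queue included, is over in well under two seconds, which a 5- or 10-second collection interval can easily miss.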
