The same query is really slow 20% of the time

When running some of our msearch queries, the response time is usually around 500ms. However, about 20% of the time the response time skyrockets to anywhere between 2,000ms and 13,000ms.

The index has about 120GB of data, 20 shards, and 1 replica, spread over 8 data nodes.

I've tried looking at Profile API results, but I couldn't really decipher what's wrong. The output is also extremely long since it covers all 20 shards.

I've tried simplifying my query down to just a term filter on one field, but the slow responses still occasionally appear.
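For reference, the simplified query was essentially just a bare term filter, along these lines (the field name and value here are placeholders, not our real mapping):

```json
{
  "query": {
    "bool": {
      "filter": {
        "term": { "some_field": "some_value" }
      }
    }
  }
}
```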

This is a prod cluster, so I also tried hitting a dev one, and the results there were always fast. The difference could be the extra traffic, and we're going to try to simulate that on dev.

Has anyone experienced something similar, or have any ideas on what we can try?
Thank you.

Is it possible that the quick responses are benefiting from caching? Is it an identical query that's hit often? The slow result may be the non-cached version.

Unfortunately no. The cached responses are really fast, less than 100ms.
~500ms is what I get if I wait a while and resend the query.

The profiler is what I'd suggest. Have you tried the profiler UI in Kibana? That might help you pinpoint the issue. If that doesn't help, I suggest sharing the profiler output from both a slow and a fast query here so we can help you compare them.

Our Kibana is an older version and does not have the profiler UI.

Here are the profiles for the slow query and the fast query.

I tried parsing through them, and I found that one particular shard [-UCRYIhBQrqHKGSjl8yUvA][indexName][1] has a query time of 25.5ms while the others are around 1 to 2ms.
This shard was also slow in one other profile result.

What I don't understand is how the final response time ('took') is derived from these profiled times. I assume the overall time for each search is query time + rewrite time + collector time, but adding all of these up doesn't come close to the magnitude of time it actually takes. I also assume the operations are parallelized across nodes, so I'm not sure how to get from the individual shard times to the final 'took' time.
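For what it's worth, here's a minimal sketch of the summation I was attempting. It assumes the newer profile response shape where each query and collector node carries a `time_in_nanos` field; 2.4 reports times differently, so the field names would need adjusting there. The sample data is hand-made, just mimicking the shape, with shard 1 slow as in our real output.

```python
def shard_totals(profile):
    """Return {shard_id: total_seconds} of query + rewrite + collector time."""
    totals = {}
    for shard in profile["shards"]:
        nanos = 0
        for search in shard["searches"]:
            nanos += sum(q["time_in_nanos"] for q in search["query"])
            nanos += search["rewrite_time"]
            nanos += sum(c["time_in_nanos"] for c in search["collector"])
        totals[shard["id"]] = nanos / 1e9
    return totals

# Tiny hand-made sample in the profile's shape (values invented):
sample = {"shards": [
    {"id": "[node][indexName][0]", "searches": [
        {"query": [{"time_in_nanos": 1_500_000}], "rewrite_time": 10_000,
         "collector": [{"time_in_nanos": 200_000}]}]},
    {"id": "[node][indexName][1]", "searches": [
        {"query": [{"time_in_nanos": 25_500_000}], "rewrite_time": 12_000,
         "collector": [{"time_in_nanos": 300_000}]}]},
]}

totals = shard_totals(sample)
slowest = max(totals, key=totals.get)
print(slowest, round(totals[slowest] * 1000, 1), "ms")
```

Even on our real profiles, these per-shard sums stay in the tens of milliseconds, nowhere near the multi-second 'took' values.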

Thanks for any help!

Thanks @rickbenchsci. The took time is actual elapsed time on the coordinating node, which is greater than the time spent performing the search because it includes things like the time spent waiting in a queue or moving data across the network that are omitted from the profile. I can't see anything in those profiles that would indicate a slowdown of the size you're observing, which suggests the time is lost in one of these places.

Do your nodes all look to be healthy? Is one of them doing more GC than the others? If you look at the nodes stats does anything stand out as unusual? You might need to take pairs of nodes stats and subtract one from the other since many of these stats are cumulative. Do the slowdowns correspond with flushes or refreshes or anything like that?
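To make the subtraction concrete, here's a rough sketch. The dicts mimic the `jvm` → `gc` → `collectors` section of a nodes stats response; the field names match the API, but the snapshot values are invented:

```python
def gc_delta(before, after):
    """Per-collector GC activity accumulated between two nodes-stats
    snapshots, since collection_count and collection_time_in_millis are
    cumulative from node startup."""
    deltas = {}
    for name, stats in after.items():
        deltas[name] = {
            "collections": stats["collection_count"]
                           - before[name]["collection_count"],
            "millis": stats["collection_time_in_millis"]
                      - before[name]["collection_time_in_millis"],
        }
    return deltas

# Two snapshots of one node's collectors, taken a little while apart:
snap_t0 = {"young": {"collection_count": 120_000,
                     "collection_time_in_millis": 5_449_196},
           "old":   {"collection_count": 40,
                     "collection_time_in_millis": 9_000}}
snap_t1 = {"young": {"collection_count": 120_030,
                     "collection_time_in_millis": 5_452_196},
           "old":   {"collection_count": 40,
                     "collection_time_in_millis": 9_000}}

print(gc_delta(snap_t0, snap_t1))
# young: 30 collections and 3000ms of pause in the interval; old: unchanged
```

A spike in the per-interval numbers around a slow query is much more telling than the raw cumulative totals.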

Thank you David.

I ran the node stats and here's the content of one node:

I don't know if I'm reading this right, but the collection_time_in_millis for the young GC collector is a whopping 5449196, or about 90 minutes. That doesn't seem right to me.
Looking across all the nodes, the idle master nodes are at around 3k, the active master is at around 100k,
and all the data nodes are in the 4000k to 7000k range.
Is this the main cause of the problem?

I'm also looking at Marvel visualizations, and on a shorter time scale I can see that CPU utilization occasionally reaches 100%, while heap can sometimes reach as high as 80%.
I'm not sure if that's related to GC events. I've enabled GC logging and the slowlog, hoping to see what's happening.
On a side note, Marvel visualizations always show averages, which hides a lot of the extreme spikes we want to look at. Is there any way I can view the raw values?

The node in question has an uptime of about 3 weeks, so 90 cumulative minutes of GC doesn't seem particularly wrong to me.
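As a rough sanity check on the arithmetic (taking the 3 weeks as exactly 21 days):

```python
# Cumulative young-GC time from the node stats, as a share of total uptime.
gc_millis = 5_449_196                     # collection_time_in_millis
uptime_millis = 21 * 24 * 60 * 60 * 1000  # ~3 weeks of uptime

print(round(gc_millis / 60_000, 1), "minutes of GC")
print(round(100 * gc_millis / uptime_millis, 2), "% of uptime")
```

That works out to roughly 0.3% of wall-clock time spent in young GC, which is unremarkable.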

We stopped calling it Marvel quite a long time ago. What version are you on? It might be possible to visualise the max instead of the mean, but those visualisations are not my area of expertise.

Kibana is on 4.5.4 and our Elasticsearch is still on 2.4.
Since we have isolated the problem to something within the node/cluster, I plan to hit the cluster with queries throughout the day and log the node stats so we can compare and see what happens.
Do you have any suggestions on what we can do?
Thank you for all your help

Capturing the node stats over time sounds like a reasonable idea. Hopefully it'll show something happening during the slow queries. You could also try looking at hot threads during a slow query. Set ?threads=99999 to see all threads, in case whatever is slowing down the query isn't CPU-intensive.

2.4 is really rather old - old enough that I don't have a development environment for it available. This means there's only so much investigation I can do; can you reproduce this in a newer version?

Trying it in a new version requires migrating everything over. It's in the plan but we can't do it right now.
We will try looking at the hot threads.
Thank you