The same query is really slow 20% of the time

rickbenchsci · September 6, 2019, 8:35pm

When running some of our msearch queries, the response time is usually around 500ms. However about 20% of the time the response time skyrockets to anywhere between 2000ms to 13000ms.

The index has about 120GB data, 20 shards, 1 replica, spread over 8 data nodes.

I've tried looking at Profile API results but I couldn't really decipher what's wrong. It's also extremely long for being over 20 shards.

I've tried simplifying my query to just a term filter on 1 field, but the high results still occasionally appear.

The cluster is a prod cluster, so I've tried to hit a dev one and the results were always really fast. Could be the extra traffic and we are going to try to simulate that on dev.

Has anyone experienced similar things or have any idea on what we can try?
Thank you.

Doc_Kaos · September 6, 2019, 8:45pm

Is it possible that the quick responses are benefiting from caching? Is it an identical query hit often? The long result may be the non-cached version.

rickbenchsci · September 6, 2019, 8:56pm

unfortunately no. The cached responses are really fast. less than 100ms.
~500ms is what i get if i wait a while and resend the query.

DavidTurner · September 7, 2019, 9:25am

The profiler is what I'd suggest. Have you tried the profiler UI in Kibana? That might help you pinpoint the issue. If that doesn't help, I suggest sharing the profiler output from both a slow and a fast query here so we can help you compare them.

rickbenchsci · September 9, 2019, 3:59pm

Our Kibana is an older version and does not have the profiler UI.

Here's the slow query and fast query
https://gist.githubusercontent.com/rickbenchsci/4883826788f624c97674a1fe610aeb3d/raw/255749fb11b6fe92b967391566a3d9149c55ea12/slowQuery
https://gist.githubusercontent.com/rickbenchsci/4883826788f624c97674a1fe610aeb3d/raw/255749fb11b6fe92b967391566a3d9149c55ea12/fastQuery

I tried parsing through them, and I found that one particular shard [-UCRYIhBQrqHKGSjl8yUvA][indexName][1] has a query time of 25.5ms while the others are around 1 to 2ms.
This shard was also slow in one other profile result.

What I don't know is how does the final response time 'took' come from these profiled times? I assume the overall time for each search is query time + rewrite time + collector time, however adding up all of these doesn't reach the same magnitude of time it actually takes. Also, I assume the operations are parallelized between nodes. So I'm not sure how to get the final 'took' time from the different shard times.

Thanks for any help!

DavidTurner · September 9, 2019, 10:08pm

Thanks @rickbenchsci. The took time is actual elapsed time on the coordinating node, which is greater than the time spent performing the search because it includes things like the time spent waiting in a queue or moving data across the network that are omitted from the profile. I can't see anything in those profiles that would indicate a slowdown of the size you're observing, which suggests the time is lost in one of these places. Do your nodes all look to be healthy? Is one of them doing more GC than the others? If you look at the nodes stats does anything stand out as unusual? You might need to take pairs of nodes stats and subtract one from the other since many of these stats are cumulative. Do the slowdowns correspond with flushes or refreshes or anything like that?

rickbenchsci · September 10, 2019, 8:26pm

Thank you David.

I ran the node stats and here's the content of one node:
https://gist.githubusercontent.com/rickbenchsci/3ecb3c448c9c42a9a6ef3e487870b927/raw/b7f57084ba92f563a812a34b382a763e79bd7fe9/gistfile1.txt

I don't know if i'm reading this right, but the collection_time_in_millis for gc young collectors is a whopping 5449196 or 90 minutes. That doesn't seem right to me..
looking at all the nodes, the idle master nodes are at 3k, main one is at 100k,
all the data nodes are about 4000k to 7000k range
is this the main cause of the problem?

I'm also looking at marvel visualizations and on a shorter time scale, I can see that the cpu utilization is reaching 100% occasionally, while heap can sometimes reach as high as 80%.
Not sure if it's related to gc event. I've enabled gc logging and slowlog hoping to see whats happening.
On a side note, marvel visualizations are always an average which hides a lot of the spikes of extreme case that we want to look at. is there anyway i can view them as they are?

DavidTurner · September 11, 2019, 6:59am

The node in question has an uptime of about 3 weeks, so 90 cumulative minutes of GC doesn't seem particularly wrong to me.

We stopped calling it Marvel quite a long time ago. What version are you on? It might be possible to visualise the max instead of the mean, but those visualisations are not my area of expertise.

rickbenchsci · September 11, 2019, 2:41pm

Kibana is on 4.5.4 and our els is still 2.4
Since we have isolated the problem to something within the node/cluster, I plan to hit the cluster with queries throughout the day and log the node stats so we can compare and see what happens.
Do you have any suggestions on what we can do?
Thank you for all your help

DavidTurner · September 15, 2019, 7:44am

Capturing the node stats over time sounds like a reasonable idea. Hopefully it'll show something happening during the slow queries. You could also try looking at hot threads during a slow query. Set ?threads=99999 to see all threads, in case whatever is slowing down the query isn't CPU-intensive.

2.4 is really rather old - old enough that I don't have a development environment for it available. This means there's only so much investigation I can do; can you reproduce this in a newer version?

rickbenchsci · September 16, 2019, 2:26pm

Trying it in a new version requires migrating everything over. It's in the plan but we can't do it right now.
We will try looking at the hot threads.
Thank you

system · October 14, 2019, 2:26pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Kibana search profiler accuracy Elasticsearch	4	287	November 17, 2021
Slow indexes search Elasticsearch	28	865	July 14, 2019
Slow search query issue [5.5] Elasticsearch	6	675	September 22, 2017
Kibana dashboard slowelness Elasticsearch	2	321	February 25, 2019
Slow response time even though slow logs show fast (query+fetch) Elasticsearch	1	199	April 23, 2022

The same query is really slow 20% of the time

Related topics