Slow first request on an index after a short amount of time

We are having a really weird performance issue with our ES cluster, and I've run out of things to check so I thought I'd see if anyone has run into anything similar or has ideas for further debugging.

Our symptom is this: If we go a few seconds without querying an index, it immediately starts to get slower to query for the first few queries. After we get it to the point where it goes fast, it is fast as long as we keep querying it. Also, the slow/fast response time doesn't matter based on the request - simple counts with match_all, more complicated searches, etc - they all behave the same. Once we get it going fast on a simple count, we can switch to a more complicated search and it is still fast.

It is somehow related to indexing new/updated docs. If we turn off our process that is feeding this, it stays fast indefinitely, although there are still some slower requests, they never go above 1 second. If we wait hours for thousands of docs to be indexed, it is incredibly slow (I've experienced up to 1 minute 30 seconds for the slow request, then down to like 10 milliseconds after). However, our index rate is fairly low - usually less than 10/second.

One weird thing is that Elastic thinks it is doing these queries quickly - when I set the slow query log to 0ms, I see it reporting that it takes 1ms or less, even though the request might take 10 seconds. The "took" time in the JSON response from ES might say 10 seconds, but Kibana and the slow query log say 1 ms.

I've tried to isolate the problem as much as possible - I'm doing all requests from one of the ES nodes to itself, and I set the _only_local flag on the request to force it to only use the data it has for the request.

Our first thought was that this had to do with segment merging or something causing segments to go out of the fs cache, but when we run vmstat/iostat, we see max i/o in the low 10s of MB per second on a gp2 ssd with 1200 IOPS provisioned. Most of the time cpu utilization is very low, as well.

We've also used some Java profiling tools to see where ES is spending its time and also look for places it might be blocking and haven't come up with much useful. The fact that we haven't seen anything sitting on a lock isn't super surprising given that so much of ES appears to be non-blocking, but our CPU flame graphs also provide little evidence as to what it is doing - we see a bunch of segment merge work going on (which happens outside of these requests, too) and some IndexStats requests in-flight (I'm guessing from Kibana?), but nothing related to our in-flight search query.

Has anyone run into a similar situation, or debugged stuff like this before? From all appearances, these queries should be fast and ES is claiming they are, but they obviously are not in some cases, and any feedback would be really appreciated.

Some info about our setup:

ES Version: 7.5.2 - running on the official Docker image
Cluster setup: 3x AWS m5a instances, with gp2 io @ 1200 provisioned IOPS
Index size: 25m docs, 30 GB total size
Index setup: 3 shards, 2 replicas
Index rate: ~3/sec average once we've done our bulkload

If we go a few seconds without querying an index

Can you quantify how many seconds you mean here? "A few" might mean 1-2 seconds or 30-60 seconds, and the difference is important.

Apologies for the ambiguity. A "fast" request looks to be about 7 ms to get the response back (same machine as the ES node I'm hitting, just using curl). 1 or 2 seconds and it stays the same. At 5-10 seconds, it slows down to 40 - 50ms, at 30 seconds it is at 600 - 800 ms, and at 1 minute it is 1 - 1.5 seconds.

Hmm, a proportionate slowdown isn't what I expected, but perhaps there's more than one effect here.

The main contributor will be, I think, the interaction of index.refresh_interval and index.search.idle.after. If you do not search a shard for 30 seconds (by default) then it will stop refreshing to save resources. The next search that you do will trigger a refresh within the next second (by default) and will wait for that to complete first.

Perhaps the best way to test whether this is the problem is to increase index.search.idle.after to a longer time, say 10 minutes, and see whether this speeds things up for searches spaced 30-60 seconds apart. Could you try that and see if it helps?

Ah, I think this is it, thank you! This explains a lot. The "took" time in the JSON response appears to take the waiting for refresh into account, whereas the slow query log and whatever Kibana is looking at is looking purely at search time, which explains the disparity I was seeing. It also explains why there appears to be a lower bound of around 1 second, but it goes up as time goes on (more data to refresh).

Setting index.search.idle.after to several minutes, does indeed fix it. Interestingly, so does setting index.refresh_interval to something greater than 1s. The documentation states:

Searches that hit an idle shard where a refresh is pending will wait for the next background refresh (within 1s ).

Does this mean that if index.refresh_interval is > 1 second, it won't wait around for it to refresh before searching, so you won't wait around for the refresh, but you'll potentially get stale results?

There is still a small bump in response time from single-digit milliseconds to around 50 ms after 5 - 10 seconds, but I can think of a handful of reasons for this and it isn't a big deal compared to the jump to 1 second or more, so we won't worry about it.

I don't have the code in front of me, but IIRC the search idle behaviour only applies if the refresh interval is not set.