We are having a really weird performance issue with our ES cluster, and I've run out of things to check so I thought I'd see if anyone has run into anything similar or has ideas for further debugging.
Our symptom is this: if we go a few seconds without querying an index, the first few queries after that pause are noticeably slower. Once we've warmed it back up, it stays fast as long as we keep querying it. The slow/fast behavior doesn't depend on the type of request - simple counts with match_all, more complicated searches, etc. all behave the same. Once we get it going fast with a simple count, we can switch to a more complicated search and it is still fast.
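For reference, this is roughly the loop I use to reproduce it (host and index name are placeholders for ours):

```python
import time
import requests

ES = "http://localhost:9200"   # querying the node from itself
INDEX = "my-index"             # placeholder for our real index name

# Fire the same trivial count repeatedly; after a pause the first few runs
# are slow, then they drop to a few milliseconds as long as we keep going.
for i in range(10):
    start = time.monotonic()
    requests.get(f"{ES}/{INDEX}/_count", json={"query": {"match_all": {}}})
    print(f"run {i}: {time.monotonic() - start:.3f}s wall-clock")
```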
It is somehow related to indexing new/updated docs. If we turn off the process that feeds this index, queries stay fast indefinitely; there are still some slower requests, but they never go above 1 second. If we let the feed run for hours so thousands of docs get indexed, the first query is incredibly slow (I've seen up to 1 minute 30 seconds for the slow request, then down to around 10 milliseconds afterwards). However, our index rate is fairly low - usually less than 10 docs/second.
One weird thing is that Elasticsearch thinks it is handling these queries quickly: when I set the search slow log threshold to 0ms, it reports the query taking 1ms or less, even though the request might take 10 seconds. In other words, the "took" time in the JSON response from ES might say 10 seconds, while Kibana and the slow query log say 1 ms.
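This is roughly how I set the slow log and compared the numbers (again, host and index name are placeholders):

```python
import time
import requests

ES = "http://localhost:9200"
INDEX = "my-index"   # placeholder

# Drop the search slowlog thresholds to 0ms so every query/fetch phase gets logged.
requests.put(
    f"{ES}/{INDEX}/_settings",
    json={
        "index.search.slowlog.threshold.query.warn": "0ms",
        "index.search.slowlog.threshold.fetch.warn": "0ms",
    },
)

# Compare wall-clock time against the "took" field ES reports in the response.
start = time.monotonic()
resp = requests.post(f"{ES}/{INDEX}/_search", json={"query": {"match_all": {}}})
wall_ms = (time.monotonic() - start) * 1000
print(f"wall-clock: {wall_ms:.0f} ms, ES took: {resp.json()['took']} ms")
```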
I've tried to isolate the problem as much as possible: I'm sending all requests from one of the ES nodes to itself, and setting the _only_local preference on the request to force it to use only the shard copies that node holds.
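That part looks something like this (a minimal sketch; the index name is a placeholder):

```python
import requests

ES = "http://localhost:9200"   # issued from the ES node to itself
INDEX = "my-index"             # placeholder

# preference=_only_local restricts the search to shard copies on the node
# receiving the request, so cross-node hops can't be what's slow.
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"preference": "_only_local"},
    json={"size": 0, "query": {"match_all": {}}},
)
print(resp.json()["took"], "ms according to ES")
```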
Our first thought was that this had to do with segment merging, or with segments falling out of the filesystem cache, but when we run vmstat/iostat we see I/O peaking in the low tens of MB per second on a gp2 SSD with 1200 provisioned IOPS. CPU utilization is also very low most of the time.
We've also used some Java profiling tools to see where ES is spending its time and to look for places it might be blocking, and haven't come up with much that's useful. The fact that we haven't seen anything sitting on a lock isn't super surprising given that so much of ES appears to be non-blocking, but our CPU flame graphs also give little evidence of what it is doing: we see a bunch of segment merge work (which happens outside of these requests, too) and some IndexStats requests in flight (I'm guessing from Kibana?), but nothing related to our in-flight search query.
Has anyone run into a similar situation, or debugged something like this before? From all appearances these queries should be fast, and ES is claiming they are, but in some cases they obviously are not. Any feedback would be really appreciated.
Some info about our setup:
ES Version: 7.5.2 - running on the official Docker image
Cluster setup: 3x AWS m5a instances, with gp2 EBS storage @ 1200 provisioned IOPS
Index size: 25m docs, 30 GB total size
Index setup: 3 shards, 2 replicas (roughly the settings sketched below)
Index rate: ~3/sec average once we've done our bulkload
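For completeness, the index is set up roughly like this (minimal sketch; the name is a placeholder):

```python
import requests

ES = "http://localhost:9200"
INDEX = "my-index"   # placeholder for our real index name

# 3 primary shards with 2 replicas each; across 3 nodes that means every
# node ends up holding a copy of every shard.
requests.put(
    f"{ES}/{INDEX}",
    json={"settings": {"number_of_shards": 3, "number_of_replicas": 2}},
)
```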