I have intermittent but reproducible slow requests when I get a record via /index/type/id (the most basic request possible for Elasticsearch, with no body): instead of 1-2ms it sometimes takes up to 220ms. Over many tests I noticed the following behavior:
- A larger response by ES increases the chances of the response being slow
- I can consistently reproduce slow requests for some larger objects, even if I request the same object over and over again. For example, when I request one specific object 3000x in a loop over an existing connection, about 0.2%-0.8% of the requests will be slow. So at least 99% return in under 2ms, but up to 0.8% need between 40ms and 220ms. If the response is small enough, there are no "reproducible" slow requests.
- Using _mget instead of a simple GET increases the chances of a slow response by a large factor (at least 4-8x), meaning multiple GET API requests are much less likely to be exceptionally slow than one _mget request for the same data (in my case for four IDs)
- There is no apparent connection to CPU load, JVM memory usage or anything else I was able to monitor. It still occurs even when absolutely nothing is going on, nothing is being indexed, and no other requests are being handled.
- There is even the opposite effect: If I bombard ES with GET API requests, the likelihood of slow requests goes down by a lot (at least 90%). But this effect only holds while issuing one GET request after another, not if there are only "some" requests. It takes a huge pile of requests (around 600 requests/second) to avoid slow responses. This is of course not a useful long-term solution for speeding up ES.
- My servers are new & powerful, my index is quite small (only about 200MB with 150'000 records), ES has 5GB of RAM and is running on cutting edge Intel SSDs.
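A minimal sketch of the kind of measurement loop described in the list above, in Python with only the standard library. The helper names (`make_get_fetch`, `time_requests`, `summarize`) are hypothetical, and the 40ms threshold is simply the lower bound of the slow range mentioned above; this is not the exact test harness behind the numbers in this post.

```python
import http.client
import time
from typing import Callable, Dict, List


def make_get_fetch(host: str, port: int, path: str) -> Callable[[], bytes]:
    """Return a callable that GETs `path` over one reused (keep-alive) connection."""
    # http.client only opens the socket on the first request.
    conn = http.client.HTTPConnection(host, port, timeout=5)

    def fetch() -> bytes:
        conn.request("GET", path)
        # Read the body fully so the same connection can be reused.
        return conn.getresponse().read()

    return fetch


def time_requests(fetch: Callable[[], object], n: int) -> List[float]:
    """Call fetch() n times and return each call's wall-clock duration in ms."""
    durations_ms = []
    for _ in range(n):
        start = time.perf_counter()
        fetch()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def summarize(durations_ms: List[float], slow_ms: float = 40.0) -> Dict[str, float]:
    """Count how many requests took at least `slow_ms` milliseconds."""
    slow = [d for d in durations_ms if d >= slow_ms]
    total = len(durations_ms)
    return {
        "count": total,
        "slow_count": len(slow),
        "slow_percent": 100.0 * len(slow) / total if total else 0.0,
        "max_ms": max(durations_ms) if durations_ms else 0.0,
    }
```

Assuming a node listening on localhost:9200 (an assumption, adjust as needed), a run would look like `summarize(time_requests(make_get_fetch("localhost", 9200, "/index/type/id"), 3000))`.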
Some more information is available at http://stackoverflow.com/questions/36272601/find-causes-for-slower-elasticsearch-responses and https://github.com/elastic/elasticsearch/issues/17451, but I tried to summarize the most important details in the list above.
I hope somebody has had similar experiences or knows something more I can try out, or some way to test what the root cause is. One of the few other possible causes I could think of is that the whole SSD is encrypted and mounted by Truecrypt. But no other application has this kind of problem, and there is a huge amount of free RAM on the server to cache file system access (and the SSD is on a solid hardware RAID with BBU).
Update: When running a simple search for a keyword in a loop 1000 times, the same "slow" responses occur with only a slightly higher probability, but Elasticsearch tells me it "took" 2-3ms, even when it actually takes 100ms or 200ms until the response reaches my application. I have considered that it might be a network "problem", but that would be a strange root cause, because my setup is so standard. Also, keep-alive connections and TCP_NODELAY do nothing to improve this problem.
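To make the gap between the reported "took" and the observed time concrete, here is a rough Python sketch. It assumes each sample is a pair of the wall-clock time measured in the application (in ms) and the raw JSON body of a search response, which carries a top-level `took` field in milliseconds. The function names and the 40ms cut-off are hypothetical.

```python
import json
from typing import List, Tuple


def took_gap_ms(wall_clock_ms: float, response_body: bytes) -> float:
    """Time not accounted for by Elasticsearch's own "took" value, in ms."""
    took_ms = json.loads(response_body)["took"]
    return wall_clock_ms - took_ms


def suspicious_samples(
    samples: List[Tuple[float, bytes]], min_gap_ms: float = 40.0
) -> List[Tuple[float, float]]:
    """Return (wall_clock_ms, gap_ms) for samples whose gap is at least min_gap_ms."""
    result = []
    for wall_clock_ms, body in samples:
        gap = took_gap_ms(wall_clock_ms, body)
        if gap >= min_gap_ms:
            result.append((wall_clock_ms, gap))
    return result
```

A large gap on a sample means the time was spent somewhere other than in the part of the search that "took" measures, for example in response serialization, the HTTP layer, the network, or the client.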
So, any ideas what I could try next?