GET API requests sometimes slow (up to 200ms instead of 2ms)

I have intermittent but reproducible slow requests when I fetch a record via GET /index/type/id (the most basic request possible for Elasticsearch, with no body) - instead of 1-2ms it sometimes takes up to 220ms. Over many tests I noticed the following behavior:

  • A larger response by ES increases the chances of the response being slow
  • I can consistently reproduce slow requests for some larger objects, even if I request the same object over and over again. For example, when I request one specific object 3000x in a loop over an existing connection, about 0.2%-0.8% of the requests will be slow (see the timing-loop sketch after this list). So at least 99% return in under 2ms, but up to 0.8% need between 40ms and 220ms. If the response is small enough, there are no "reproducible" slow requests.
  • Using _mget instead of a simple GET increases the chances of a slow response by a large factor (at least 4-8x), meaning multiple GET API requests are much less likely to be exceptionally slow than one _mget request for the same data (in my case for four IDs)
  • There is no apparent connection to CPU load, JVM memory usage or anything else I was able to monitor. It still occurs even when absolutely nothing is going on, nothing is being indexed, and no other requests are being handled.
  • There is even the opposite effect: if I bombard ES with GET API requests, the likelihood of slow requests goes down by a lot (at least 90%). But this effect only holds while issuing one GET request after another, not if there are only "some" requests - it takes a huge pile of requests (something like 600 requests/second) to avoid slow responses. This of course is not a helpful long-term way to speed up ES.
  • My servers are new & powerful, my index is quite small (only about 200MB with 150'000 records), ES has 5GB of RAM and is running on cutting edge Intel SSDs.
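
For reference, this is roughly what the test loop looks like - a minimal sketch using curl instead of my actual PHP CLI script (host, index, type and id are placeholders, and curl opens a new connection per request, unlike the persistent connection in the real test):

#!/bin/bash
# Minimal timing loop: request the same document 3000 times and show the
# slowest wall-clock times as measured by the client.
# Host, index, type and id below are placeholders for my real setup.
URL="http://10.0.0.2:9200/myindex/mytype/1234"

for i in $(seq 1 3000); do
  # %{time_total}: total wall-clock time curl needed for this request (seconds)
  curl -s -o /dev/null -w '%{time_total}\n' "$URL"
done | sort -n | tail -20   # print the 20 slowest requests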

More details are available at http://stackoverflow.com/questions/36272601/find-causes-for-slower-elasticsearch-responses and https://github.com/elastic/elasticsearch/issues/17451, but I have tried to summarize the most important points in the list above.

I hope somebody has had similar experiences or knows something more I can try - or some way to test what the root cause is. One of the few things I thought could be another reason is that the whole SSD is encrypted and mounted by Truecrypt - but no other application has these kinds of problems, and there is a huge amount of free RAM on the server to cache file system access (and the SSD is on a solid hardware RAID with BBU).

Update: When running a simple search for a keyword in a loop 1000 times, the same "slow" responses occur with an only slightly higher probability, but Elasticsearch tells me it "took" 2-3ms, even when it actually takes 100ms or 200ms until the response reaches my application. I have considered that it might be a network "problem", but that would be a strange root cause, because my setup is so standard. Also, keep-alive connections and TCP_NODELAY do nothing to improve the situation.
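
To illustrate the discrepancy, here is a hedged sketch of how the server-side "took" value can be compared with the time the client actually waits (index name and keyword are placeholders, jq is assumed to be installed, and the per-process curl overhead adds a few ms of noise):

#!/bin/bash
# Compare what ES reports ("took", in ms) with the wall-clock time the client
# measures, and print only the outliers.
URL="http://10.0.0.2:9200/myindex/_search?q=keyword"   # placeholder URL

for i in $(seq 1 1000); do
  start=$(date +%s%N)                    # nanoseconds before the request
  took=$(curl -s "$URL" | jq '.took')    # ES-reported search time in ms
  end=$(date +%s%N)
  wall=$(( (end - start) / 1000000 ))    # client-side wall-clock time in ms
  if [ "$wall" -gt $(( took + 20 )) ]; then
    echo "took=${took}ms wall=${wall}ms"
  fi
done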

So, any ideas what I could try next?

Elasticsearch tells me it "took" 2-3ms, even when it actually takes 100ms or 200ms until the response reaches my application

Does that ^ apply to GET /index/type/id as well?

Responses to GET API requests do not contain any information on how long Elasticsearch needed, because no search is involved - the "took" information is only included when ES actually has to search through something.

But I suspect it is the same - the results are consistent when I run search requests: ES never says it needs more than 3ms, although some requests take 30ms and some 200ms, while 99% actually take 3ms as advertised. But I have no idea how to find out where the additional delay happens - within ES, on the network, or somewhere else.
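
For anyone following along, this is the difference being discussed (index, type, id and keyword are placeholders):

# Document GET: the response contains only the document metadata and _source,
# so there is no server-side "took" value to compare against.
curl -s 'http://localhost:9200/myindex/mytype/1234?pretty'

# Search: the response starts with a "took" field (server-side time in ms),
# which in my tests never goes above ~3ms even when the client waits 200ms.
curl -s 'http://localhost:9200/myindex/_search?q=keyword&pretty'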

Can you test this outside your application? Like running the requests on the node itself using curl.

Great idea! I wrote a PHP command line script to issue the same requests over and over, and the delays almost disappear when I run it on the node itself - the maximum time measured goes down to about 19ms, usually only one in 3000 requests has a discrepancy (curl takes 15ms, but ES tells me it took 4ms), and only about 1-3 requests out of 3000 are above 10ms (but for those Elasticsearch says it actually used that time itself), which is much nicer.

I then investigated my servers further - and I noticed that on my "older" servers (about 8 years old, while the others are 2 years old) there were actually no additional slow requests. The results were exactly like running the requests on the node itself, never above 20ms.

On my new web servers I tried doing the requests when nothing else was going on (stopped all request processing) and I then got fewer slow requests - maybe one in 10'000 was above 100ms. As soon as I start using the server to process web traffic, it gets a lot slower: more requests above 100ms, and also a lot more above 10ms, although the server is hardly breaking a sweat (1-2% CPU load) and it is not as if I am processing 10'000 requests per second - more like 10-100 per second.

I reran these tests on all servers - my 4 new servers, which are more or less identical, all show the same (bad) behavior. Requests to themselves are as fast as expected, but requests to another server produce some slow responses. On my 3 "older" servers (2 are identical, 1 is completely different) there are no slow responses - I ran more than 200'000 requests without a single one above 20ms.

So, what are the new servers doing wrong? It seems unlikely that 4 servers have the same hardware problem and behave in exactly the same way. The servers are connected by a Gbit switch with nothing else between them, so a problem in the network itself seems unlikely.

I started comparing the ipv4 sysctl settings, but there are no differences in the ipv4 config except net.ipv4.tcp_mem (which is expected to differ on a server with 8x as much RAM) and differences in tcp_max_orphans, tcp_max_syn_backlog and tcp_max_tw_buckets, none of which would really explain the behavior.
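
In case it helps anyone reproduce the comparison, this is roughly how the settings can be diffed ("oldserver" and "newserver" are placeholders for the actual hostnames):

# Dump the ipv4-related sysctl settings on both machines and diff them.
ssh oldserver "sysctl -a 2>/dev/null | grep '^net\.ipv4'" | sort > /tmp/old.txt
ssh newserver "sysctl -a 2>/dev/null | grep '^net\.ipv4'" | sort > /tmp/new.txt
diff /tmp/old.txt /tmp/new.txt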

I am using Debian Jessie on all servers with default settings - the only thing I have increased is the file limits. Any more ideas what this could be?

I checked the drivers and network cards in the servers - the old servers use the tg3 driver, and the new servers (SuperMicro boards) use the e1000e (Intel) driver. The Intel cards should be far better, so ... strange.
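
For reference, this is how the driver and current offload settings can be checked (eth1 is a placeholder for the actual interface name):

# Show which kernel driver and firmware the interface uses
# (tg3 on the old servers, e1000e on the new SuperMicro boards).
ethtool -i eth1

# Show the current offload features, which can later be toggled with ethtool -K.
ethtool -k eth1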

I had this issue when I first started using ES. It turned out I was running it without the "-server" parameter, which caused initial requests to "wake up" Elasticsearch and thus take longer. Subsequent requests were faster. After a while ES would go back into "sleep mode", and another request would again take more time in order to "wake it up".

Does this apply to you? It's easy to test: send a number of requests in a row and check whether only the first one is delayed.
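
A quick way to check this is to look at the command line of the running JVM (this sketch assumes Elasticsearch is the only java process on the machine):

# Print the full java command line of the Elasticsearch process and check
# whether "-server" appears among the JVM options.
ps -ww -o args= -C java | tr ' ' '\n' | grep -x -- '-server'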

I finally solved it - it was the network card, probably the hardware implementation of "offloading", meaning that parts of the network processing are done by the network card itself instead of by the CPU. When I ran

ethtool -K eth1 tx off rx off sg off tso off ufo off gso off gro off lro off rxvlan off txvlan off rxhash off

the problems disappeared. So this seems to be a problem with the Intel network hardware (Intel® 82574L Dual port GbE LAN on a SuperMicro X9SRL-F running the e1000e driver), while my older servers with the tg3 driver (offloading enabled) work just fine without any "tweaking". I am still surprised that professional network hardware by Intel has such serious problems, because if I am hitting them then probably thousands of others are too without knowing it (it took me 2 years to notice, although I am a performance fanatic - you really have to analyze your server stack quite a bit to notice responses that are up to 200ms slower in 1% of requests).
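
Note that ethtool -K settings do not survive a reboot; on Debian Jessie one way to make them stick is a post-up hook in /etc/network/interfaces (the interface name, address and the exact list of flags below are just an illustration of my setup):

# /etc/network/interfaces (excerpt) - reapply the offload settings every time
# eth1 is brought up. Interface name and address are placeholders.
auto eth1
iface eth1 inet static
    address 10.0.0.3
    netmask 255.255.255.0
    post-up /sbin/ethtool -K eth1 tx off rx off sg off tso off ufo off gso off gro off lro off rxvlan off txvlan off rxhash off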


Wow, this is both amazing and super scary!
We have run into what seems to be exactly the same problem; the difference is that we are on AWS (Frankfurt). Can it be that thousands upon thousands of AWS servers have this problem? What a scandal that would be!

About to open a ticket with AWS - let's see how this unfolds...

Thanks for the amazing job of hunting this down - with some luck, your findings will help us as well!


Would be very interested in knowing what you find out!

I actually noticed that not all of the problems went away - the number of slow requests to ES dropped a lot and they are no longer reproducible, but still about 0.05% of requests are slow without any explanation (200ms instead of 2ms). That is a level I can live with, but I still find it irritating, because I have fine-tuned everything and there is no sensible reason why these requests take that long, or why the number of slow requests is so consistent (between 0.03% and 0.07% every day, no matter how many requests there are or what else the servers are doing).

So something strange is going on with networking, or maybe just with these network cards in general. I also had another bizarre but slightly different occurrence: while repopulating a filesystem cache of pictures delivered by the database over 6 days (so the servers were working a lot more and pulling data from the DB/ES server every minute), I noticed the average time it took to generate a website response got faster - it went from 18ms to 13-14ms. After the cache run was done, it went back up to 18ms. And the site gets about 150'000 requests every day, so being that much faster on average is quite a big deal. It seems strange that servers would noticeably speed up when given more work. The number of slow requests stayed the same, by the way. Strange things seem to be going on with networking performance in general, without a way to speed things up consistently.
