I am using Elastic Cloud for my Elasticsearch instance. My issue is that sometimes (rarely) I'm seeing strange behavior where the curl time (round trip from my web server to Elastic Cloud) is taking 8+ seconds but the 'took' time in the response is approx. 45ms. It's not the same query every time and doesn't happen often (maybe once or twice a day out of millions of queries). Every time I run the query manually, it's super fast and the Profiler says it looks good.
This is the typical behavior, but about every 5 days, I get a spraying of these where it happens to every query I'm sending to Elastic at the same time. So I end up with about 50-60 queries that are slow in curl time but fast in 'took' time. I do an auto-retry and they work just fine.
My network guy has looked and said we are seeing 0 return packets, not even completing the 3-way handshake from the Elastic Cloud during these spraying events. We have a huge pipe and have no other network events at these times.
Does anyone have any ideas?
This sounds like the web server handling the API requests for Elasticsearch is getting overloaded, but I don't know all the pieces in the stack.
'took' is the actual query time within Elasticsearch: the time it took to execute the query, measured from when Elasticsearch receives the query until it is ready to return the results.
It is not the round-trip HTTP request/response time.
So, given the above, there are many reasons why curl can take longer. Unfortunately, in my experience intermittent network delays can be difficult to catch and diagnose.
And that is not to say it is all on your side. In Elastic Cloud there are some components in front of your Elasticsearch cluster.
There is an Edge Proxy...
But that said, if we were having repeated issues I suspect we would be getting a number of alerts / calls...
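To put numbers on the gap between 'took' and the round trip, one option is to log both for every request from the web server. The following is a minimal Python sketch, not anything Elastic ships: `latency_breakdown` and `timed_search` are hypothetical helper names, and the URL, query, and headers are whatever your application already sends.

```python
import json
import time
import urllib.request


def latency_breakdown(round_trip_s, response_body):
    """Split a measured round trip into Elasticsearch 'took' time and the rest."""
    took_ms = json.loads(response_body)["took"]  # server-side query time in ms
    round_trip_ms = round_trip_s * 1000.0
    return {
        "round_trip_ms": round_trip_ms,
        "took_ms": took_ms,
        # Everything that is not query execution: DNS, TCP/TLS setup,
        # proxies, network transit, and response transfer.
        "overhead_ms": round_trip_ms - took_ms,
    }


def timed_search(url, query, headers=None, timeout=30):
    """POST a search body and time the whole HTTP exchange (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    return latency_breakdown(time.perf_counter() - start, body)
```

A request with a large `overhead_ms` and a small `took_ms` spent its time outside query execution, which is the pattern described in the original post.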
Thanks for the response. That was my understanding of what 'took' is, which is why I posted my predicament here.
My network guy says "not us", so I'm just trying to figure out the pieces in between the point where 'took' is calculated and the internet. You mentioned the Edge Proxy. Is there anything else?
Like I said, there is only one component that I ever really looked at: the proxy that sits in front. But that is very heavily monitored, as all our customer traffic passes through those proxies (there are many, distributed). If there are any latency issues, the team is usually directly on it.
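Since the packet capture mentioned earlier in the thread points at the TCP handshake not completing, it may help to time the connection phase on its own, separately from the HTTP request. This is a rough Python sketch under that assumption; `time_tcp_connect` is a hypothetical helper, and the host and port would be those of your own Elastic Cloud endpoint.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=15.0):
    """Return seconds needed to complete the TCP handshake, or None on failure."""
    start = time.perf_counter()
    try:
        # create_connection returns only after the three-way handshake succeeds.
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return None
```

Running this on a schedule from the web server and logging the result gives a timeline to compare against the slow queries. A slow or failed connect at the same moment as a fast 'took' would suggest the delay sits somewhere before the request reaches Elasticsearch.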