'took' time fast, curl sometimes slow

I am using Elastic Cloud for my Elasticsearch instance. My issue is that sometimes (rarely) I'm seeing strange behavior where the curl time (round trip from my web server to Elastic Cloud) is taking 8+ seconds but the 'took' time in the response is approx. 45ms. It's not the same query every time and doesn't happen often (maybe once or twice a day out of millions of queries). Every time I run the query manually, it's super fast and the Profiler says it looks good.

This is the typical behavior, but about every 5 days, I get a spraying of these where it happens to every query I'm sending to Elastic at the same time. So I end up with about 50-60 queries that are slow in curl time but fast in 'took' time. I do an auto-retry and they work just fine.

My network guy has looked and said we are seeing 0 return packets, not even completing the 3-way handshake from the Elastic Cloud during these spraying events. We have a huge pipe and have no other network events at these times.

Does anyone have any ideas?

This sounds like the web server handling the API requests for Elasticsearch is getting overloaded, but I don't know all the pieces in the stack.

How can 'took' be fast and curl be slow?

Thanks in advance

Hi @ryans

'took' is the actual query time within Elasticsearch: the time it takes to execute the query, from when Elasticsearch receives it until it is ready to return the results.

It is not the round-trip HTTP request/response time.
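To make the difference concrete, here is a minimal Python sketch that times a search from the client side and subtracts the 'took' value from the response body. The function names `split_latency` and `timed_search` are made up for illustration, and the URL and any auth headers are placeholders you would have to supply for your own deployment. It only assumes the search response is JSON with a top-level `took` field in milliseconds.

```python
import json
import time
import urllib.request
from typing import Optional


def split_latency(round_trip_s: float, took_ms: int) -> dict:
    """Split a client-side round trip into time inside Elasticsearch and the rest."""
    round_trip_ms = round_trip_s * 1000.0
    return {
        "round_trip_ms": round(round_trip_ms, 1),
        "took_ms": took_ms,
        # Everything that is not query execution: DNS, TCP/TLS setup,
        # proxies, queuing, and transfer of the response.
        "outside_es_ms": round(round_trip_ms - took_ms, 1),
    }


def timed_search(url: str, query: dict, headers: Optional[dict] = None,
                 timeout: float = 30.0) -> dict:
    """POST a search and compare wall-clock time with the server-reported 'took'.

    `url` is a placeholder, e.g. the _search endpoint of your own deployment.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", **(headers or {})},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    return split_latency(elapsed, payload["took"])
```

With the numbers from this thread, `split_latency(8.0, 45)` reports `outside_es_ms` as 7955.0, meaning nearly all of the delay is spent somewhere other than query execution.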

Given that, there are many reasons why curl can take longer. Unfortunately, in my experience intermittent network delays can be difficult to catch and diagnose.

And that is not to say it is all on your side. In Elastic Cloud there are some components in front of your Elasticsearch cluster.

There is an Edge Proxy...

But that said, if we were having repeated issues I suspect we would be getting a number of alerts / calls...

Thanks for the response. That was my understanding of what 'took' is, which is why I posted my predicament here.
My network guy says "not us", so I'm just trying to figure out the pieces in between the point where 'took' is calculated and the internet. You mentioned the Edge Proxy. Is there anything else?

:slight_smile:

Do a traceroute and look at all the hops in between... could be any of them...
Intermittent Networking debugging... super hard....
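One way to narrow it down is to log each connection phase separately, so that when a slow request shows up you can tell whether the time went into DNS, the TCP handshake, or the TLS handshake. Below is a rough Python sketch using only the standard library. `timed` and `probe` are hypothetical helper names, and the host you pass in is whatever your own endpoint is. It only tests connection setup and does not send a query.

```python
import socket
import ssl
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def probe(host: str, port: int = 443, timeout: float = 10.0,
          use_tls: bool = True) -> dict:
    """Time DNS lookup, TCP handshake, and TLS handshake as separate phases.

    A phase that fails or times out is missing from the result, and the
    exception is recorded under "error".
    """
    phases = {}
    try:
        infos, phases["dns_s"] = timed(
            socket.getaddrinfo, host, port, type=socket.SOCK_STREAM
        )
        family, address = infos[0][0], infos[0][4]
        sock = socket.socket(family, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            # connect() returns once the 3-way handshake has completed.
            _, phases["tcp_connect_s"] = timed(sock.connect, address)
            if use_tls:
                context = ssl.create_default_context()
                sock, phases["tls_s"] = timed(
                    context.wrap_socket, sock, server_hostname=host
                )
        finally:
            sock.close()
    except OSError as exc:  # covers DNS errors, timeouts, refusals, TLS errors
        phases["error"] = f"{type(exc).__name__}: {exc}"
    return phases
```

Running `probe` on a schedule from the web server and logging the result with a timestamp would show whether the slow periods line up with a stalled `tcp_connect_s`, which is what a missing handshake reply would look like from the client.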

If you can repeat it, or you have captured it, Support can look on our side up to our edge... but there is a lot in between, I suspect.


Thanks. Do you know of any pieces within the Elastic Cloud stack that could be investigated after the 'took' time is calculated?

Like I said, there is only one component that I have ever really looked at: the proxy that sits in front. But that is very heavily monitored, as all our customer traffic passes through those (there are many, distributed). If there are any latency issues, the team is usually right on it.

Thanks. I thought you knew of some others based on your previous reply.
