How to track down slow Elasticsearch queries?

Hi there,

I need some advice on how to diagnose slow Elasticseach queries.

Setup

  • 2 node cluster in ElasticCloud (1 primary shard, 1 replica). note: ElasticCloud = no slowlog.
  • Interacting with cluster via my Azure .NET Web App, using the NEST library

Behaviour

  • Most response times for my web server are 50-80ms
  • All query times in ES (e.g took) are < 5ms.
  • Network latency between my web server and ElasticCloud is about 15ms

Problem
Sometimes, the response times jump between 100-200ms, but the took is still 1ms. I was able to replicate this behaviour on local too (using ElasticSearch docker).

Here's a trace from Fiddler i captured, which is the call to Elasticsearch from my app:

ClientConnected:            17:28:44.325
ClientBeginRequest:    17:34:34.953
GotRequestHeaders:    17:34:34.953
ClientDoneRequest:    17:34:34.953
Determine Gateway:    0ms
DNS Lookup:         0ms
TCP/IP Connect:            0ms
HTTPS Handshake:    0ms
ServerConnected:            17:34:27.538
FiddlerBeginRequest:    17:34:34.953
ServerGotRequest:    17:34:34.953
ServerBeginResponse:    17:34:35.171
GotResponseHeaders:    17:34:35.171
ServerDoneResponse:    17:34:35.172
ClientBeginResponse:    17:34:35.172
ClientDoneResponse:    17:34:35.178

So, above is saying the Elasticsearch cluster took 218ms to process the request. However the took is 1ms.

How can i track this slow request down? Clearly it's not the query speed (since took is low), so it must be something in the cluster.

Any advice?

Hi,

the took timer only starts to tick once the search is ready to be processed so I wonder whether in your case you have a lot of requests queued up. As you mention, the slowlog would not be of any use because once the query is ready to be processed, it's executed quickly. I would start by checking the node stats API and watch the thread pools' queue lengths.

Daniel

1 Like

@danielmitterdorfer thanks for jumping in on this. much appreciated :slight_smile:

If you check my Stackoverflow question i also raised with the same issue, you'll see all the stats from Kibana, including 'read threads': https://stackoverflow.com/questions/54857104/how-to-track-down-slow-elasticsearch-queries

I assume this is the same stat as the one you're wanting from the node stats API? Also - that API would only ever give me a 'point in time' representation... which is hard to get during a load test, moreso since the 'slowness' only appears to occur for a few seconds.

So - does the StackOverflow pics i posted help in any way? or is there more stats i should provide/look into?

Many thanks again. Been stumped on this one for a while, so glad someone is helping me out :slight_smile:

I had a look at the charts but nothing stands out immediately: GC is fine, also the search thread pool's queue never spikes although you might be fooled by the sampling frequency, I'd still double-check the raw numbers via the node stats API as I've suggested, especially the thread pool statistics. If that does not reveal anything I'd take a closer look at the network between your client and Elasticsearch.

K, I'll try the node stats API.

I don't believe it's the network. Proof of that is the Fiddler trace I captured. You can see full trace in Stack overflow question, but the pertinent part is this:

ServerGotRequest:    17:34:34.953 
ServerBeginResponse:    17:34:35.171 

So that's 200+ms for Elasticsearch to produce a response, eg not on the external network.

I'm wondering if it's an elastic cloud network issue? Between the ELB and the nodes etc?

Is there anywhere I can see traces of requests coming into the cluster, and going out?

@danielmitterdorfer

regarding the node stats API, what numbers should i be looking for?
i'm guessing we're interested in this part:

"search" : {
          "threads" : 4,
          "queue" : 0,
          "active" : 0,
          "rejected" : 0,
          "largest" : 4,
          "completed" : 13020477
        },

Should i just be looking to make sure the queue is empty?

Also - isn't this just the same as the 'read threads' stat i posted in StackOverflow?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.