How to track down slow Elasticsearch queries?


#1

Hi there,

I need some advice on how to diagnose slow Elasticseach queries.

Setup

  • 2 node cluster in ElasticCloud (1 primary shard, 1 replica). note: ElasticCloud = no slowlog.
  • Interacting with cluster via my Azure .NET Web App, using the NEST library

Behaviour

  • Most response times for my web server are 50-80ms
  • All query times in ES (e.g took) are < 5ms.
  • Network latency between my web server and ElasticCloud is about 15ms

Problem
Sometimes, the response times jump between 100-200ms, but the took is still 1ms. I was able to replicate this behaviour on local too (using ElasticSearch docker).

Here's a trace from Fiddler i captured, which is the call to Elasticsearch from my app:

ClientConnected:            17:28:44.325
ClientBeginRequest:    17:34:34.953
GotRequestHeaders:    17:34:34.953
ClientDoneRequest:    17:34:34.953
Determine Gateway:    0ms
DNS Lookup:         0ms
TCP/IP Connect:            0ms
HTTPS Handshake:    0ms
ServerConnected:            17:34:27.538
FiddlerBeginRequest:    17:34:34.953
ServerGotRequest:    17:34:34.953
ServerBeginResponse:    17:34:35.171
GotResponseHeaders:    17:34:35.171
ServerDoneResponse:    17:34:35.172
ClientBeginResponse:    17:34:35.172
ClientDoneResponse:    17:34:35.178

So, above is saying the Elasticsearch cluster took 218ms to process the request. However the took is 1ms.

How can i track this slow request down? Clearly it's not the query speed (since took is low), so it must be something in the cluster.

Any advice?


(Daniel Mitterdorfer) #2

Hi,

the took timer only starts to tick once the search is ready to be processed so I wonder whether in your case you have a lot of requests queued up. As you mention, the slowlog would not be of any use because once the query is ready to be processed, it's executed quickly. I would start by checking the node stats API and watch the thread pools' queue lengths.

Daniel


#3

@danielmitterdorfer thanks for jumping in on this. much appreciated :slight_smile:

If you check my Stackoverflow question i also raised with the same issue, you'll see all the stats from Kibana, including 'read threads': https://stackoverflow.com/questions/54857104/how-to-track-down-slow-elasticsearch-queries

I assume this is the same stat as the one you're wanting from the node stats API? Also - that API would only ever give me a 'point in time' representation... which is hard to get during a load test, moreso since the 'slowness' only appears to occur for a few seconds.

So - does the StackOverflow pics i posted help in any way? or is there more stats i should provide/look into?

Many thanks again. Been stumped on this one for a while, so glad someone is helping me out :slight_smile:


(Daniel Mitterdorfer) #4

I had a look at the charts but nothing stands out immediately: GC is fine, also the search thread pool's queue never spikes although you might be fooled by the sampling frequency, I'd still double-check the raw numbers via the node stats API as I've suggested, especially the thread pool statistics. If that does not reveal anything I'd take a closer look at the network between your client and Elasticsearch.


#5

K, I'll try the node stats API.

I don't believe it's the network. Proof of that is the Fiddler trace I captured. You can see full trace in Stack overflow question, but the pertinent part is this:

ServerGotRequest:    17:34:34.953 
ServerBeginResponse:    17:34:35.171 

So that's 200+ms for Elasticsearch to produce a response, eg not on the external network.

I'm wondering if it's an elastic cloud network issue? Between the ELB and the nodes etc?

Is there anywhere I can see traces of requests coming into the cluster, and going out?


#6

@danielmitterdorfer

regarding the node stats API, what numbers should i be looking for?
i'm guessing we're interested in this part:

"search" : {
          "threads" : 4,
          "queue" : 0,
          "active" : 0,
          "rejected" : 0,
          "largest" : 4,
          "completed" : 13020477
        },

Should i just be looking to make sure the queue is empty?

Also - isn't this just the same as the 'read threads' stat i posted in StackOverflow?


(system) closed #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.