High Request Overhead Not Captured by Took

Zach_Diemer · October 18, 2019, 8:09pm

Hey all! I've been trying to nail down where I can improve performance of my Elasticsearch cluster, and I've been specifically attempting to find out more about what is causing some high overhead in my end-to-end request.

I'm using Elasticsearch 6.4.2 and my client is Elasticsearch.Net 6.3.1 (NEST in .NET). I have my cluster deployed as a GKE deployment with six nodes. I have 11 indices, each with a single shard and 5 replicas. None of my indices are larger than about 4GB.

I'm capturing end-to-end request latency as well as the took parameter that Elastic reports back. Took is regularly only 50% of the end-to-end request latency. End-to-end is roughly 600ms on average. Even when I query using size: 0, I'm still seeing that overhead. I've done a pod-to-pod request within my Kubernetes cluster and found that the network latency is roughly 15-40ms, so I've ruled out network latency.

My NEST queries are quite long because I'm doing a complex faceted multi-match query across my 11 indices. This requires a bunch of aggregations, filters, and post-filters. Additionally, I'm including a highlighter and score functions. An average NEST query comes out to about 800 lines of JSON.

Is this overhead really only due to serialization? Is there any way I can pare this down?

Zach_Diemer · October 21, 2019, 6:10pm

Hey all, just wanted to include this chart that shows a breakdown in our end-to-end request latency so y'all can see what I'm describing. search-took is the took value that Elasticsearch reports back, whereas search-query-ms-excluding-took is purely the await ElasticClient.SearchAsync method minus the took value.

(search-handler-ms-excluding-query and search-latency-web-full are additional overhead where we're pulling data from our database, etc.)

You can see how the search-query-ms-excluding-took regularly exceeds took. What's going on?
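
As a rough illustration of how the two series above relate, here is a minimal sketch in Python (not the C#/NEST code from the post). The `search` callable and the `split_latency` name are hypothetical stand-ins for the real client call; the only thing assumed about the response is that it carries a `took` field in milliseconds.

```python
import time


def split_latency(search, body):
    """Call search(body) and split the wall time into took and non-took parts.

    `search` is any callable returning a dict-like Elasticsearch response
    with a "took" field in milliseconds (hypothetical stand-in for the
    real client call, e.g. an awaited SearchAsync in NEST).
    """
    start = time.perf_counter()
    response = search(body)
    wall_ms = (time.perf_counter() - start) * 1000.0

    took_ms = response["took"]
    return {
        "search-took": took_ms,
        # Everything the client waited for that Elasticsearch did not
        # count as took: client serialisation, network, HTTP handling,
        # response deserialisation.
        "search-query-ms-excluding-took": wall_ms - took_ms,
    }
```

Note that the measured wall time here includes whatever the client does to serialise the request and deserialise the response, which is why the second number can exceed took.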

Armin_Braun · October 21, 2019, 7:17pm

Hi @Zach_Diemer,

I think the problem here is the combination of relatively high latency in your cluster and complicated queries. Assuming you have an average latency of 30ms and you're seeing 300ms of overhead, then 60ms of that is probably already explained by network latency (since quite some time passes between the server receiving the request and responding, I'm estimating the latency as roughly two separate network requests).
Sending 800 lines of JSON and deserialising them will definitely take non-trivial time on the server, so that is likely another chunk of your wall time.
Also, your client will take non-trivial time to serialise those queries.
The same then goes for the query response (it has to be serialised on the ES end and deserialised by the client).
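
To make the estimate above concrete, here is a small worked calculation in Python. The figures (600ms end-to-end, took at roughly 50%, 30ms average latency counted twice) come from this thread; the function name and parameters are made up for illustration.

```python
def unexplained_overhead_ms(end_to_end_ms, took_ms, latency_ms, network_trips=2):
    """Overhead left over after subtracting took and estimated network time.

    network_trips=2 follows the estimate above of treating the request and
    the response as roughly two separate network requests (an assumption,
    not a measurement).
    """
    overhead_ms = end_to_end_ms - took_ms
    network_ms = latency_ms * network_trips
    return overhead_ms - network_ms


# 600 ms end-to-end, took about half of that, 30 ms average latency
print(unexplained_overhead_ms(600, 300, 30))  # 240
```

Under these assumptions, about 240ms would remain to be explained by serialisation and deserialisation on both ends plus any other client or HTTP overhead.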

One way of gaining some insight into how much latency serialisation adds to the end-to-end time may be to record a few queries as JSON strings/files and run them via curl (or some other REST client), to isolate the time cost of serialising the query on the client.
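
A minimal sketch of that replay idea, using Python's standard library as the "other REST client". The URL, the file name in the usage note, and the `replay`/`post_json` helpers are assumptions for illustration. Because the body is already serialised JSON, client-side query serialisation drops out of the measurement.

```python
import json
import time
import urllib.request


def post_json(url, body):
    """POST pre-serialised JSON bytes and return the raw response bytes."""
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def replay(body, send):
    """Time one recorded query; return (wall_ms, took_ms, overhead_ms).

    `send` takes the request bytes and returns the response bytes. Parsing
    the response happens after the timer stops, so client-side
    deserialisation is not counted either.
    """
    start = time.perf_counter()
    raw = send(body)
    wall_ms = (time.perf_counter() - start) * 1000.0

    took_ms = json.loads(raw)["took"]
    return wall_ms, took_ms, wall_ms - took_ms
```

Usage might look like reading a recorded query with `open("recorded_query.json", "rb").read()` and calling `replay(body, lambda b: post_json("http://localhost:9200/_search", b))`, with the host and index path adjusted to the real cluster. Comparing the overhead from this replay with the overhead seen through NEST gives a rough idea of how much the client's serialisation contributes.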

system · November 18, 2019, 7:17pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
