Sparse slow NEST Searches

Hi everyone,

We are using ElasticSearch Nest client to build an API with Dot Net Core to run queries towards an ElasticSearch cluster. Most of the times, we run the same type of query with a filter that's provided as part of the API call, so we have wrapped it into a method like this:

var stopWatch = new Stopwatch();    // stopwatch added to investigate the issue
stopWatch.Start();

var searchRequest = new SearchRequest<Price>
 {
     Size = 0,
     Query = new BoolQuery()
     {
         Must = _filterTermQueries_
     },
     Aggregations = aggrs
 };
var results = indexClient.Search<Price>(searchRequest);

timeTracking.Enqueue($"Search done - {stopwatch.ElapsedMilliseconds} - {results.Took}");

Most of the time, the stopwatch time is +-50 ms more than the results.Took time, understandable due to serialization and deserialization of the request/response.

The problem is that, every once in a while, the difference between the two is up to 3 seconds, hence the API takes more than 3 seconds to answer even though ElasticSearch indexes only spent milliseconds finding the data.
Note this happens with load (via JMeterLoadTest) and without load (single call via postman).

I've read in this discussion that: "Bear in mind that the first query to Elasticsearch from NEST will be much slower because the client caches a lot of delegates, compiled expressions and json serialization properties on this first call."

Hence I question myself:

  1. For how long are those things cached? Could it be that those caches are evicted every once in a while and we have to pay this price again?
  2. Is there any way to gather extra information about them at runtime?

Any help would be appreciated, thanks!

Hey @Xavi_Ametller,

The caches used are unbounded so no previously cached data is evicted for the lifetime of the AppDomain.

Could you provide some additional detail on your setup?

  1. What version of NEST are you using?
  2. What OS/Runtime are you running on? You mention .NET Core, is this on Windows, Linux, etc?

You should be able to get more information from the .DebugInformation property on the response. If this isn't yielding any further insight, you could also look at configuring network tracing.

Hi @forloop, thanks for your pointers,

First of all, let me answer your questions:

  1. We are using NEST 5.2.0.
  2. We are running on Windows, more concretely an Azure WebApp on a S1 Service Plan (1 core, 1.75 GB Ram). The API is the only app running on the service plan.

If I look at the .DebugInformationof the response when we have this problem (I can reproduce it running the API locally with the index in the cloud) I can see that the API call sometimes takes 2 orders of magnitude more than the response from Elastic:
Valid NEST response built from a successful low level call on POST: /Index/price/_search
# Audit trail of this API call:
- [1] HealthyResponse: Node: XXX Took: 00:00:00.7833709
# Request:
{
...
}
# Response:
{"took":95,"timed_out":false,"_shards":{"total":80,"successful":80,"failed":0}, ... }

On top of that, I'd like to add that if we repeat the same exact query several times, performance is good most of the times, so it doesn't seem to be query-related.

To be fair, it is not clear to me what operations are included in the first took (the slow one) - I understand the number is obtained by NEST on the client side but correct me if I'm wrong. If that were the case, is there any way to obtain profiling information about how much each operation took on NEST side (serialize request, send, receive answer, build response object).

In any case, if the caches are never evicted, then I would expect that serializing the request and building the response should take a negligible amount of time.

I would like to add that we use this method as part of a bunch of tasks that are executed in parallel, so we also considered if the time could be caused by task switching. We have replaced the call to the method by a Sleep and the problem is not happening anymore.
We also have tried to change the execution to a synchronous implementation (without tasks) and the problem persists but happens less often, so we wonder if maybe its a combination of NEST + tasks...

Any insights will be appreciated, thanks!

Just for the record, I have identified what seems to be the problem (although have not fixed it). Seems that Dot net core and Nancy FX do not play very well together and the problem has nothing to do with NEST.
To demonstrate it, I have created an endpoint in the API which returns a constant string (hello world stuff). That endpoint also has sparse peaks of bad performance as well.

Also reinforcing the hypothesis, I have moved the API to .Net 4.5 and WebApi 2.0 and everything works now like a charm (hello world and NEST) with much more higher throughput thanks to the consistently slow response times.
Would love to know where the real problem is, but since the API is a proof of concept I decided not to spend more time investigating this.
Kind regards,

Thanks for the update @Xavi_Ametller, glad you got to the bottom of it :smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.