High "took" time but low query time

Hello,

We are trying to identify the source of high latency in searches against our Elasticsearch cluster. Elasticsearch returns a field called "took", which is the time in milliseconds a request spends within ES (i.e. discounting time spent in transit over the network). We've been using the Profile API to get detailed timing information on our requests, and we're seeing ES report a "took" time of 4000ms - 5000ms for ~20% of our queries. Looking at the information provided in the "profile" section, however, actual query execution took only 1.7ms.
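
For illustration, the profiled requests we run look roughly like this (the index name and query below are placeholders, not our real ones):

GET /my-index/_search
{
  "profile": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "active" } }
      ]
    }
  }
}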

As I understand it, took time = query time + fetch time.

To reduce the risk of high fetch times, we turned off _source retrieval and added a routing value to our searches so that only one shard is ever involved. We can reproduce the issue with ad hoc search requests when there is no other search traffic, so I don't believe it is caused by thread contention on the nodes or other scaling issues. We also forced caching off at the request level, yet the same request sometimes exhibits this behavior and at other times executes in less than 10ms, so the issue is not deterministically tied to the search query.
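
With those mitigations applied, the reproduction request looks roughly like this (routing value, index and field names are placeholders):

GET /my-index/_search?routing=tenant-42&request_cache=false
{
  "profile": true,
  "_source": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "tenant_id": "tenant-42" } }
      ]
    }
  }
}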

I have two questions:

  1. Has anyone seen anything similar or have any ideas where else latency could be introduced?
  2. We've pretty much ruled out query time, and from our experiments fetch time seems good, but is there an API that explicitly gives us fetch time?

Sample search response (with profiling enabled):

{
  "took": 5458,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 111,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "profile": {
    "shards": [
      {
        "id": "[truncated][truncated][11]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "time": "1.7ms",
                "time_in_nanos": 1728701,
                "breakdown": {
                  ...truncated...
                },
                "children": [
                  ...truncated...
                ]
              }
            ],
            "rewrite_time": 8160,
            "collector": [
              {
                "name": "CancellableCollector",
                "reason": "search_cancelled",
                "time": "143.1micros",
                "time_in_nanos": 143186,
                "children": [
                  {
                    "name": "EarlyTerminatingCollector",
                    "reason": "search_count",
                    "time": "46.5micros",
                    "time_in_nanos": 46599
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": []
      }
    ]
  }
}

Elasticsearch version: 7.1

"reason": "search_cancelled"

Do you know if your search request is being cancelled because of a timeout or a manual cancellation?

You've probably thought of these, but:

Any chance a GC is running on one of the affected nodes? It should be fast, but it might cause a timeout or a retry.

And of course, check that there is no swap on any node; it is famous for inducing stalls.

Or does it correlate with a particular coordinating node, i.e. the node that handles the query & fetch work?
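
Something like this (just a sketch; exact field names in the stats vary a bit by version) dumps GC activity and swap usage per node, which can help rule those two out:

GET _nodes/stats/jvm,os?filter_path=nodes.*.name,nodes.*.jvm.gc,nodes.*.os.swap

If collection_time_in_millis jumps on the affected node around a slow search, or os.swap.used_in_bytes is non-zero, that points at GC pauses or swapping.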

I'm not sure how to confirm the search_cancelled reason, but I'm not aware of any manual cancellation. The top-level timed_out is false, if that is related.

Overriding the default index.refresh_interval setting seems to have reduced our latency significantly. I set it explicitly to 1s, which apparently behaves differently from leaving the setting at its default.

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html
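
For anyone else hitting this, the change was just an explicit dynamic settings update (index name below is a placeholder):

PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}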

It would be nice if there were a way to deduce this from the "took" time / Profile API, since for us it only came up through random trial & error.


Interesting - in theory it should not matter much, but I can see how a low search rate combined with the default behavior (which refreshes on search) could cause a few seconds of latency, depending on the shards (it should refresh them all in parallel across threads, I think, but things are never that clean).

Note you can also set the interval longer, e.g. 30s, to improve indexing performance and lower load; queries will simply not see the data that hasn't been refreshed yet, but at least they won't block on it.
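
For example, something like this (index name is a placeholder) sets the longer interval, and the GET with include_defaults shows what the index is actually running with, whether or not the setting was made explicit:

PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}

GET /my-index/_settings?include_defaults=true&flat_settings=true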

