High "took" time but low query time

Hello,

We are trying to identify the source of high latency in searches against our Elasticsearch cluster. Elasticsearch returns a field called "took", which is the time in milliseconds a request spends within ES (i.e. discounting time spent in transit over the network). We've been using the Profile API to get detailed timing information on our requests, and we're seeing ES report a "took" time of 4000ms - 5000ms for ~20% of our queries. Looking at the information provided in the "profile" section, however, actual query execution took only 1.7ms.
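
For illustration, the profiled requests we run look roughly like this (the index name and query below are placeholders, not our real ones):

GET /my-index/_search
{
  "profile": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "active" } }
      ]
    }
  }
}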

As I understand it, took time = query time + fetch time.

To reduce the risk of high fetch times, we turned off _source retrieval and added a routing value to our searches so that only one shard is ever involved. We can reproduce the issue with ad hoc search requests when there is no other search traffic, so I don't believe it is caused by thread contention on the nodes or other scaling issues. We also forced caching off at the request level, yet the same request sometimes exhibits this behavior and at other times executes in less than 10ms, so the issue is not deterministically tied to the search query.
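
With those mitigations applied, the reproduction request looks roughly like this (routing value, index and field names are placeholders):

GET /my-index/_search?routing=tenant-42&request_cache=false
{
  "profile": true,
  "_source": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "tenant_id": "tenant-42" } }
      ]
    }
  }
}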

I have two questions:

  1. Has anyone seen anything similar or have any ideas where else latency could be introduced?
  2. We've pretty much ruled out query time, and from our experiments fetch time seems good, but is there an API that explicitly gives us fetch time?

Sample search response (with profiling enabled):

{
  "took": 5458,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 111,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "profile": {
    "shards": [
      {
        "id": "[truncated][truncated][11]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "time": "1.7ms",
                "time_in_nanos": 1728701,
                "breakdown": {
                  ...truncated...
                },
                "children": [
                  ...truncated...
                ]
              }
            ],
            "rewrite_time": 8160,
            "collector": [
              {
                "name": "CancellableCollector",
                "reason": "search_cancelled",
                "time": "143.1micros",
                "time_in_nanos": 143186,
                "children": [
                  {
                    "name": "EarlyTerminatingCollector",
                    "reason": "search_count",
                    "time": "46.5micros",
                    "time_in_nanos": 46599
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": []
      }
    ]
  }
}

Elasticsearch version: 7.1

"reason": "search_cancelled"

Do you know if your search request is being cancelled because of a timeout or a manual cancellation?

You've probably thought of these, but:

Any chance a GC is running on one of the affected nodes? It should be fast, but it might cause a timeout or a retry.

And of course, check that there is no swap on any node; it is famous for inducing stalls.

Or does it correlate with a particular coordinating node, i.e. the node that handles the query & fetch work?
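
Something like this (just a sketch; exact field names in the stats vary a bit by version) dumps GC activity and swap usage per node, which can help rule those two out:

GET _nodes/stats/jvm,os?filter_path=nodes.*.name,nodes.*.jvm.gc,nodes.*.os.swap

If collection_time_in_millis jumps on the affected node around a slow search, or os.swap.used_in_bytes is non-zero, that points at GC pauses or swapping.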

I'm not sure how to confirm the search_cancelled reason, but I'm not aware of any manual cancellation. The top-level timed_out is false, if that is related.

Overriding the default index.refresh_interval setting seems to have reduced our latency significantly. I set it explicitly to 1s, which apparently behaves differently from leaving the setting at its default.

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html
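
For anyone else hitting this, the change was just an explicit dynamic settings update (index name below is a placeholder):

PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}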

It would be nice if there were a way to deduce this from the "took" time / Profile API, since for us it only came up through random trial & error.


Interesting - in theory it should not matter much, but I can see how a low search rate combined with the default behavior (which refreshes on search) could cause a few seconds of latency, depending on the shards (it should refresh them all in parallel across threads, I think, but things are never that clean).

Note you can also set the interval longer, e.g. 30s, to improve indexing performance and lower load; queries will simply not see the data that hasn't been refreshed yet, but at least they won't block on it.
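
For example, something like this (index name is a placeholder) sets the longer interval, and the GET with include_defaults shows what the index is actually running with, whether or not the setting was made explicit:

PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}

GET /my-index/_settings?include_defaults=true&flat_settings=true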

