Search request's timeout doesn't work as expected

ES Version: 7.15.0

We were exploring the significant_text aggregation of ES (via the REST API) for generating a word cloud from our data. Since the query was taking a long time to execute, we decided to use a timeout. Here is what the official documentation says about timeout:

Each shard collects hits within the specified time period. If collection isn’t finished when the period ends, Elasticsearch uses only the hits accumulated up to that point. The overall latency of a search request depends on the number of shards needed for the search and the number of concurrent shard requests.

And here is my search request.

GET /<24-indices>/_search
{
  "timeout": "500ms",
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "channelId": [
              // 50+ channel ids
              ]
          }
        },
        {
          "range": {
            "postPublishedOn": {
              "gte": "2021-01-01",
              "lte": "2022-12-31"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "caption",
            "filter_duplicate_text": true,
            "size": 25
          }
        }
      }
    }
  }
}

Here is the result

{
  "took": 29652,
  "timed_out": true,
  "_shards": {
    "total": 24,
    "successful": 24,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 0,
    "hits": [
      ...
      ]
  },
  "aggregations": {
    ...
  }
}

As we can see, the request timed out but still took around 30s despite the timeout being set to 500ms. What am I missing here?
The search went through 24 shards. As per the docs, some of these 24 shard requests would have executed in parallel, each with a timeout of 500ms. Even if they were executed sequentially, 24 x 500ms = 12s, which is way below 30s.
So here are my questions.

  1. Will the requests to different shards execute in parallel in all circumstances? If not, in what scenario would they execute sequentially?
  2. In either case, why is the query taking so long despite timing out?
  3. This one is about the significant_text aggregation. Does it really take that long to complete execution? Below are more details about my dataset.

Shard details:

{
  "docs": {
    "count": 106287465,
    "deleted": 30635607
  },
  "shards": {
    "total_count": 24
  },
  "store": {
    "size_in_bytes": 106441596761,
    "total_data_set_size_in_bytes": 106441596761,
    "reserved_in_bytes": 0
  }
}

Approx matched doc count: 35k

Hi,
The timeout setting is best-effort and relies on timer checks woven into the various “tight loops” in the search code.
One of the heavy loops that is missing timer checks is the one in the significant_text aggregation where it checks the background frequency of all the terms found in matches. For this reason (and to limit memory usage) it is recommended that the “sampler” aggregation is used in conjunction with the significant_text aggregation to limit the number of matching docs analysed and therefore the number of matching terms. Another way to limit the number of background frequency checks is to increase the value of “shard_min_doc_count”, e.g. to 3 or 4, so that only terms seen in at least this number of docs are considered.
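
For illustration, the aggregation from your query would look roughly like this with that setting added (the value 3 here is just an example, per the suggestion above):

"aggs": {
  "sample": {
    "sampler": {
      "shard_size": 10000
    },
    "aggs": {
      "keywords": {
        "significant_text": {
          "field": "caption",
          "filter_duplicate_text": true,
          "size": 25,
          "shard_min_doc_count": 3
        }
      }
    }
  }
}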


You were right about the background frequency. But setting 'shard_min_doc_count' to 3 or 4 has no effect on performance for my data. Setting it to a higher value (~100) has some positive impact on performance, but that won't be feasible as it might filter a lot of terms out of the result.

Like the docs for significant_text say - use the sampler aggregation:

Re-analyzing large result sets will require a lot of time and memory. It is recommended that the significant_text aggregation is used as a child of either the sampler or diversified sampler aggregation to limit the analysis to a small selection of top-matching documents e.g. 200. This will typically improve speed, memory use and quality of results.
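
For example, a rough sketch of the diversified variant with the doc's suggested sample size (the diversity field here, channelId, is just a guess based on your query; use whatever field makes sense for your data):

"aggs": {
  "sample": {
    "diversified_sampler": {
      "shard_size": 200,
      "field": "channelId"
    },
    "aggs": {
      "keywords": {
        "significant_text": {
          "field": "caption",
          "size": 25
        }
      }
    }
  }
}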

Yes, we are using sampler; you can see that in my query. Based on our analysis, it takes approximately 1 second to aggregate 1k records, which looks very slow to me. In the example I provided, 35k records were aggregated, which took 30s.

Is it normal to take 1s to aggregate 1k samples in significant_text?

Ah sorry. Missed that.
The significant_text aggregation reanalyses the text of documents to get at the words used. That is the same process that happens at index time, so it will be roughly similar to indexing speed. The filter_duplicate_text setting will also add some overhead.

Note also that time-based significance doesn’t work too well on time-based indices. It’s hard/impossible for a shard to determine the difference between this week’s content and previous weeks’ if it doesn’t hold any of the previous weeks’ documents.

We haven't configured any custom analysers, so I don't think there should be any analysis overhead. The 'caption' field that we use in significant_text is a text field.

I also tried removing the filter_duplicate_text param; no performance impact. And I tried adding the 'caption' field to the query's _source; no change in performance.

Should I file a GitHub issue for this, as we are moving away from the original topic?

You will be using the default analyzer, which, like all analyzers, has overhead.
There is no bug here. You could try smaller sample sizes, but summarising a year’s worth of content with a small number of docs is possibly a big ask.

The sample that we use isn't really huge itself. Even for some 10k docs, it would take around 10s to aggregate. Is this usual for the significant_text aggregation?

The cost is linear with:

  1. the number of docs sampled
  2. the length of the text in each of the docs
  3. the number of unique terms found in the sample that need their background frequency looked up.

The “usual” cost depends on all these factors.
I haven’t seen anything in your trials that suggests the observed costs aren’t following this profile.

I found that our data indexes at a speed of 1 ms/doc, which explains the latency.

Is there any other aggregation that directly uses the analysed tokens rather than re-analysing? I saw the significant_terms and terms aggregations, but I'm not sure about their internal implementation.
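
Something shaped like this is what I had in mind, if the indexed tokens could be used directly (I understand significant_terms on a text field would need fielddata enabled in the mapping, which I haven't tried and which I gather is heap-hungry):

"aggs": {
  "sample": {
    "sampler": {
      "shard_size": 10000
    },
    "aggs": {
      "keywords": {
        "significant_terms": {
          "field": "caption",
          "size": 25
        }
      }
    }
  }
}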

Posting this on GitHub - significant_text performance issue · Issue #94712 · elastic/elasticsearch · GitHub

Lucene has an option to store term vectors (the raw output of analysis), but that costs extra disk storage, and reading this expanded data back from disk is not substantially faster than re-analysing the original text in many cases.

I’d recommend closing this GitHub issue as it is more of a question than a bug or proposed fix.

Hey @Mark_Harwood1, as stated in the GitHub issue, why isn't there parallel execution? The data is split across 24 shards, which should complete the query 24 times faster than the indexing time. Why is that not the case here?

I think you are assuming that searches happen in parallel across the shards whereas indexing does not. This isn't the case: indexing is also parallelised across shards.

There is.
Parallelisation doesn't help greatly with this problem anyway - answering the question with decent quality requires a global view of term stats and using a distributed data model for parallelisation doesn't lend itself to that. Large, partial sets of term stats need to be streamed from data nodes to the coordinating node for consideration.

It's probably best for you to assume we have spent a good deal of time optimising the performance of this feature and that it is one of those tasks that will always remain expensive. Better perhaps to invest your effort in considering how your application may pre-compute and cache answers to your expensive questions - e.g. I can't imagine summarising a year's worth of content is something every user needs to do on-demand. For example, on my news-analysis website I summarise a year's worth of news headlines in a background batch process rather than computing it over and over for each unique site visitor.


Hey David. I found the indexing speed via the _stats API, and I assume the time specified there is the summed indexing time regardless of whether indexing ran concurrently.

"indexing": {
  "index_total": 376426353,
  "index_time_in_millis": 414117588,
  "index_current": 0,
  "index_failed": 0,
  "delete_total": 0,
  "delete_time_in_millis": 0,
  "delete_current": 0,
  "noop_update_total": 0,
  "is_throttled": false,
  "throttle_time_in_millis": 0
}

The above stat tells us that the average time taken for indexing is about 1 ms per doc.
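(Roughly: 414,117,588 ms of index time / 376,426,353 docs ≈ 1.1 ms per doc.)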

That's our use case. Perhaps we could try some other ways of achieving it.

Anyways, thanks for putting in the effort!


That's true, this is the total time spent indexing across all indexing threads.
