Search request's timeout doesn't work as expected

ES Version: 7.15.0

We were exploring the significant_text aggregation of ES (via the REST API) for generating a word cloud from our data. Since the query was taking a long time to execute, we decided to use a timeout. Here is what the official documentation says about timeout:

Each shard collects hits within the specified time period. If collection isn’t finished when the period ends, Elasticsearch uses only the hits accumulated up to that point. The overall latency of a search request depends on the number of shards needed for the search and the number of concurrent shard requests.

And here is my search request.

GET /<24-indices>/_search
{
  "timeout": "500ms",
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "channelId": [
              // 50+ channel ids
              ]
          }
        },
        {
          "range": {
            "postPublishedOn": {
              "gte": "2021-01-01",
              "lte": "2022-12-31"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "caption",
            "filter_duplicate_text": true,
            "size": 25
          }
        }
      }
    }
  }
}

Here is the result

{
  "took": 29652,
  "timed_out": true,
  "_shards": {
    "total": 24,
    "successful": 24,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 0,
    "hits": [
      ...
      ]
  },
  "aggregations": {
    ...
  }
}

As we can see, the request timed out but still took around 30s despite the timeout being set to 500ms. What am I missing here?
The search went through 24 shards. As per the docs, some of these 24 shard requests would have executed in parallel, each with a timeout of 500ms. Even if they were executed sequentially, 24 x 500ms = 12s, which is way below 30s.
So here are my questions.

  1. Will the requests to different shards execute in parallel in all circumstances? If not, in what scenario would they execute sequentially?
  2. In either case, why is the query taking so long despite timing out?
  3. This one is about the significant_text aggregation. Does it really take that long to complete execution? Below are more details about my dataset.

Shard details:

{
  "docs": {
    "count": 106287465,
    "deleted": 30635607
  },
  "shards": {
    "total_count": 24
  },
  "store": {
    "size_in_bytes": 106441596761,
    "total_data_set_size_in_bytes": 106441596761,
    "reserved_in_bytes": 0
  }
}

Approx matched doc count: 35k

Hi,
The timeout setting is best-effort and relies on timer checks woven into the various “tight loops” in the search code.
One of the heavy loops that is missing timer checks is the one in the significant_text aggregation where it checks the background frequency of all the terms found in matches. For this reason (and to limit memory usage) it is recommended that the “sampler” aggregation is used in conjunction with the significant_text aggregation to limit the number of matching docs analysed and therefore the number of matching terms. Another way to limit the number of background frequency checks is to increase the value of “shard_min_doc_count”, e.g. to 3 or 4, so that only terms seen in at least this number of docs are considered.
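
For illustration, the aggregation from your query would look roughly like this with that setting added (the value 3 here is just an example, per the suggestion above):

"aggs": {
  "sample": {
    "sampler": {
      "shard_size": 10000
    },
    "aggs": {
      "keywords": {
        "significant_text": {
          "field": "caption",
          "filter_duplicate_text": true,
          "size": 25,
          "shard_min_doc_count": 3
        }
      }
    }
  }
}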


You were right about the background frequency. But setting 'shard_min_doc_count' to 3 or 4 has no effect on performance for my data. Setting it to a higher value (~100) has some positive impact on performance, but that won't be feasible as it might filter a lot of terms out of the result.

Like the docs for significant_text say - use the sampler aggregation:

Re-analyzing large result sets will require a lot of time and memory. It is recommended that the significant_text aggregation is used as a child of either the sampler or diversified sampler aggregation to limit the analysis to a small selection of top-matching documents e.g. 200. This will typically improve speed, memory use and quality of results.
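
For example, a rough sketch of the diversified variant with the doc's suggested sample size (the diversity field here, channelId, is just a guess based on your query; use whatever field makes sense for your data):

"aggs": {
  "sample": {
    "diversified_sampler": {
      "shard_size": 200,
      "field": "channelId"
    },
    "aggs": {
      "keywords": {
        "significant_text": {
          "field": "caption",
          "size": 25
        }
      }
    }
  }
}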

Yes, we are using sampler; you can see that in my query. Based on our analysis, it takes approximately 1 second to aggregate 1k records, which looks very slow to me. In the example I provided, 35k records were aggregated, which took 30s.

Is it normal to take 1s to aggregate 1k samples in significant_text?

Ah sorry. Missed that.
The significant_text aggregation reanalyses the text of documents to get at the words used. That is the same process that happens at index time, so it will be roughly similar to indexing speed. The filter_duplicate_text setting will also add some overhead.

Note also that time-based significance doesn’t work too well on time-based indices. It’s hard/impossible for a shard to determine the difference between this week’s content and previous weeks’ if it doesn’t hold any of the previous weeks’ documents.

We haven't configured any custom analysers, so I don't think there should be any analysis overhead. The 'caption' field that we use in significant_text is a text field.

I also tried removing the filter_duplicate_text param; no performance impact. And I tried adding the 'caption' field to the query's _source; no change in performance.

Should I file a GitHub issue for this, as we are moving away from the original topic?

You will be using the default analyzer, which, like all analyzers, has overhead.
There is no bug here. You could try smaller sample sizes, but summarising a year’s worth of content with a small number of docs is possibly a big ask.

The sample that we use isn't really huge itself. Even for some 10k docs, it would take around 10s to aggregate. Is this usual for the significant_text aggregation?

The cost is linear with:

  1. the number of docs sampled
  2. the length of the text in each of the docs
  3. the number of unique terms found in the sample that need their background frequency looked up.

The “usual” cost depends on all these factors.
I haven’t seen anything in your trials that suggests the observed costs aren’t following this profile.

I found that our data indexes at a speed of 1 ms/doc, which explains the latency.

Is there any other aggregation that directly uses the analysed tokens rather than re-analysing? I saw the significant_terms and terms aggregations, but I'm not sure about their internal implementation.
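
Something shaped like this is what I had in mind, if the indexed tokens could be used directly (I understand significant_terms on a text field would need fielddata enabled in the mapping, which I haven't tried and which I gather is heap-hungry):

"aggs": {
  "sample": {
    "sampler": {
      "shard_size": 10000
    },
    "aggs": {
      "keywords": {
        "significant_terms": {
          "field": "caption",
          "size": 25
        }
      }
    }
  }
}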

Posting this on GitHub - significant_text performance issue · Issue #94712 · elastic/elasticsearch · GitHub

Lucene has an option to store term vectors (the raw output of analysis), but that costs extra disk storage, and reading this expanded data back from disk is not substantially faster than re-analysing the original text in many cases.

I’d recommend closing this GitHub issue as it is more of a question than a bug or proposed fix.

Hey @Mark_Harwood1, as stated in the GitHub issue, why isn't there parallel execution? The data is split across 24 shards, which should complete the query 24 times faster than the indexing time. Why is that not the case here?

I think you are assuming that searches happen in parallel across the shards whereas indexing does not. This isn't the case: indexing is also parallelised across shards.

There is.
Parallelisation doesn't help greatly with this problem anyway - answering the question with decent quality requires a global view of term stats and using a distributed data model for parallelisation doesn't lend itself to that. Large, partial sets of term stats need to be streamed from data nodes to the coordinating node for consideration.

It's probably best for you to assume we have spent a good deal of time optimising the performance of this feature and that it is one of those tasks that will always remain expensive. Better perhaps to invest your effort in considering how your application may pre-compute and cache answers to your expensive questions - e.g. I can't imagine summarising a year's worth of content is something every user needs to do on-demand. For example, on my news-analysis website I summarise a year's worth of news headlines in a background batch process rather than computing it over and over for each unique site visitor.


Hey David. I found the indexing speed via the _stats API, and I assume the time specified there is the summed indexing time regardless of whether indexing ran concurrently.

"indexing": {
  "index_total": 376426353,
  "index_time_in_millis": 414117588,
  "index_current": 0,
  "index_failed": 0,
  "delete_total": 0,
  "delete_time_in_millis": 0,
  "delete_current": 0,
  "noop_update_total": 0,
  "is_throttled": false,
  "throttle_time_in_millis": 0
}

The above stat tells us that the average time taken for indexing is about 1 ms per doc.
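(Roughly: 414,117,588 ms of index time / 376,426,353 docs ≈ 1.1 ms per doc.)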

That's our use case. Perhaps we could try some other ways of achieving it.

Anyways, thanks for putting in the effort!


That's true, this is the total time spent indexing across all indexing threads.
