Would the search performance be very slow if a data stream consists of too many or too large cold backing indices?

Hi, I'm moving from ES 6.x to 7.x, found the new data stream feature, and would like to try it.

But I'm a bit worried about search performance, because the docs explain: "When you submit a read request to a data stream, the stream routes the request to all its backing indices." (doc source)

Say we have a data stream named audit_log, consisting of hundreds of backing indices:

  • .ds.audit_log_2022.01_000001 HOT PHASE
  • .ds.audit_log_2021.12_000002 COLD PHASE
  • .ds.audit_log_2021.12_000001 COLD PHASE
    .....
  • .ds.audit_log_2021.01_000100 COLD PHASE

If we send a search request like this:

GET audit_log/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "now-1M/d"
          }
        }
      },
      "must": [
        {
          "match": {
            "auditor": "tom"
          }
        }
      ]
    }
  }
}
  1. Can this search request be routed to only the few backing indices that match, because of the @timestamp range clause?
  2. If the answer to the above is no, is there any alternative feature that could help us route to indices according to date fields by default?

I found a similar question: Elasticsearch search query on hot and warm nodes - Stack Overflow

Applied to data streams, that approach would become:

  1. Add the aliases ds_search_recent and ds_search_all to the data stream
  2. Remove the alias ds_search_recent when indices enter the warm (cold) phase

But currently there seems to be no automatic way to do this.

A related GitHub issue: Add ILM action to add/remove aliases · Issue #47881 · elastic/elasticsearch · GitHub

The bad news is that it has been open since 2019.

Another related PR: Use @timestamp field to route documents to a backing index of a data stream by martijnvg · Pull Request #82079 · elastic/elasticsearch · GitHub

It seems to be a new feature in ES 8.1, which has not been released yet.

  1. Can this search request be routed to only the few backing indices that match, because of the @timestamp range clause?

Technically no, but in practice ES achieves this goal anyway: it routes the search to every backing index but the ones that don't match the timestamp range get optimised into a MatchNoDocsQuery which obviously hits no data and takes no time to execute.
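
One way to observe this from the client side is to look at the `_shards` section of the search response, which reports a `skipped` count alongside `total`. The helper below is only a sketch: `summarize_shards` is a hypothetical name, and the sample response uses made-up numbers purely for illustration. Note also that, as far as I understand, `skipped` only reflects shards dropped in the pre-filter round trip, so a count of zero does not by itself mean the per-index rewrite did not happen.

```python
def summarize_shards(response: dict) -> str:
    """Summarize how many shards actually ran the query.

    `response` is assumed to be a search response already parsed into a dict.
    """
    shards = response.get("_shards", {})
    total = shards.get("total", 0)
    skipped = shards.get("skipped", 0)
    return f"{total - skipped} of {total} shards ran the query, {skipped} skipped"


# Made-up response body for illustration only, not real cluster output.
sample_response = {
    "_shards": {"total": 100, "successful": 100, "skipped": 97, "failed": 0}
}

print(summarize_shards(sample_response))
```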

Thank you David. I'm still curious: is this a feature of backing indices or of the filter cache?

I mean, would it be slow on the first search query, and slow whenever the time span misses the filter cache? In common time series searches, the time span changes frequently.

This happens when rewriting the query into its optimised form, long before the filter cache gets involved. It involves comparing two long timestamp values, which is almost instantaneous and doesn't involve any caching.
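
To make the comparison concrete, here is a minimal Python sketch of the idea, not the actual Elasticsearch code. It assumes each backing index knows the minimum and maximum `@timestamp` it contains; `IndexTimeBounds`, `can_match`, the index names, and the dates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


def millis(year: int, month: int, day: int) -> int:
    """Epoch milliseconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


@dataclass(frozen=True)
class IndexTimeBounds:
    """Hypothetical per-index metadata: smallest and largest @timestamp."""
    name: str
    min_ts: int
    max_ts: int


def can_match(index: IndexTimeBounds,
              gte: Optional[int] = None,
              lte: Optional[int] = None) -> bool:
    """Return False when the query range and the index's range are disjoint.

    A False result corresponds to the range clause being rewritten into a
    query that matches no documents for that index.
    """
    if gte is not None and index.max_ts < gte:
        return False  # everything in the index is older than the range
    if lte is not None and index.min_ts > lte:
        return False  # everything in the index is newer than the range
    return True


indices = [
    IndexTimeBounds("audit_log_2022.01", millis(2022, 1, 1), millis(2022, 1, 20)),
    IndexTimeBounds("audit_log_2021.12", millis(2021, 12, 1), millis(2021, 12, 31)),
    IndexTimeBounds("audit_log_2021.01", millis(2021, 1, 1), millis(2021, 1, 31)),
]

# Roughly what "now-1M/d" would resolve to if "now" were 2022-01-20.
gte = millis(2021, 12, 20)

for idx in indices:
    outcome = "runs the query" if can_match(idx, gte=gte) else "matches no docs"
    print(f"{idx.name}: {outcome}")
```

Each check is a couple of integer comparisons per index, which is why the cost stays negligible even when the time span changes on every request.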
