Split index into subindexes by dates like 'my-index-yyyy.MM.dd'

I am considering two options:

  • Time Series
  • Ingest pipeline (Date index name processor)

I am aware that they differ in their intended use, but I am trying to understand what risks there are.

As a newbie to time series, I have a question about the restrictions on using them. What problems could arise from using TS if the documents are not typical metrics but hierarchical (non-flat) structures? They mainly contain text and geolocations. Intensive searches and distance calculations between points will be performed, and the number of documents will be quite large.

The alternative is to use the date_index_name pipeline processor, but it seems to run a bit slower when indexing new documents.

PUT _ingest/pipeline/mypipeline
{
  "description": "daily date-time index naming",
  "processors" : [
    {
      "date_index_name" : {
        "field" : "date1",
        "index_name_prefix" : "my-index-",
        "date_rounding" : "d",
        "index_name_format": "yyyy.MM.dd",
        "date_formats": ["yyyy-MM-dd"]
      }
    }
  ]
}
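
The routing can be checked without indexing anything by running the pipeline through the simulate API (the date1 value here is just a sample):

POST _ingest/pipeline/mypipeline/_simulate
{
  "docs": [
    { "_source": { "date1": "2023-07-21" } }
  ]
}

The _index value in the response shows the date-math expression that resolves to my-index-2023.07.21.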

When testing with volumes over 1,000 or 10,000 documents, indexing time can increase by over 25%. I don't know if this is due to the cluster architecture or to the pipeline itself.

Could you please share your opinions?

If by Time Series you mean a Time Series Data Stream (TSDS), then you need to pay attention to two things:

  • It is a data stream, and data streams use backing indices whose names by default carry the date the backing index was created or rolled over, not the daily date of the documents.
  • It is recommended only if you have metrics; if you have logs as well, you should use normal data streams or normal indices. (A minimal TSDS template sketch follows below.)
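
For context, a TSDS is defined through an index template with index.mode set to time_series, and its mappings must declare dimension (and usually metric) fields up front; this is only a sketch, with hypothetical names (my-metrics-template, host, cpu):

PUT _index_template/my-metrics-template
{
  "index_patterns": ["my-metrics-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["host"]
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "host": { "type": "keyword", "time_series_dimension": true },
        "cpu": { "type": "double", "time_series_metric": "gauge" }
      }
    }
  }
}

The need to model explicit dimensions and metrics is one reason a TSDS is an awkward fit for text-heavy, hierarchical documents.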

Any specific reason to use daily indices?

I've been using the Elastic Stack for 7+ years and have used daily indices all that time, but I'm now moving away from daily indices for the majority of my use cases and migrating my indices to data streams with rollover.

I would not recommend that anyone starting with the Elastic Stack use daily indices anymore. Just use a data stream with rollover and let Elasticsearch manage the backing indices; this makes managing the cluster and its indices much easier.

The main reason is that it is recommended to aim for shard sizes of around 50 GB, and achieving this with daily indices across multiple different indices can be pretty hard.
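
As an illustration, this is roughly what that setup looks like; the policy and template names, the 30-day trigger, and the my-data-* pattern are all hypothetical:

PUT _ilm/policy/my-rollover-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}

PUT _index_template/my-data-template
{
  "index_patterns": ["my-data-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "my-rollover-policy"
    }
  }
}

You then always write to the data stream itself, and Elasticsearch creates and rolls over the backing indices for you.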

Thanks for the quick answer.

One of the requirements is to include the index (as a time period) in the filtering.

GET /index-2023.07.21,index-2023.07.22,index-2023.07.23/_search?ignore_unavailable=true
{
    "query": {
        yourquery
    }
}

The concern is that with hundreds of millions of documents, the search will be very slow.
Analyses (including geo queries) will be performed over random periods in real time. (It's hard to get out of the SQL partitioning mindset ;))

Search in Elasticsearch was greatly optimized in recent versions; check this post, particularly the part about reducing shard requests in the pre-filter phase, as it gives more details on how the search process works.

Basically Elasticsearch knows where your data is.
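
In practice this means you can search the whole data stream with a time range filter and let the pre-filter phase skip the backing indices that cannot match; my-data-stream and the dates here are placeholders:

GET my-data-stream/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "2023-07-21",
              "lt": "2023-07-24"
            }
          }
        }
      ]
    }
  }
}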

Every shard has an overhead, and using daily indices can greatly increase the number of shards in the cluster. You should try to follow the recommendations on how to size your shards; those recommendations are easy to meet using data streams and rollover, but it will take a lot of work if you choose to use daily indices.

Thank you very much for the competent answer.

This will make me rethink some of the requirements. I have one more question for you that still bothers me.

Because the Elastic Stack version is 8.x and we currently use NEST 7.17 for access (we haven't migrated to Elastic.Clients.Elasticsearch yet), what impact can this have on performance?
