Split index into subindexes by dates like 'my-index-yyyy.MM.dd'

tyro_plotter · July 26, 2023, 4:12pm

I am considering two options:

Time Series
Ingest pipeline (Date index name processor)

I am aware that they differ in their intended use, but I am trying to understand what risks there are.

As a newcomer to time series, I have a question about their restrictions. What problems can arise when using TS if the documents are not typical metrics but hierarchical (non-flat) structures? They mainly contain text and geolocations. Intensive searches and calculations between points will be run against them, and the number of documents will be quite large.

The alternative is to use the date_index_name pipeline processor, but it seems to run a bit slower when indexing new documents.

PUT _ingest/pipeline/mypipeline
{
  "description": "daily date-time index naming",
  "processors" : [
    {
      "date_index_name" : {
        "field" : "date1",
        "index_name_prefix" : "my-index-",
        "date_rounding" : "d",
        "index_name_format": "yyyy.MM.dd",
        "date_formats": ["yyyy-MM-dd"]
      }
    }
  ]
}

When testing with volumes of over 1,000 or 10,000 documents, the indexing time can increase by more than 25%. I don't know whether this is due to the cluster architecture or to the pipeline itself.
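
One way to narrow this down is to take the pipeline out of the picture for a test run: compute the target index name on the client, index directly into it, and compare timings. The sketch below mirrors the naming that the pipeline above is configured for (prefix `my-index-`, day rounding, `yyyy.MM.dd`). It is written in Python only for illustration, the function name `daily_index_name` is hypothetical, and it assumes `date1` always arrives as `yyyy-MM-dd`.

```python
from datetime import datetime


def daily_index_name(date1: str, prefix: str = "my-index-") -> str:
    """Return the daily index name for a document's date1 value."""
    # Parse with the same pattern the pipeline lists in date_formats (yyyy-MM-dd).
    day = datetime.strptime(date1, "%Y-%m-%d").date()
    # A date-only value is already at day granularity, so no extra rounding is needed.
    return f"{prefix}{day:%Y.%m.%d}"


print(daily_index_name("2023-07-26"))  # my-index-2023.07.26
```

If indexing with a precomputed name shows the same slowdown, the pipeline is probably not the cause.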

Could you please share your opinions?

leandrojmp · July 26, 2023, 5:18pm

If by Time Series you mean the Time Series Data Stream (TSDS), then you need to pay attention to two things:

It is a data stream, and data streams use backing indices whose names by default contain the date the backing index was created or rolled over, not the daily date of the documents.
It is recommended only if you have metrics; if you have logs as well, you should use regular data streams or regular indices.

Any specific reason to use daily indices?

I've been using the Elastic Stack for 7+ years and have used daily indices all this time, but I'm moving away from daily indices for the majority of my use cases and switching to data streams with rollovers.

I would no longer recommend daily indices to anyone starting with the Elastic Stack. Just use a data stream with rollover and let Elasticsearch manage the backing indices; this makes managing the cluster and the indices much easier.

The main reason is that the recommendation is to aim for shard sizes of around 50 GB, and achieving that with daily indices across many different indices can be pretty hard.
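
To make the sizing argument concrete, here is a rough back-of-the-envelope comparison in Python. The ingest volume (2 GB per day), the one-year retention, and the single primary shard per daily index are made-up figures for illustration, and replicas are ignored; only the roughly 50 GB target comes from the advice above.

```python
import math


def primary_shard_counts(daily_gb, days, target_shard_gb=50.0, primaries_per_daily_index=1):
    """Return (primary shards with daily indices, primary shards with size-based rollover)."""
    # Daily indices: one index per day, regardless of how little data it holds.
    daily = days * primaries_per_daily_index
    # Rollover: a new backing index only once a primary shard reaches the target size.
    rollover = math.ceil(daily_gb * days / target_shard_gb)
    return daily, rollover


print(primary_shard_counts(daily_gb=2, days=365))  # (365, 15)
```

With these hypothetical numbers, daily indices leave 365 primary shards of about 2 GB each, while size-based rollover ends up with about 15.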

tyro_plotter · July 27, 2023, 5:00am

Thanks for the quick answer.

One of the requirements is that the index (as a time period) be included in the filtering.

GET /index-2023.07.21,index-2023.07.22,index-2023.07.23/_search?ignore_unavailable=true
{
    "query": {
        yourquery
    }
}
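
If the daily-index layout is kept, the comma-separated index list in a request like the one above has to be built by the client for each period. A minimal sketch in Python, assuming the `index-yyyy.MM.dd` naming from the example; `daily_indices` is a hypothetical helper, not part of any Elasticsearch client.

```python
from datetime import date, timedelta


def daily_indices(start: date, end: date, prefix: str = "index-") -> str:
    """Return a comma-separated list of daily index names for start..end inclusive."""
    names = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        names.append(f"{prefix}{day:%Y.%m.%d}")
    return ",".join(names)


print(daily_indices(date(2023, 7, 21), date(2023, 7, 23)))
# index-2023.07.21,index-2023.07.22,index-2023.07.23
```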

The concern is that with hundreds of millions of documents, the search will be very slow.
Analyses (including geo queries) will be performed over arbitrary periods in real time. (It's hard to get out of the SQL partitioning mindset ;))

leandrojmp · July 28, 2023, 1:01pm

Search in Elasticsearch has been greatly optimized in recent versions. Check this post, in particular the part about "Reducing shard requests in the pre-filter phase"; it gives more detail on how the search process works.

Basically Elasticsearch knows where your data is.
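
In practice this means the period can be expressed as a filter on the timestamp field rather than as a list of index names. Below is a minimal Python sketch of such a request body. It assumes the documents carry a `@timestamp` field (data streams require one); `period_filter_body` is an illustrative helper, and the inner query stands in for whatever the application already uses.

```python
def period_filter_body(query: dict, start: str, end: str, field: str = "@timestamp") -> dict:
    """Wrap an existing query in a bool query with a half-open time range filter."""
    return {
        "query": {
            "bool": {
                "must": [query],
                # Filter context: not scored; a date range here is what allows
                # shards holding no data for the period to be skipped early.
                "filter": [{"range": {field: {"gte": start, "lt": end}}}],
            }
        }
    }


body = period_filter_body({"match_all": {}}, "2023-07-21", "2023-07-24")
```

The resulting body would be sent to the `_search` endpoint of the data stream or index pattern, with no per-day index names in the URL.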

Every shard has an overhead, and using daily indices can greatly increase the number of shards in the cluster. You should try to follow the recommendations on how to size your shards; these are easy to follow with data streams and rollovers, but take a lot of work if you choose daily indices.
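
As a rough illustration of the rollover approach, the Python sketch below builds the two request bodies involved: an ILM policy that rolls over on primary shard size, and an index template that makes matching names data streams. The names `my-policy` and `my-data-*` and the `30d` age limit are placeholders, not values from this thread. The bodies would be sent to `PUT _ilm/policy/my-policy` and to `PUT _index_template/<template-name>` respectively.

```python
import json

POLICY_NAME = "my-policy"  # hypothetical policy name

ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over when a primary shard reaches 50 GB or the index is 30 days old.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "30d"}
                }
            }
        }
    }
}

index_template = {
    "index_patterns": ["my-data-*"],  # hypothetical pattern
    "data_stream": {},  # matching names are created as data streams
    "template": {"settings": {"index.lifecycle.name": POLICY_NAME}},
}

print(json.dumps(ilm_policy, indent=2))
print(json.dumps(index_template, indent=2))
```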

tyro_plotter · July 31, 2023, 11:49am

Thank you very much for the competent answer.

This will make me rethink some of the requirements. I have one more question for you that still bothers me.

Our ELK version is 8.x, and we currently use NEST 7.17 for access (we haven't migrated to Elastic.Clients.Elasticsearch yet). What impact can this have on performance?

system · August 28, 2023, 11:49am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
