Duplicates between rolled-over indices

Hi everyone,

I realise this is a fairly common part of the learning curve when implementing ES, but I just can't find quite the right info to ease my mind about our indexing strategy. Hoping to get some guidance here.

We have a fairly standard indexing strategy, like so:

ticket alias;
  ticket-000001
  ticket-000002
  ticket-000003

We have been syncing these 'tickets' with a pub/sub pattern and it has been working pretty well.
However, we ran a sync script to backfill data from 2015 onwards, and we've ended up with 2022 data in ticket-000001 and 2015/2016/2017 data mixed up with 2022 data in the subsequent rolled-over indices.
It's a pretty small data set, approximately 60 million documents. They are created and then updated when they expire 8-12 hours later.

To fix these duplicate issues I thought it best to revise our indexing strategy. I'm thinking of going with a yearly index for 2015-2021 and monthly indices thereafter, as the recent data is more commonly searched. The application that indexes the data to ES will then use a monthly/yearly interval strategy to decide which index each document goes into, as sketched below.
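Concretely, I'm picturing the application picking an ingest pipeline per document based on the created timestamp, something like this (just a sketch; resolvePipeline is a hypothetical helper, not existing code):

    // Route backfilled 2015-2021 docs through the yearly pipeline,
    // everything newer through the monthly one.
    const resolvePipeline = (created) =>
      new Date(created).getUTCFullYear() <= 2021 ? 'yearlyindex' : 'monthlyindex';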

I'm utilising pipelines like the ones below to index into monthly or yearly indices, but I am concerned about how scalable that is, so I would like to use rollovers within these months. However, I believe this will introduce issues with updating data across rolled-over indices (i.e. a document is created in ticket-2022-08-01-1, the index rolls over, then an update comes through and goes into ticket-2022-08-01-2).
To remedy this, as far as I can see I'd have to make a query for each updated document to figure out which index it resides in, and I'm not too keen on adding extra network calls like that.

PUT _ingest/pipeline/yearlyindex
{
  "description": "yearly date-time index naming",
  "processors" : [
    {
      "date_index_name" : {
        "field" : "created",
        "index_name_prefix" : "ticket-",
        "date_rounding" : "y"
      }
    }
  ]
}


PUT _ingest/pipeline/monthlyindex
{
  "description": "monthly date-time index naming",
  "processors" : [
    {
      "date_index_name" : {
        "field" : "created",
        "index_name_prefix" : "ticket-",
        "date_rounding" : "M"
      }
    }
  ]
}
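
For context, a document indexed through one of these pipelines ends up in an index derived from its created field rather than the index named in the request. If I understand the date_index_name processor correctly, something like this (hypothetical document) lands in ticket-2022-08-01 with the default yyyy-MM-dd format:

PUT ticket/_doc/abc123?pipeline=monthlyindex
{
  "created": "2022-08-14T08:00:00Z"
}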

FYI, we are using the bulk API with update and doc_as_upsert.

    const body = batch.flatMap((event) => {
      const parsedDoc = transformHandler(event.data);
      // Each document becomes an action/payload pair: update by id against
      // the write alias, creating the document if it does not exist yet.
      return [
        { update: { _index: CoreAliasWithPrefix[alias], _id: parsedDoc[key] } },
        { doc: parsedDoc, doc_as_upsert: true },
      ];
    });

    const { body: bulkResponse } = await client.bulk({ refresh: false, body });
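
One thing we watch out for: the bulk API does not throw on per-document failures, so we inspect the response (a sketch, assuming the client response shape used above):

    if (bulkResponse.errors) {
      // Each items entry mirrors its action type, here "update".
      const failures = bulkResponse.items.filter((item) => item.update && item.update.error);
      console.warn(`${failures.length} upserts failed`, failures.slice(0, 3));
    }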

Questions:

  • Does it matter if documents from vastly different time frames are mixed up in the same index?
  • Are time-based indices the best option here?
  • Could I utilise monthly aliases rather than one over-arching alias for all docs?
  • Is there a nice way to use a lifecycle policy with rollover for monthly indices, or should I abandon the alias API, keep an eye on index size, and change the date-math index rounding when the monthly indices start getting too big? (We're a long way from that happening at this stage.)

Any guidance on this would be great.

Time-based indices, whether through index naming or rollover, are generally used for time series data in order to make retention management efficient and to spread the data across multiple indices over time, thereby limiting index size. When you are working with very large volumes of immutable time series data where volume can vary over time it can be very hard to control the shard size, and this is the problem rollover solves. It allows you to cut new indices based on size and/or age, and can be used to get a much more even shard size if you are willing to be flexible about the time period each underlying index covers and do not perform updates. Backfills also pose a potential problem, as all new data goes into the latest index, which affects retention of data.
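
To illustrate, a rollover-driven ILM policy typically looks something like this (the thresholds here are purely illustrative):

PUT _ilm/policy/ticket-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

A new backing index is cut whenever either threshold is reached, which is what keeps shard sizes even, but it is also why a given index no longer corresponds to a fixed time period.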

If your data does not fit the described criteria it is quite possible that rollover and/or data streams are not the right solution for your use case.

Elasticsearch can hold a lot of data in a single index, and shard sizes up to tens of GB do generally not result in performance problems when querying. Nowadays you also have additional tools to fix issues related to shard size, e.g. the split index API, which allows other indexing strategies that can adapt without having to reindex your data.
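
For example, splitting an existing index into more primary shards takes just a couple of requests (the source index must be made read-only first, and the target shard count must be a multiple of the source's):

PUT ticket-000001/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

POST ticket-000001/_split/ticket-000001-split
{
  "settings": {
    "index.number_of_shards": 2
  }
}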

Based on this, let me answer your questions in turn.

On mixing time frames in one index: although it can have some impact on querying, the main issue is that it affects retention management. As the oldest indices are deleted you will still have older data in the cluster for quite some time.

On whether time-based indices are the best option: you mention that you update data at some point after ingestion. This in my mind makes rollover a poor fit, as you may end up inserting a document into one index and then trying to update it in another because a rollover has occurred in between.

If you know the timestamp of the document you are updating, you can use time-based indices where each index covers a specific time period based on its name. Whether this is necessary or not depends on your current and projected data volumes. A single shard can hold up to 2 billion documents, and it is recommended to aim for a size of at least a few GB. As an index can have multiple primary shards, a single index can hold a lot of data. You mention currently having only around 60 million documents; if these are not huge, I suspect this easily fits into a single shard. Time-based indices may not be required at all in your use case, which would make updating easier but deletes more expensive.
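
For example, if the created timestamp is available on the update event, the application can derive the index name and target it directly in the bulk request, with no lookup query (the index name and id below are hypothetical):

POST _bulk
{ "update": { "_index": "ticket-2022-08-01", "_id": "abc123" } }
{ "doc": { "status": "expired" }, "doc_as_upsert": true }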

On monthly aliases: unless the yearly shard size exceeds tens of GB I do not see the point.
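
If you want to check actual shard sizes, the cat shards API shows them at a glance:

GET _cat/shards/ticket-*?v&h=index,shard,prirep,store&s=store:desc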

As for lifecycle policies: it does not sound like rollover is a good fit for your use case.
