Hi everyone,
I realise this is a fairly common part of the learning curve when implementing ES; I just can't find quite the right information to ease my mind about our indexing strategy. Hoping to get some guidance here.
We have a fairly standard indexing strategy: a ticket alias pointing at rolled-over indices:
ticket-000001
ticket-000002
ticket-000003
We have been syncing these 'tickets' with a pub/sub pattern, and it has been working pretty well.
However, we ran a sync script to backfill data from 2015 onwards, and we've ended up with 2022 data in ticket-000001, and 2015/16/17 data mixed in with 2022 data in the subsequent rolled-over indices.
It's a pretty small data set - approximately 60 million documents. They are created and then updated when they expire 8-12 hours later.
To fix these duplicate issues I thought it best to revise our indexing strategy, and I'm thinking of going with a yearly index for 2015-2021 and monthly indices thereafter, as recent data is more commonly searched. The application that indexes the data into ES will then use a monthly/yearly interval strategy to decide which index to put each document into.
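To make the interval strategy concrete, here's a minimal sketch of the index-name selection I have in mind. The cutover date and naming patterns are illustrative, not settled:

```javascript
// Sketch of the monthly/yearly interval strategy described above.
// The cutover date and index-name patterns are assumptions for illustration.
const MONTHLY_FROM = new Date('2022-01-01T00:00:00Z');

function targetIndex(createdISO) {
  const created = new Date(createdISO);
  const year = created.getUTCFullYear();
  if (created < MONTHLY_FROM) {
    return `ticket-${year}`; // yearly index for 2015-2021
  }
  const month = String(created.getUTCMonth() + 1).padStart(2, '0');
  return `ticket-${year}-${month}`; // monthly index thereafter
}

console.log(targetIndex('2017-06-03T10:00:00Z')); // ticket-2017
console.log(targetIndex('2022-08-15T10:00:00Z')); // ticket-2022-08
```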
I'm using pipelines like the ones below to index into monthly or yearly indices, but I'm concerned about how scalable that is, so I'd like to use rollovers within those months. I believe that will introduce issues with updating data across rolled-over indices (i.e. a document is created in ticket-2022-08-01-1, the index rolls over, then an update comes through and goes into ticket-2022-08-01-2, leaving a duplicate).
To remedy this, as far as I can see I'd have to make a query for each updated document to figure out which index it resides in, and I'm not keen on adding extra network calls like that.
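For what it's worth, the lookup-then-update round trip I'm trying to avoid would look roughly like this with the JS client (alias and field names illustrative):

```javascript
// Sketch of the per-document lookup described above: search the alias to find
// which concrete backing index holds the document, then update it in place
// so the update doesn't land in the current write index as a duplicate.
async function updateAcrossRollovers(client, alias, id, partialDoc) {
  // 1. Resolve the concrete backing index via an ids query on the alias.
  const { body } = await client.search({
    index: alias,
    size: 1,
    _source: false,
    body: { query: { ids: { values: [id] } } },
  });
  const hit = body.hits.hits[0];
  if (!hit) return null; // not indexed yet - caller could fall back to an upsert

  // 2. Update in that exact index.
  return client.update({
    index: hit._index,
    id,
    body: { doc: partialDoc },
  });
}
```

That's one extra search per update, which is the network overhead I'd rather not add.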
PUT _ingest/pipeline/yearlyindex
{
  "description": "yearly date-time index naming",
  "processors": [
    {
      "date_index_name": {
        "field": "created",
        "index_name_prefix": "ticket-",
        "date_rounding": "y"
      }
    }
  ]
}
PUT _ingest/pipeline/monthlyindex
{
  "description": "monthly date-time index naming",
  "processors": [
    {
      "date_index_name": {
        "field": "created",
        "index_name_prefix": "ticket-",
        "date_rounding": "M"
      }
    }
  ]
}
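For context, indexing through one of these pipelines looks like this (document body illustrative) - the date_index_name processor rewrites the target index, so with monthly rounding and the default yyyy-MM-dd name format this should land in ticket-2022-08-01:

```
POST ticket/_doc?pipeline=monthlyindex
{
  "created": "2022-08-25T12:00:00Z"
}
```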
FYI, we are using the bulk API with update/doc_as_upsert:
const body = batch.flatMap((event) => {
  const parsedDoc = transformHandler(event.data);
  // Pair each update action line with its doc_as_upsert payload.
  return [
    { update: { _index: CoreAliasWithPrefix[alias], _id: parsedDoc[key] } },
    { doc: parsedDoc, doc_as_upsert: true },
  ];
});
const { body: bulkResponse } = await client.bulk({ refresh: false, body });
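One thing I've learned while debugging this: the bulk API returns HTTP 200 even when individual operations fail, so the response's `errors` flag and per-item results are worth checking. A small helper I've been using against the update-style bulk body above (shape assumes the v7 client response):

```javascript
// Collect per-item failures from a bulk response. The bulk call itself
// resolves successfully even when some items fail, so this has to be
// checked explicitly after each batch.
function collectBulkFailures(bulkResponse) {
  if (!bulkResponse.errors) return [];
  return bulkResponse.items
    .filter((item) => item.update && item.update.error)
    .map((item) => ({ id: item.update._id, reason: item.update.error.reason }));
}
```

Calling this after the `client.bulk` call lets us retry or dead-letter just the failed events.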
Questions:
- Does it matter if documents from vastly different time frames are mixed together in the same index?
- Are time-based indices the best option here?
- Could I use monthly aliases rather than one over-arching alias for all docs?
- Is there a nice way to use a lifecycle policy with rollover for monthly indices, or should I abandon the alias API and just keep an eye on index size, changing the date-math rounding when the monthly indices start getting too big? (We're a long way from that happening at this stage.)
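For that last question, this is the sort of ILM policy I have in mind - thresholds are illustrative, not a recommendation:

```
PUT _ilm/policy/ticket-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}
```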
Any guidance on this would be great.