Selective retention for an index

We have a requirement in Elasticsearch where, for a single index, we need to store certain fields for one month and others for three months.

Since there is no direct selective field retention with ILM, we aim to achieve this by using the clone option.

The plan is to send specific fields to index_one, which has a longer ILM policy, and the remaining fields to index_two, which has a shorter ILM policy. Both indices can then be viewed collectively under index* in Kibana.
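Roughly what we have in mind is two ILM policies that differ only in the delete age, one per index (the policy names, ages, and rollover settings below are illustrative, not our real config):

```
PUT _ilm/policy/retain_three_months
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "1d" } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

PUT _ilm/policy/retain_one_month
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "1d" } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}
```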

Is this a good approach? Additionally, how resource-intensive is the cloning process when dealing with large datasets?

In parallel, we also tried custom scripts that run against the index and delete the fields, but this puts additional load on our Elastic cluster, since every matching document has to be parsed and rewritten for the field deletion, so we had to rule this out.
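For reference, this is roughly the kind of operation we tried (the field name and cutoff are placeholders); _update_by_query has to rewrite every matching document, which is where the extra load comes from:

```
POST index_one/_update_by_query
{
  "query": {
    "range": { "@timestamp": { "lt": "now-30d" } }
  },
  "script": {
    "source": "ctx._source.remove('large_payload_field')"
  }
}
```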

We are also exploring transform indices.

Is there any other way to have selective retention for fields in a single index to save storage?

I'm wondering if you can use field level security to change the visibility of a field for some users after a month.

Not sure, but it may be a trick that would avoid expensive reindex operations.
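A minimal sketch of such a role, assuming a made-up role name, index pattern, and field list (note this only hides fields at query time, it would not reclaim any storage):

```
POST /_security/role/reader_after_one_month
{
  "indices": [
    {
      "names": [ "index*" ],
      "privileges": [ "read" ],
      "field_security": {
        "grant": [ "@timestamp", "long_lived_*" ]
      }
    }
  ]
}
```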

Although you can send specific fields to different indices, be aware that Elasticsearch does not support joins, so you will not be able to run queries across all fields.

Deleting fields will require reindexing, which can be expensive. An option to save some space and avoid expensive reindexing might be to simply have two series of time-based indices. Into the first you send the documents with all the fields and retain these for just one month. Into a separate series you send the stripped-down documents and keep these around for three months. This means that you duplicate storage for the reduced set of fields for one month out of three, but it saves a lot of potentially expensive processing and is simple and low risk.
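As a sketch, the stripped-down copy could be produced on the Elasticsearch side with an ingest pipeline using a remove processor attached to the long-retention index (pipeline and field names here are just examples):

```
PUT _ingest/pipeline/strip_short_lived_fields
{
  "processors": [
    {
      "remove": {
        "field": [ "large_payload_field", "debug_trace" ],
        "ignore_missing": true
      }
    }
  ]
}
```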


Very good idea @Christian_Dahlqvist! :wink:
BTW joins are coming in 8.18 with ES|QL.
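The syntax should look roughly like this (index and field names made up):

```
FROM index_one
| LOOKUP JOIN index_two_lookup ON request_id
| KEEP @timestamp, request_id, long_lived_field, short_lived_field
```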


How do you replicate the same data to the indices?

As I described above, my idea is to use clone and duplicate the data.

You index the data twice, e.g. using Logstash with a clone filter and dual outputs.
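A minimal sketch of such a pipeline, with made-up index and field names (depending on the clone filter's ECS compatibility setting, the copy is marked via [type] or via [tags]):

```
filter {
  # duplicate every event; the copy is marked "stripped"
  clone {
    clones => ["stripped"]
  }
  # remove the short-lived fields from the copy only
  if [type] == "stripped" or "stripped" in [tags] {
    mutate {
      remove_field => ["large_payload_field", "debug_trace"]
    }
  }
}

output {
  if [type] == "stripped" or "stripped" in [tags] {
    # stripped copy goes to the long-retention index
    elasticsearch { index => "index_one" }
  } else {
    # full document goes to the short-retention index
    elasticsearch { index => "index_two" }
  }
}
```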

Is that general joins, where you can join multiple large indices, or more like a lookup, where you can join a large index against a limited-size data set?

My data is not always an array, so I cannot apply a split filter and index into two different outputs.

I meant to write clone filter but somehow got it wrong. I have fixed my previous post and provided a link to the appropriate section in the docs.
