Deletion of existing index data when threshold reached

Hi Expert,

We are looking to support a customer use case where customers buy a license based on the GB of data they use, the number of records they store, or the number of days the data is retained.

As per the license, they can either buy more storage or delete the oldest data from their existing storage.

Now we have 3 strategies to implement it.

Strategy 1:
Each customer has their own dedicated index. When the threshold is reached, bulk-delete their oldest records to reclaim the space for them.

But in this case the space cannot be reclaimed immediately, because deletes in Elasticsearch are soft deletes.
So we have the option to run the following API:
POST /<index>/_forcemerge?only_expunge_deletes=true

But as per the Elasticsearch documentation, it is not good practice to call the force merge API frequently.
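
As a rough sketch of this strategy (the index name my-customer-index, the @timestamp field, and the 30-day cutoff are only placeholders, not something we have decided on), the sequence would be a delete-by-query followed by the force merge:

POST /my-customer-index/_delete_by_query
{
  "query": {
    "range": { "@timestamp": { "lt": "now-30d" } }
  }
}

POST /my-customer-index/_forcemerge?only_expunge_deletes=true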

Strategy 2:

Let's take an example: a customer bought 40 GB of space.

Roll over the index after every 10 GB of index data. Once the total index data reaches 40 GB, delete the oldest index.
The downside of this approach is that there will be many rolled-over indices, and our cluster can suffer from a too-many-shards issue.
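
As an illustration (the policy name and the 10 GB condition below are just placeholders for this sketch), the rollover part could be expressed with an ILM policy; the "delete the oldest index once the total hits 40 GB" part would still have to be a custom task:

PUT _ilm/policy/customer-rollover-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "10gb" }
        }
      }
    }
  }
}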

Strategy 3:

Let's take an example: a customer bought 40 GB of space.

Here, once the used space reaches 40 GB, delete the oldest data as per the conditions, reindex the existing index into a new index, and drop the old index. In this case the space is reclaimed immediately.
The downside of this approach is that we add unnecessary processing to our cluster by reindexing the data.
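
A minimal sketch of this strategy, assuming customer-v1/customer-v2 index names, an @timestamp field, and a 30-day retention condition (all placeholders): reindex only the data we keep, then drop the old index.

POST /_reindex
{
  "source": {
    "index": "customer-v1",
    "query": {
      "range": { "@timestamp": { "gte": "now-30d" } }
    }
  },
  "dest": { "index": "customer-v2" }
}

DELETE /customer-v1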

Please suggest which is the better way to go forward. This is a very common use case that people may have seen. Please share your experience.

What license are you talking about?

If you have a product that uses Elasticsearch and you provide access to Elasticsearch and Kibana to your users, you may need to check if you are allowed to do that according to the Elastic License.

Strategy 2:

Let's take an example: a customer bought 40 GB of space.

Roll over the index after every 10 GB of index data. Once the total index data reaches 40 GB, delete the oldest index.
The downside of this approach is that there will be many rolled-over indices, and our cluster can suffer from a too-many-shards issue.

Using a rolling index for this would be the perfect solution.
You roll over to a new concrete index every 10 GB. You can run the rollover API, say, every hour.
And have a task that keeps only the 4 or 5 most recent concrete indices and deletes the older ones.
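
A rough sketch of what that hourly task could call (customer-logs is an assumed write alias and the index names are only examples): check the rollover condition, list the concrete indices, and delete the oldest ones beyond the 4 or 5 you keep.

POST /customer-logs/_rollover
{
  "conditions": { "max_size": "10gb" }
}

GET /_cat/indices/customer-logs-*?v&s=index&h=index,store.size

DELETE /customer-logs-000001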

I am talking here about the product we have developed, which uses Elasticsearch as its underlying data store. So here we are talking about our own license, under which we will be providing customers the option to delete the data.

Oh, I see. Do your clients have access to Elasticsearch or Kibana? From your question it seems that they have direct access. If I'm not wrong, this is not allowed by the license; it is worth checking.

Anyway, there is no single best strategy; you already listed some of the pros and cons of each one, and all of them can have some performance impact on your cluster.

The main issue is that rollover was not built to enforce a limit like the one in your use cases.

For example, if your client buys a licence for 1 million documents, you would use a rollover by number of documents, but at which number would you roll over the index? If you roll over at 1 million, your client will lose access to all the previous documents at once when the old index is deleted, and I'm not sure that is what they would expect. So you would need to roll over in small increments, maybe 10,000 documents, and when the threshold is reached your tool will need to delete the old index, as this is not done by Elasticsearch.

The same thing applies to rollover per GB: you would need to roll over in small increments and delete the oldest index. For example, roll over every 5 GB, and when the threshold is hit, the oldest 5 GB is deleted. This is also not done automatically; you will need to build it yourself in your tool.
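
To illustrate the small-increment idea (the alias name and the numbers are only examples), the rollover call itself can check either condition; deleting the oldest index when the quota is hit is the part your tool has to implement:

POST /customer-docs/_rollover
{
  "conditions": {
    "max_docs": 10000,
    "max_size": "5gb"
  }
}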

The easiest way is to roll over by day, as the ILM delete phase can automatically delete indices based on age.
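
For example, a daily rollover with an age-based delete phase could look like this (the policy name and the 30-day retention are just assumptions for the sketch):

PUT _ilm/policy/daily-retention-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}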

These use cases will not be easy to manage, they may impact performance, and you will need to build a couple of things to deal with them.

We use rollover for a similar use case and it works well.
The only downside is that you'll need to store more than the requirement.
If you roll over every 1M documents, just keep 2 indices (2M documents) and delete the rest.
Or waste less with a 100K rollover and keep 11 indices, etc.
Keeping more than what you promised the customer is often not an issue.
Trading storage for a complicated (super accurate) algorithm is overkill IMO.

KISS.
