How to purge old documents, or is there a better way to use rollover for an index that is still being updated?

Hi guys,
We have a requirement to purge data older than 7 days. Our first thought was to use index rollover with time-based indices: create a new index, point the alias at it, and then delete the old index. The pain point is that all of our business functionality is asynchronous. When we create a new time-based index, some data in the old index may still be loaded and updated by business processes, so we cannot guarantee that the data in the previous index is static at the moment we create the new one. Has anyone had the same problem? Is there any way to handle this case?

Thanks in advance!

For example, we have the index workflow-20190701, and on the eighth day we create a new time-based index, workflow-20190708. Some business requests still need to read and update data in workflow-20190701, so while we are creating workflow-20190708 there is traffic that we cannot control. Does anyone have a good idea?

Hi all, any suggestions?

Welcome.

Please don't ping people who are not yet involved in your question.

@dadoonet Sorry about that, I have just removed that message.

I believe that your application needs to be aware of this purge policy.
I mean that your application should probably not send data in that case, as it is outside the 7-day retention policy.

My 2 cents.

Hi David,
Thanks for your quick reply. In our business case, an index being 7 days old does not mean that all the docs in it are older than 7 days. We also need to update docs that were indexed into that index recently (for example, docs created on the 7th day). How can we handle this case? Or is there a way to create a new index that includes only the docs that are not older than 7 days?

For example, workflow-20190701 includes docs created from 2019/07/01 to 2019/07/07. How can we create a new index that includes the docs created on 2019/07/07, since those latest docs may still be updated by an async call?
Thanks in advance!

I don't think I understand.

Could you illustrate that with some sample documents, requests?

We can't use time-based indices because our business processes are async: once we create a new index under the retention policy, some logic still needs to load and update data in the old index. That's the problem.
For example, say we configure rollover as daily. At midnight a new index, index-1, is created, and the old index, index-0, holds all the old data. Some docs in index-0 still need to be updated, but with one alias we can't write/update data in two indices. That's the major problem we have.

What I'm researching now is reindexing the active documents into a new index and then switching to that index. For example, every midnight we would reindex the docs that were active in the previous 5 hours into a new index, index-new, and then make it the active index for our business side. Any suggestions?
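A rough sketch of what the request body for that nightly reindex could look like, built in Python. It assumes the documents carry a last-modified timestamp; the field name `updated_at` and the helper name are hypothetical, and the resulting dict would be sent as the JSON body of `POST _reindex`.

```python
def active_docs_reindex_body(source_index, dest_index,
                             window="now-5h", field="updated_at"):
    """Build a _reindex body that copies only recently touched docs.

    `field` is a hypothetical last-modified timestamp on each document;
    `window` is an Elasticsearch date-math expression.
    """
    return {
        "source": {
            "index": source_index,
            # Only docs modified within the window are copied.
            "query": {"range": {field: {"gte": window}}},
        },
        "dest": {"index": dest_index},
    }


body = active_docs_reindex_body("index-0", "index-new")
```

Updates that arrive while the reindex is running would still land in the old index, so this approach alone does not remove the race described above.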

Why do you want to remove indices and data if the data could be updated in the future?

The data is huge, and each shard is over 300 GB. That's why we want to try a retention policy, or reindex into a new index, and give up the data older than 5 or 7 days: we can guarantee that data older than 5 or 7 days will not be loaded or updated. But it seems that with time-based indices or a retention policy we would lose all the data once the new index is created.

So if you delete data older than 7 days, would that work? I don't see the problem.

Yes, you're right, deleting data older than 7 days works. Our current solution is a batch job or delete-by-query task that purges data older than 7 days, but it is not efficient because of the huge number of documents. So we wonder whether there is a better option, such as creating a new index and simply deleting the old index, which is more efficient.

So create one index per day and after 7 days, drop the oldest index.
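As a minimal sketch of the "drop the oldest index" step, here is one way a cleanup job could pick the daily indices that have aged out. The `workflow-YYYYMMDD` naming comes from the example earlier in the thread; the function name and the choice to treat an index as expired once it is `keep_days` or more days old are assumptions. Each returned name would then be removed with a delete index request.

```python
from datetime import date, datetime


def expired_indices(names, today, keep_days=7, prefix="workflow-"):
    """Return the daily indices whose date is keep_days or more before today."""
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue  # not one of our daily indices
        try:
            day = datetime.strptime(name[len(prefix):], "%Y%m%d").date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if (today - day).days >= keep_days:
            expired.append(name)
    return sorted(expired)


names = ["workflow-20190701", "workflow-20190708", "other-20190101"]
to_drop = expired_indices(names, today=date(2019, 7, 9))
```

Dropping a whole index is much cheaper than a delete-by-query over the same documents, which is the point of using one index per day.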

@dadoonet Hi David, as I said in the beginning, suppose we create one index per day: one index yesterday and a new one today. In some async cases, data from yesterday's index may still be loaded and updated. Once we create the new index, that traffic would go to the new index, because as far as I know one alias can only have one write index.

The document should go to a given index based on its data. Whenever the data is sent (it could be tomorrow), it should always go to the same index.

Thank you so much. I'm still not very clear on how the data goes to the same index if we create a new index per day. Here is a case: a record is stored in an RDB and is also indexed in ES in index-001. Today we create a new index, index-002. Then some user action updates the record in the RDB, and an event is triggered to update the indexed document. Since we already created index-002 today, the update would target index-002, but the document the event wants to update exists only in index-001. Or do you mean that once we create a new index we should just ignore the old one and create a new doc in the new index for each event?

Let's say that the current date is 18/07/2019.

I'm indexing this document:

{
  "date": "18/07/2019",
  "foo": "bar"
}

For this I'm doing:

POST test-2019-07-18/_doc
{
  "date": "18/07/2019",
  "foo": "bar"
}

Let's say we are now on 19/07/2019. I have 2 documents to index:

{
  "date": "19/07/2019",
  "foo": "bar"
}
{
  "date": "18/07/2019",
  "foo": "bar"
}

Here is what I will run:

POST test-2019-07-19/_doc
{
  "date": "19/07/2019",
  "foo": "bar"
}
POST test-2019-07-18/_doc
{
  "date": "18/07/2019",
  "foo": "bar"
}
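The routing rule in the example above can be sketched on the application side as follows. This is only an illustration: the helper name is hypothetical, and it assumes every document carries a `date` field in the `dd/MM/yyyy` form used above and that the index name is the `test-` prefix plus that date.

```python
from datetime import datetime


def index_for(doc, prefix="test"):
    """Derive the target index from the document's own date, not from today's date."""
    day = datetime.strptime(doc["date"], "%d/%m/%Y")
    return f"{prefix}-{day:%Y-%m-%d}"


# A late update sent on 19/07/2019 for a document dated 18/07/2019
# still resolves to the index for the 18th.
late_doc = {"date": "18/07/2019", "foo": "bar"}
target = index_for(late_doc)
```

Because the index name depends only on the document, the writer does not need an alias with a single write index: new documents and late updates are each sent to the index that matches their own date.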

Hmm, got your point David, thank you for your patience :+1:.
That means our business data needs a field, or something else, that indicates which index a doc should be indexed into. I think that's also a good option for us, thank you again.
So now we have two options: one is to reindex the active docs into a new time-based index every day and use the new index; the other is what you described above. I will take some time to investigate and test both.
