Best practices to delete / mark for delete specific documents in an index?

Hello, in my use case I use monthly indexes, which store a large number of documents.
When a user adds a certain kind of configuration, some of those documents need to be deleted.

I understand that it's not efficient to keep running delete-by-query, and I wanted to know if I have any other options:

  1. Mark for delete? Meaning, run a query and somehow mark the documents as "non-searchable/non-relevant" (they would then be deleted after a month).
  2. Move them to another index? (Which is basically the same as a delete, no?)

I cannot arrange the indexes in a way that lets me drop an entire index, as I cannot predict which documents will need to be filtered out.

Any ideas?

Hi @Ariel_B,

Marking documents for deletion is effectively what delete (including delete-by-query) already does: deleted documents aren't really removed until their segments are merged. However, updating a document (e.g. to mark it as unsearchable via some other mechanism) involves deleting the old copy of the document and then indexing the new copy of the document, which is more work than just deleting the document.

Yes, moving them to another index would also involve deleting the documents and then doing some extra work.

I will need to "bring them back to life" if a user decides to cancel their action.
I even thought about using document level security to change specific documents' security (to hide them), but it's a paid feature so I cannot use it.

Is there any other way to achieve this that would perform better?

Why don't you add a date field, for instance deleted, to your index mapping, which you update with the time the user decided to delete that document? Then you just have to filter out all documents with a non-zero "deleted" value so that they don't show up in the regular search results. If the user regrets the deletion you can just update the document with a zero value in the deleted field.

To actually delete the documents you can run a nightly delete-by-query job matching all the user's documents that have a deleted time older than 30 days, using a range query like so:

  {
      "query": {
          "range": {
              "deleted": {
                  "lt": "now-30d"
              }
          }
      }
  }
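
To make the soft-delete idea a bit more concrete, here is a minimal Python sketch that only builds the two request bodies (it does not talk to a cluster). It assumes that a live document simply has no value in the deleted field, rather than a zero, so that an exists clause can exclude the marked ones. The function names are made up for illustration.

```python
def visible_docs_query(user_query):
    """Wrap a query so that documents carrying a `deleted` timestamp are excluded."""
    return {
        "query": {
            "bool": {
                "must": [user_query],
                # Assumption: live documents have no `deleted` value at all.
                "must_not": [{"exists": {"field": "deleted"}}],
            }
        }
    }


def purge_query(retention="30d"):
    """Delete-by-query body matching docs that were soft-deleted more than `retention` ago."""
    return {"query": {"range": {"deleted": {"lt": f"now-{retention}"}}}}
```

The first body would be sent with the regular searches, the second to the index's _delete_by_query endpoint from the nightly job.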

Hope this gave you some ideas :)


Hmm, doesn't this mean that I would need to run an update for all documents that match the user's action, to set that "deleted" field? (If I do that, I'd rather just use a flag, "deleted=true/false".)

Isn't this highly inefficient, as Elasticsearch performs a delete + insert for each of the documents?

Whether you use a flag or a timestamp you still have to update one field in all the documents that the user marked for delete.

Since you have to update all the documents even with the Boolean flag solution, I don't think the timestamp solution will be any less efficient. The reason I proposed a timestamp is that it also solves your second requirement, deleting the "deleted" documents after 30 days. I don't see how you can solve this with just a Boolean flag.

Now, a different and certainly more efficient solution would be to index & delete rather than update. Let's say you have a recycle_bin index, then you could do the following for every document the user marks for delete:

  1. Read the document from the regular index before deleting it
  2. Index the document as a new document in recycle_bin
  3. Delete the document from the regular index
  4. After 30 days delete the document from recycle_bin

This way the user will have 30 days to regret the deletion; if they do, the document can be read from recycle_bin and indexed as a new document into the regular index. After 30 days the deleted document is gone forever.
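
As a rough sketch of steps 2 and 3, the index and the delete can be sent together in one bulk request. The helper below only builds the newline-delimited body; the index name "main" and the function name are assumptions for illustration, and the document source is expected to have been read beforehand (step 1).

```python
import json


def move_to_recycle_bin(doc_id, source, main_index="main", bin_index="recycle_bin"):
    """Build a _bulk body that copies a document into the recycle bin
    and deletes the original from the regular index."""
    lines = [
        {"index": {"_index": bin_index, "_id": doc_id}},  # action line
        source,                                           # document to index
        {"delete": {"_index": main_index, "_id": doc_id}},
    ]
    # The bulk API expects one JSON object per line and a trailing newline.
    return "\n".join(json.dumps(line) for line in lines) + "\n"
```

Keep in mind that a bulk request is not transactional, so the response should be checked to make sure the copy succeeded before relying on the delete.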

The 30 days applies to all documents in the index: the index itself is dropped after 30 days.
The user can "regret" the action for up to 1 hour after performing it.

What's the difference in performance between reading from index 1, inserting into a new index (recycle bin) and deleting the original docs, versus just updating the docs? (Which performs the same work, no? Delete + insert.)

Oh, I see.

But how do you measure one hour if you only have a Boolean flag? Do you store the time of the user's delete action outside Elasticsearch?

I'm sorry, I was a bit quick there to conclude it was more efficient to save the "deleted" document to a recycle_bin.

However, I still think this is a fairly clean solution as it will create fewer copies of the document in the regular index. Because the underlying Lucene segments, where the document is stored, are immutable, an update means that Elasticsearch first has to fetch the document, make the changes (the update) and store the new copy in a new segment (akin to an index operation) before marking the old version as deleted. So after the update you will have an invisible deleted copy plus the new copy of the document in the same index, which means that if you update many documents you may get many deleted docs in your index (until they are removed by segment merges).
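
One way to picture this is a toy in-memory model. This is plain Python, a deliberate simplification and not how Lucene is actually implemented: copies are only ever appended and marked as deleted, never changed in place, and the class and method names are invented for the illustration.

```python
class ToyIndex:
    """Append-only toy model of an index with delete markers."""

    def __init__(self):
        self.copies = []  # each entry: [doc_id, source, is_deleted]

    def index(self, doc_id, source):
        self.copies.append([doc_id, source, False])

    def delete(self, doc_id):
        # Mark the live copy as deleted; nothing is physically removed.
        for copy in self.copies:
            if copy[0] == doc_id and not copy[2]:
                copy[2] = True

    def update(self, doc_id, changes):
        # Fetch the live copy, apply the changes, then delete + index.
        current = next(c for c in self.copies if c[0] == doc_id and not c[2])
        merged = {**current[1], **changes}
        self.delete(doc_id)
        self.index(doc_id, merged)

    def stats(self):
        deleted = sum(1 for c in self.copies if c[2])
        return {"live": len(self.copies) - deleted, "deleted": deleted}
```

In this model an update leaves one deleted copy plus one live copy in the same index, while a plain delete leaves only the deleted copy.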

If you use a recycle_bin there will remain just one copy in the regular index, the one marked as deleted, while a fresh copy resides in the recycle_bin. You can make this fairly clean and efficient by creating daily recycle_bin indices, say recycle_bin.2019-08-05, recycle_bin.2019-08-06 etc, that you just DELETE after 24 hours instead of deleting single documents. Deleting an entire index is much faster and more efficient than deleting single documents. Plus it frees the disk space from the deleted documents immediately.
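
A small Python sketch of the naming and clean-up logic for such daily indices could look like this. The helper names are hypothetical, and the list of existing index names is assumed to come from the cluster by whatever means you prefer.

```python
from datetime import date, timedelta


def recycle_bin_index(day):
    """Name of the daily recycle bin index for a given date."""
    return f"recycle_bin.{day.isoformat()}"


def expired_indices(existing, today, keep_days=1):
    """Return the recycle_bin indices that are safe to drop as a whole."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in existing:
        prefix, _, suffix = name.partition(".")
        if prefix != "recycle_bin":
            continue
        try:
            day = date.fromisoformat(suffix)
        except ValueError:
            continue  # not a daily recycle bin index
        # Strictly older than the cutoff, so every document inside
        # has been in the bin for at least `keep_days` full days.
        if day < cutoff:
            expired.append(name)
    return expired
```

Each name returned by expired_indices would then be removed with a single index DELETE.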

I actually wanted to use an hourly index per user action (unless that's overkill). I'd just delete the hourly index after an hour, so my flow would be:

  1. User does an action
  2. Query the documents in the main index that relate to this action
  3. Move them to an hourly per-action index, and delete them from the main index
  4. If the user regrets, move them back and delete the action index
  5. After 1 hour, the entire per-action index is deleted
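
The flow above could be sketched with the reindex and delete-by-query APIs. The Python below only builds the endpoint and body pairs, without sending them; the index name "main", the "recycle_bin.action-" naming scheme and the function names are assumptions for illustration (index names have to be lowercase, so the action id is assumed to be lowercase too).

```python
def action_index(action_id):
    """Name of the per-action recycle bin index (hypothetical scheme)."""
    return f"recycle_bin.action-{action_id}"


def move_requests(action_id, query, main_index="main"):
    """Steps 2-3: copy the matching docs to the action index, then delete them."""
    return [
        ("POST /_reindex", {
            "source": {"index": main_index, "query": query},
            "dest": {"index": action_index(action_id)},
        }),
        (f"POST /{main_index}/_delete_by_query", {"query": query}),
    ]


def undo_requests(action_id, main_index="main"):
    """Step 4: copy the docs back, then drop the whole action index."""
    return [
        ("POST /_reindex", {
            "source": {"index": action_index(action_id)},
            "dest": {"index": main_index},
        }),
        (f"DELETE /{action_index(action_id)}", None),
    ]
```

Step 5 is then the same DELETE on the action index, issued once the hour has passed.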

Besides that, I have another complication: 2 separate actions can affect the same document, so if one user regrets, the document should not be restored, as the 2nd action also deleted it.
I can manage this by keeping track of the connections between those actions.

It gets complex, but the other option is to keep 1 recycle_bin index and save the connections to the actions on the documents themselves (meaning I'll have to do updates on those documents, and they will contain an "Actions" field). For example, the recycle bin could look something like this:
{
  "doc name": "1",
  "Actions": ["Action#1", "Action#2"]
}

{
  "doc name": "2",
  "Actions": ["Action#1"]
}

With this layout, every user UNDO has to find the documents containing the undone action and do the following:

  1. Remove the action from the "Actions" array; if it was the only one, bring the document back to life
  2. If other actions remain, keep the document in the recycle bin
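
A minimal in-memory sketch of these two rules in Python, where a dict keyed by document name stands in for the recycle bin index (the function name and the dict layout are just for illustration):

```python
def undo_action(recycle_bin, action):
    """Apply an UNDO: drop `action` from every document that lists it and
    return the names of the documents that should be restored."""
    restored = []
    for name, doc in list(recycle_bin.items()):
        if action not in doc["Actions"]:
            continue
        doc["Actions"].remove(action)
        if not doc["Actions"]:
            # Rule 1: no other action holds this document, bring it back.
            restored.append(name)
            del recycle_bin[name]
        # Rule 2: otherwise it stays in the bin with the remaining actions.
    return restored
```

With the two example documents above, undoing "Action#1" would restore document "2" and leave document "1" in the bin with only "Action#2".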

After 1 hour I'll also need a process that deletes those documents (and updates the ones that still have other actions).

It gets complex, but maybe the efficiency is not terrible, as undo doesn't happen a lot.
