How to find out if a document was deleted



I wonder how to find out whether a document has already been physically removed from ES, not just marked as deleted.

I have a pack of data that consists of multiple documents that I want indexed for full-text search. I use only index and delete operations, since I am using external versioning. I use the bulk API to reduce the number of calls.

There may be multiple packs of data, each with its own routing.

It can happen that I need to resynchronize a pack of data in ES. My data store keeps information about deleted documents only for a limited time, so if the connection to ES is lost for longer than that, I would no longer know which documents should be deleted.

Therefore I expected to use a delete_by_query call to delete all documents with the specified routing and then index them again. However, to do that I need to know when it is safe to index the data again without version conflicts.

I keep a separate type that holds the latest synchronized version of the data, so I thought that if I delete it last, and ES purges deleted documents sequentially, it would be safe to start indexing again once it is gone. However, I am not sure that is the case.

Is there a better way to find out whether a document on ES is merely marked as deleted?

Thank you for your answer.

(Mark Walkom) #2

When you get a response from the DBQ it's safe.


No, it's not. DBQ returns immediately, and the document is only marked as deleted but still present in the system.

My application allows multiple nodes to send data to ES. It can happen that a node tries to send data that is slightly older. That is where versioning comes in. When we detect a version conflict from Elasticsearch, the app simply assumes that a newer version is already present on ES and skips sending that document.
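The skip-on-conflict handling described above could be sketched roughly as follows. This is only an illustration, not the application's actual code: the function name `split_bulk_results` is hypothetical, and it assumes the usual bulk response shape in which each entry of `items` is keyed by the action name and carries an HTTP-like `status`, with 409 signalling a version conflict.

```python
from typing import Any, Dict, List, Tuple


def split_bulk_results(response: Dict[str, Any]) -> Tuple[List[str], List[str], List[str]]:
    """Split bulk response items into (ok_ids, conflict_ids, other_ids).

    Hypothetical helper: assumes each item looks like
    {"index": {"_id": "1", "status": 409, ...}}.
    """
    ok: List[str] = []
    conflicts: List[str] = []
    other: List[str] = []
    for item in response.get("items", []):
        # Each item is a single-key dict keyed by the action ("index", "delete", ...).
        for _action, result in item.items():
            doc_id = result.get("_id")
            status = result.get("status", 0)
            if status == 409:
                # Version conflict: the app assumes a newer version is already stored.
                conflicts.append(doc_id)
            elif 200 <= status < 300:
                ok.append(doc_id)
            else:
                # Anything else (for example 404 on a delete) needs separate handling.
                other.append(doc_id)
    return ok, conflicts, other
```

Note that a 409 alone does not say why the stored version is higher, which is exactly the ambiguity this thread is about.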

Let's say I have indexed documents with this HTTP request:

POST /_bulk?refresh=wait_for

Now I lose the connection to Elasticsearch for some time. In the meantime, the document with _id 2 has been modified to version 6 and the document with _id 3 has been deleted, and so much time has passed that the record of its deletion is no longer kept. Therefore I need to synchronize the data by first removing everything, to ensure that no document that was already deleted remains in ES, and then indexing it again.

Therefore I call:

POST /index/test/_delete_by_query?routing=routing1
{
  "query": {
    "match_all": {}
  }
}

This deletes all documents with the given routing. Now I need to index the data again. Document 1 is unchanged, document 2 was changed to version 6, and document 3 was deleted and is no longer present, so my request looks like this:

POST /_bulk?refresh=wait_for
{"field":"other data"}

What I get now is a version conflict for document 1. Document 2 went through because delete_by_query only increased its version by 1 and marked it as deleted.
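To make the arithmetic behind this concrete, here is a small model of the check. It is a sketch under stated assumptions, not Elasticsearch code: both function names are hypothetical, the "delete bumps the version by 1" rule is taken from the behavior described above, and the stored versions (3 for document 1, 4 for document 2) are made-up numbers because the original bulk request is not shown. The only rule borrowed from Elasticsearch itself is that with external versioning a write is accepted only when its version is strictly greater than the stored one.

```python
from typing import Optional


def version_after_delete(stored_version: int) -> int:
    """Version of the delete marker, assuming the delete bumps the version by 1."""
    return stored_version + 1


def external_version_accepts(stored_version: Optional[int], incoming_version: int) -> bool:
    """External versioning: accept only a strictly greater version.

    A stored_version of None models a document with no remaining trace.
    """
    return stored_version is None or incoming_version > stored_version


# Hypothetical versions: document 1 was stored at 3, document 2 at 4.
doc1_marker = version_after_delete(3)  # delete marker sits at version 4
doc2_marker = version_after_delete(4)  # delete marker sits at version 5

print(external_version_accepts(doc1_marker, 3))  # re-index document 1, unchanged
print(external_version_accepts(doc2_marker, 6))  # re-index document 2 at version 6
print(external_version_accepts(None, 3))         # document 1 once fully purged
```

In this model the unchanged document is rejected for as long as its delete marker exists, while the modified one passes because its new version has already moved past the marker.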

Now half of the index operations succeed and half fail with a version conflict message, and I have no clue whether a conflict occurred because I had already indexed the document at a higher version or because DBQ has not yet actually removed the document from ES.

Therefore I need some way to find out whether a document has really been deleted or is still there.
