We are using Elasticsearch 7.17.7 and indexing documents via BulkRequest with the Java API Client, and we noticed that many documents are being deleted after indexing. We retrieve the records from a PostgreSQL database and use their primary key, which is a UUID, as the Elasticsearch document ID, so we expect the number of records in the database and the final number of indexed documents to be identical. However, after sending 112,938 documents to Elasticsearch, 42,662 were deleted and only 70,276 remain in the index (please see the stats file below). Duplicate IDs shouldn't be an issue: when we query the PostgreSQL database we do get 112,938 records, so there are indeed that many distinct UUIDs.
What other criteria would elastic be using to delete documents? We tried indexing different databases, but the same happens for all of them. Indexing the same database multiple times also results in different numbers of deleted documents. We are at a loss about why this is happening, so any insight is welcome.
The sum of count and deleted is exactly the number of rows sent in (70,276 + 42,662 = 112,938). The deleted number indicates an update. I would recommend looking at a few documents in Elasticsearch and verifying that the ID is what you expect. Is the UUID field the only field that is part of the primary key? Does the query you run include any join that could cause a UUID to appear more than once? What do you get if you do a SELECT COUNT DISTINCT on the UUID column?
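The same duplicate check can also be done on the client side, on the IDs the indexing code is about to send. Below is a minimal sketch, assuming the IDs of a batch are available as a list of strings; the class and method names are hypothetical and not taken from the poster's code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reports IDs that occur more than once in a batch.
final class DuplicateIdCheck {

    /** Returns each ID that occurs more than once, mapped to its number of occurrences. */
    static Map<String, Integer> findDuplicates(List<String> ids) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : ids) {
            counts.merge(id, 1, Integer::sum); // count occurrences per ID
        }
        counts.values().removeIf(n -> n == 1); // keep only IDs seen at least twice
        return counts;
    }

    public static void main(String[] args) {
        // Made-up IDs for illustration only.
        List<String> batch = List.of("a-1", "b-2", "a-1", "c-3");
        System.out.println(findDuplicates(batch)); // prints {a-1=2}
    }
}
```

If this returns a non-empty map for the IDs actually passed to the bulk request, the duplicates are introduced somewhere between the database query and the request, even if the table itself has none.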
Yes, the IDs of the documents in Elastic are exactly what I expect, that is, the same UUIDs coming from the database. Please see the picture below, with a few lines from the _id column in Kibana and from the UUID column in the database, side by side.
Elasticsearch does not automatically delete anything, which is why I suspected duplicates.
The deleted document count can change when merging occurs in the background, so it may not always show the full number of deletes.
To troubleshoot this further I would do two things:
Ensure you use a create action rather than an index action in your bulk requests. If, for some reason, there is something that is considered a duplicate by Elasticsearch, this will be indicated in the response, so make sure to check it for errors and log whatever you find.
Try to identify some documents that are missing from Elasticsearch and see if there is any pattern in the records that are in the index compared to the ones that are missing.
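For the second step, one way to find the missing documents is a plain set difference between the IDs in the database and the IDs in the index. This is only a sketch: it assumes both ID lists have already been exported (for example from a SQL query and from a scroll over the index), and the class and method names are hypothetical.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: lists database IDs that have no document in the index.
final class MissingIdFinder {

    /** Returns the IDs present in the database but absent from the index, sorted. */
    static Set<String> missingFromIndex(Collection<String> databaseIds,
                                        Collection<String> indexedIds) {
        Set<String> missing = new TreeSet<>(databaseIds);
        missing.removeAll(new HashSet<>(indexedIds)); // set difference: database minus index
        return missing;
    }

    public static void main(String[] args) {
        // Made-up IDs for illustration only.
        List<String> db = List.of("id-1", "id-2", "id-3", "id-4");
        List<String> es = List.of("id-1", "id-3");
        System.out.println(missingFromIndex(db, es)); // prints [id-2, id-4]
    }
}
```

Having the missing IDs in sorted order makes it easier to spot a pattern, such as gaps that line up with batch boundaries.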
Took some time to analyze the code and the records, but reached no conclusion...
and the response says nothing but that the request was successful: no errors and no further information. Further evidence that there are no errors is that all records are sent and received, as confirmed by the "index_total" in the stats file.
Also, what do you mean by "something that is considered a duplicate" if, as I said, I can confirm there are no duplicate IDs? I go back to my first question: what other reasons, besides equal IDs, would result in duplicates?
I checked a number of records, both indexed and not indexed, and could identify no pattern at all.
I don't think there are other reasons. A duplicate in Elasticsearch happens when you try to index a document with an _id that already exists.
As Christian said, Elasticsearch will not delete any document by itself. If you have deleted documents, this means that your code tried to index documents with the same _id, and this led Elasticsearch to update the previously indexed document, which means deleting the old document and indexing the new one.
But since you have 112938 unique UUIDs, you should also end up with 112938 unique documents in Elasticsearch. The fact that you are getting deleted documents and a lower document count, and that indexing the same database multiple times results in different numbers of deleted documents, could mean that there is an issue in your code that causes it to index different documents under the same _id.
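To make the effect on the stats concrete, here is a toy in-memory model of the two bulk actions discussed in this thread. It is not Elasticsearch code and makes no calls to the client; the class is hypothetical and only mimics the behaviour described above: an index action on an existing _id replaces the document and counts the old copy as deleted, while a create action on an existing _id is rejected (in real Elasticsearch, as far as I know, with a 409 version conflict on that bulk item).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (assumption: simplified, ignores background merges that shrink the deleted count).
final class IndexStatsModel {
    private final Map<String, String> live = new HashMap<>();
    private long deleted = 0;

    /** Models the index action: an existing _id is overwritten and the old copy counts as deleted. */
    void index(String id, String source) {
        if (live.put(id, source) != null) {
            deleted++;
        }
    }

    /** Models the create action: returns false (a conflict) if the _id already exists. */
    boolean create(String id, String source) {
        if (live.containsKey(id)) {
            return false;
        }
        live.put(id, source);
        return true;
    }

    long count() { return live.size(); }

    long deleted() { return deleted; }

    public static void main(String[] args) {
        IndexStatsModel model = new IndexStatsModel();
        // Five requests, but only three distinct IDs (made-up values).
        for (String id : List.of("a", "b", "a", "c", "b")) {
            model.index(id, "{}");
        }
        System.out.println(model.count() + " live, " + model.deleted() + " deleted"); // prints 3 live, 2 deleted
    }
}
```

In this model, count plus deleted always equals the number of index requests sent, which is the same relationship seen in the stats from the original post. With create, the repeated IDs would show up as rejected items instead of silently replacing earlier documents.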
Can you share your code or at least the parts related to the indexing?
Have you tried changing your code to index your documents without a custom _id, letting Elasticsearch set the document ID, to see how many documents you end up with?
I mean their shape is as I expect it to be, that is, they are formatted as UUIDs, precisely as I intended them to be. Yes, there are a lot of missing IDs (and hence missing documents); this is exactly the problem I'm reporting.
I see, ok, if you only ever added documents with ID 0002ead6-4606-4042-9ac6-0788ffc1864b9 (even with duplicates) and at least one of those indexing requests succeeded, then Elasticsearch would have a document with that ID. So either it's not being indexed successfully with that ID, or something else is deleting it.
Send bulk requests with the create action instead of the index action. This will force an error instead of an update if there are duplicates, and should help you troubleshoot as long as you analyse and log response errors.
OpenSearch/OpenDistro are AWS-run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.
(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns.)