Elasticsearch not showing correct count of documents in index

Hi.

Our Java 8 based application sends a total of 17061816 documents to Elasticsearch 7.17.4 for indexing. However, after all indexing has completed, curl _count shows only 16833817 indexed documents. There are no errors in the Elasticsearch log either. I've enabled DEBUG logging at the root logger, but still no significant error is shown. I'm not sure what the reason could be. The input figure is already a count of distinct records.

Any help is highly appreciated.

Are you using custom document _id or letting elasticsearch choose the document _id?

Using a custom document _id. We still can't figure out the cause, as there are no errors in the Elasticsearch logs or in the application logs. The custom _id is the unique default ID auto-generated by Neo4j. We send 7500 documents per batch to the Bulk Request API. No errors or exceptions are received from Elasticsearch. On every start, the application drops all indexes, creates new ones, and reindexes all documents. Each run shows the same number of documents missing in Elasticsearch.

What does the index stats API show for the index when you have completed indexing? Do you see any evidence of deleted documents, which would indicate that you have had updates occur due to duplicate IDs?

The index stats API shows a total docs count of 16833817 and a deleted count of 221074. Even if we add the deleted docs to the total count (i.e. 17054891), it still doesn't match the number of docs we're sending to Elasticsearch, which is 17061816.

Also, under indexing --> "index_total" : 16833817.

I've tried refreshing the index using the Refresh API, but the stats remain the same.
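
For reference, the arithmetic behind these figures can be laid out as a small Python sketch. The `reconcile` function and its key names are only for illustration; the numbers are the ones quoted in this post.

```python
def reconcile(sent, docs_count, docs_deleted):
    """Compare the number of documents sent with what index stats report."""
    count_plus_deleted = docs_count + docs_deleted
    return {
        # Documents that never show up in _count / docs.count
        "missing_from_count": sent - docs_count,
        # Live docs plus docs marked as deleted
        "count_plus_deleted": count_plus_deleted,
        # Gap that remains even after adding the deleted docs
        "still_unexplained": sent - count_plus_deleted,
    }


result = reconcile(sent=17061816, docs_count=16833817, docs_deleted=221074)
for name, value in result.items():
    print(f"{name}: {value}")
```

So 227999 documents are missing from the count, and 6925 are still unaccounted for after adding the deleted docs.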

The number of deleted documents will change as segments are merged, so the figures will not necessarily tally up exactly. The deleted count does, however, indicate that your IDs are not unique and that you are seeing updates.
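
A quick way to confirm this on the application side is to count how often each _id occurs in the data before it is sent. A minimal Python sketch, assuming the IDs can be collected into an iterable; the `find_duplicate_ids` helper and the sample IDs are made up for illustration:

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return a dict of id -> occurrences for every id seen more than once."""
    counts = Counter(ids)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Hypothetical sample, not real data from the thread
sample_ids = ["101", "102", "103", "102", "104", "101", "101"]

duplicates = find_duplicate_ids(sample_ids)
# Every repeat beyond the first overwrites an existing document
overwritten = sum(n - 1 for n in duplicates.values())

print(duplicates)
print(overwritten)
```

If `overwritten` is non-zero for the real ID list, that many index requests turned into updates rather than new documents.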

If, as a test, you let Elasticsearch set the IDs instead, you should see all documents ingested.
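
It is also worth inspecting every bulk response item by item. The Bulk API can return HTTP 200 while individual items fail; in that case the top-level `errors` field is true and the failing item carries an `error` object. Items indexed with an already existing _id come back with `result` set to `updated` instead of `created`. A rough Python sketch, assuming the response body has already been parsed into a dict; the `summarize_bulk_response` helper and the sample response are hypothetical:

```python
from collections import Counter


def summarize_bulk_response(response):
    """Count created/updated/failed items in one parsed Bulk API response."""
    summary = Counter()
    failures = []
    for item in response.get("items", []):
        # Each item is a single-key dict keyed by the action ("index", "create", ...)
        _action, details = next(iter(item.items()))
        if "error" in details:
            summary["failed"] += 1
            failures.append(
                (details.get("_id"), details.get("status"), details["error"].get("type"))
            )
        else:
            summary[details.get("result", "unknown")] += 1
    return summary, failures


# Made-up response for illustration only
sample_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "docs", "_id": "1", "status": 201, "result": "created"}},
        {"index": {"_index": "docs", "_id": "1", "status": 200, "result": "updated"}},
        {"index": {"_index": "docs", "_id": "2", "status": 429,
                   "error": {"type": "es_rejected_execution_exception",
                             "reason": "rejected execution"}}},
    ],
}

summary, failures = summarize_bulk_response(sample_response)
print(dict(summary))
print(failures)
```

Summing these counters over all batches shows whether documents are being overwritten, rejected, or never sent at all. If the application uses the Java high-level REST client, the corresponding checks would be `BulkResponse.hasFailures()` and `BulkItemResponse.isFailed()`.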

I'm trying it and will share the results once indexing is done. Meanwhile, in a separate test I reduced the batch size from 7500 to 2000 (on Elasticsearch 6.8.23) and got the correct count there, which is 17061816. The count mismatch occurs on Elasticsearch 7.17.4. The dataset and application code are the same for both ES 6.x and ES 7.x.

Elasticsearch 6.8.23 is EOL and no longer supported. Please upgrade ASAP.


Did you index into a new index on 6.8.23 or did you use an existing index?

Can you show a sample document?

New index. Every time the application starts, it drops the existing indexes and recreates them. I restart the application for every test.

Here is a sample document. The data has been changed for privacy purposes.
https://tmpfiles.org/1421890/document.json

I've tested with Elasticsearch-generated IDs (not using the custom _id) and the count is still not the same. In fact, it's 7500 more than what I was previously getting with the custom _id. BTW, 7500 is also the batch size we're sending to the Bulk Request API. So I think something is going on with the batch size as well.
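
One thing that could be ruled out here is whether a whole batch is being dropped, for example a final partial batch that never gets flushed, or a batch whose response is never checked. A minimal Python sketch of batching with per-batch accounting; `send_batch` is a hypothetical stand-in for the real Bulk call, and `fake_send_batch` simply acknowledges everything:

```python
def batches(items, size):
    """Yield lists of at most `size` items, including the trailing partial batch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Flush whatever is left over after the last full batch
        yield batch


def index_all(items, size, send_batch):
    """Send every batch and track how many docs were sent vs. acknowledged."""
    sent = acknowledged = 0
    for batch in batches(items, size):
        sent += len(batch)
        # send_batch is assumed to return the number of successfully indexed items
        acknowledged += send_batch(batch)
    return sent, acknowledged


def fake_send_batch(batch):
    # Placeholder for the real Bulk request; pretends every item succeeded
    return len(batch)


sent, acknowledged = index_all(range(17), 5, fake_send_batch)
print(sent, acknowledged)
```

With 17061816 documents and a batch size of 7500, that is 2274 full batches plus a final batch of 6816, so the sent and acknowledged totals should both end up at 17061816.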
