Our Java 8 application sends a total of 17061816 documents to Elasticsearch 7.17.4 for indexing. However, after all indexing has completed, a curl request to _count reports only 16833817 indexed documents. There are no errors in the Elasticsearch log. I've enabled DEBUG logging on the root logger, but still no significant error is shown. What could be the reason? Note that the input figure is the count of distinct records.
We are using a custom document ID. We still can't figure out the cause, as there are no errors in either the Elasticsearch logs or the application logs. The custom ID is the unique default ID auto-generated by Neo4j. We send 7500 documents per batch to the Bulk Request API. No errors or exceptions are received from Elasticsearch. On every start, the application drops all indexes, creates new ones, and reindexes all documents. Every run shows the same number of documents missing in Elasticsearch.
What does the index stats API show for the index when you have completed indexing? Do you see any evidence of deleted documents, which would indicate that you have had updates occur due to duplicate IDs?
The index stats API shows a total docs count of 16833817 and a deleted count of 221074. Even if we add the deleted docs to the total (i.e. 17054891), it still doesn't match the number of docs we're sending to Elasticsearch, which is 17061816.
Also, under indexing --> "index_total" : 16833817.
I've tried refreshing the index using the Refresh API, but the stats remain the same.
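For reference, the arithmetic in this post can be written out as a small sketch. The class and method names below are made up for illustration; only the three figures come from the thread. Because the deleted count can shrink as segments merge, the "unexplained" figure is best read as an upper bound on what overwrites alone do not account for, assuming every deleted doc came from an overwrite.

```java
// Illustrative sketch: names are hypothetical, figures are the ones quoted in the thread.
final class CountReconciliation {

    /** Documents sent by the application minus live documents reported by the index. */
    static long missing(long sent, long live) {
        return sent - live;
    }

    /** Gap that remains even if every deleted doc is assumed to be an overwritten duplicate. */
    static long unexplained(long sent, long live, long deleted) {
        return sent - (live + deleted);
    }

    public static void main(String[] args) {
        long sent = 17_061_816L;    // documents sent by the application
        long live = 16_833_817L;    // docs count from _count / index stats
        long deleted = 221_074L;    // deleted count from index stats

        System.out.println("missing     = " + missing(sent, live));              // 227999
        System.out.println("unexplained = " + unexplained(sent, live, deleted)); // 6925
    }
}
```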
The number of deleted documents will change as segments are merged, so the figures will not necessarily tally up exactly. This does, however, indicate that your IDs are not unique and that you are seeing updates.
As a test, if you let Elasticsearch assign the IDs, you should see all documents ingested.
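Another way to check for overwrites, without re-running with generated IDs, is to tally the per-item results of each bulk response. A bulk request can return HTTP 200 overall while individual items fail, and an index action on an ID that already exists comes back with result "updated" rather than "created". The sketch below assumes the response items have already been parsed into maps holding the "result", "status" and optional "error" fields; the class, the helper methods and the sample data are hypothetical, and this is not necessarily how the application in the thread reads its responses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: each map stands for one already-parsed item of a bulk response.
final class BulkResultTally {

    /** Counts items per outcome: "created", "updated", ... or "failed" when an error is present. */
    static Map<String, Integer> tally(List<Map<String, Object>> items) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, Object> item : items) {
            String key = item.containsKey("error") ? "failed" : String.valueOf(item.get("result"));
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Hypothetical helper that builds a successful item for the demo. */
    static Map<String, Object> item(String id, String result, int status) {
        Map<String, Object> m = new HashMap<>();
        m.put("_id", id);
        m.put("result", result);
        m.put("status", status);
        return m;
    }

    /** Hypothetical helper that builds a failed item for the demo. */
    static Map<String, Object> failedItem(String id, int status, String errorType) {
        Map<String, Object> m = new HashMap<>();
        m.put("_id", id);
        m.put("status", status);
        m.put("error", errorType);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> items = Arrays.asList(
                item("a", "created", 201),
                item("b", "created", 201),
                item("a", "updated", 200),   // same ID again: an overwrite, not a new doc
                failedItem("c", 429, "es_rejected_execution_exception"));
        System.out.println(tally(items));    // {created=2, failed=1, updated=1}
    }
}
```

Summed over all batches, a non-zero "updated" count points to duplicate IDs, while a non-zero "failed" count points to items that never made it into the index even though no exception was thrown.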
I'm trying it and will share the results once indexing is done. Meanwhile, in a separate test I reduced the batch size from 7500 to 2000 (on Elasticsearch 6.8.23) and got the correct count in Elasticsearch 6.8.23, which is 17061816. The count mismatch occurs on Elasticsearch 7.17.4. The dataset and application code are the same for both ES 6.x and ES 7.x.
I've tested with Elasticsearch-generated IDs (not using the custom ID) and the count still doesn't match. In fact, it's 7500 more than what I was previously getting with the custom ID. BTW, 7500 is also the batch size we're sending to the Bulk Request API. So I think the batch size has something to do with it as well.
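Since the difference is exactly one batch, it may be worth checking the client-side accounting: how many batches should be sent, and whether the sizes of the batches actually sent add up to the input count. The sketch below is one way to do that with plain JDK collections; the class and method names are hypothetical, and whether a batch is really being lost is an open question, not something the thread confirms.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: splits documents into batches and checks that none are left out.
final class BatchPlan {

    /** Splits docs into consecutive batches of at most batchSize, keeping the trailing partial batch. */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < docs.size(); from += batchSize) {
            int to = Math.min(from + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(from, to)));
        }
        return batches;
    }

    /** Number of bulk requests needed for total docs (ceiling division). */
    static long batchCount(long total, int batchSize) {
        return (total + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        // With the figures from the thread: 2274 full batches plus one partial batch.
        System.out.println(batchCount(17_061_816L, 7500));   // 2275
        System.out.println(17_061_816L % 7500);              // 6816

        // Small demo: 10 docs in batches of 4 gives sizes 4, 4, 2.
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add(i);
        }
        int sent = 0;
        for (List<Integer> batch : partition(docs, 4)) {
            sent += batch.size();
        }
        System.out.println(sent == docs.size());             // true
    }
}
```

Logging the running total of batch sizes next to the number of acknowledged items per bulk response makes it easier to see whether a whole batch goes unaccounted for.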