Missing documents during bulk insert

EsraKahraman · June 21, 2022, 9:02pm

Hi there, I am using the bulk API to load data into Elasticsearch. I need to load millions of documents into an Elasticsearch index; my ETL job breaks the whole set down into chunks of 10K records. It then splits each chunk into JSON files of 20 documents each. We stop at 20 documents because our documents are large and the JSON file was exceeding the max limit, so we took an extreme approach and reduced each file to <70KB. The ETL job is built in Talend.

Let's take a concrete example: if I want to insert 100K documents, I break that down into 10 chunks, and each chunk will have 500 different JSON files.
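The arithmetic above (100K documents, 10 chunks of 10K, 500 files of 20 documents per chunk) can be sketched as follows. This is only an illustration in Python, not the Talend job itself; the helper names `batched` and `to_bulk_ndjson` and the `id` field are hypothetical, and the sketch assumes every document is sent with an `index` action.

```python
import json
from typing import Dict, Iterator, List


def batched(items: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def to_bulk_ndjson(docs: List[Dict], index: str, id_field: str = "id") -> str:
    """Build a _bulk body: one action line plus one source line per document.

    The body must end with a newline, otherwise the last line is not
    accepted by the bulk API.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Hypothetical data standing in for the real records.
records = [{"id": str(n), "value": n} for n in range(100_000)]
chunks = list(batched(records, 10_000))   # 10 chunks
files = list(batched(chunks[0], 20))      # 500 files in the first chunk
body = to_bulk_ndjson(files[0], "bulktest")
```

Each file body then holds 40 lines (20 action lines and 20 source lines).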

After preparing the files, I do a bulk insert with each JSON file. The chunks run sequentially, and once a run is complete, the job removes that iteration's files before going on to the next one. The /_bulk API request is made using curl, with username/password basic authentication.
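For anyone who wants to reproduce the request outside of curl, here is a rough Python equivalent of one /_bulk call with basic authentication. The function name `build_bulk_request`, the host URL and the credentials are placeholders, not values from my job. The sketch only builds the request; it does not send it.

```python
import base64
import urllib.request


def build_bulk_request(base_url: str, ndjson_body: str,
                       username: str, password: str) -> urllib.request.Request:
    """Build a POST /_bulk request with basic auth and an NDJSON body."""
    # The bulk body has to end with a newline.
    if not ndjson_body.endswith("\n"):
        ndjson_body += "\n"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_bulk",
        data=ndjson_body.encode("utf-8"),
        headers={
            "Content-Type": "application/x-ndjson",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder host and credentials, for illustration only.
request = build_bulk_request(
    "http://localhost:9200/",
    '{"index":{"_index":"bulktest","_id":"1"}}\n{"field":1}',
    "elastic",
    "secret",
)
```

Passing the request to `urllib.request.urlopen` would send it. Saving the response body next to each file before the file is deleted would make it possible to audit a run afterwards. With curl, the file should be sent with `--data-binary` so that the newlines are preserved.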

The ETL job works perfectly with a small set of data. Intermittently, though, there are missing documents in my index, especially when my data load exceeds 50K records. These missing documents do exist in the JSON files that were generated for the bulk inserts.

I am able to load them by re-running my ETL job, either for all records or for just the missing ones. Each time, the missing documents can come from different chunks, and sometimes the job loads all documents into Elasticsearch without any missing.
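To pin down which documents are missing after a run, the IDs in the generated files can be compared with the IDs that are actually in the index. A minimal sketch, assuming the files only contain `index` actions with an explicit `_id`; the function names are hypothetical and the list of IDs found in the index has to come from a separate query.

```python
import json
from typing import Iterable, List, Set


def ids_in_bulk_body(ndjson: str) -> List[str]:
    """Return the _id of every action line in a bulk body.

    Assumes strictly alternating action/source lines, which holds when
    only `index` actions are used.
    """
    lines = [line for line in ndjson.splitlines() if line.strip()]
    return [json.loads(action_line)["index"]["_id"] for action_line in lines[0::2]]


def missing_ids(expected: Iterable[str], found: Iterable[str]) -> Set[str]:
    """IDs that were sent in the files but are not present in the index."""
    return set(expected) - set(found)


# Illustrative body with two made-up documents.
sample_body = (
    '{"index":{"_index":"bulktest","_id":"a_1"}}\n'
    '{"field":"x"}\n'
    '{"index":{"_index":"bulktest","_id":"a_2"}}\n'
    '{"field":"y"}\n'
)
expected = ids_in_bulk_body(sample_body)
gaps = missing_ids(expected, found=["a_1"])
```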

I have tried disabling replication and setting refresh_interval to 300s.
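For reference, the two settings correspond to `number_of_replicas` and `refresh_interval` in the index settings. A small sketch of the JSON body that would be sent to the index settings endpoint (for example `PUT /bulktest/_settings`); the helper name `bulk_load_settings` is made up for illustration.

```python
import json


def bulk_load_settings(replicas: int = 0, refresh_interval: str = "300s") -> str:
    """JSON body for the index settings API with the values tried here."""
    return json.dumps({
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    })


settings_body = bulk_load_settings()
```

One thing worth ruling out with a long `refresh_interval`: newly indexed documents are not visible to search until the next refresh, so a count taken right after the load can look short unless the index is refreshed first.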

I would like to resolve this unpredictable behavior of the bulk API. Any help is much appreciated.

warkolm · June 21, 2022, 10:20pm

Welcome to our community!

Are you parsing the responses from Elasticsearch to see what it is saying?
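The bulk API can return HTTP 200 even when individual items fail, so the check has to look at the top-level `errors` flag and at each item's `status`. A minimal sketch of such a check in Python; the function name `summarize_bulk_response` and the shape of the summary are illustrative, and the rejected item in the second sample is synthetic.

```python
import json
from typing import Any, Dict


def summarize_bulk_response(raw: str, expected_docs: int) -> Dict[str, Any]:
    """Collect per-item failures and compare the item count to what was sent."""
    body = json.loads(raw)
    items = body.get("items", [])
    failures = []
    for item in items:
        # Each item is a one-key object named after the action ("index", "create", ...).
        action, result = next(iter(item.items()))
        if result.get("status", 0) >= 300 or "error" in result:
            failures.append({
                "action": action,
                "_id": result.get("_id"),
                "status": result.get("status"),
                "error": result.get("error"),
            })
    return {
        "errors_flag": bool(body.get("errors")),
        "item_count": len(items),
        "count_matches": len(items) == expected_docs,
        "failures": failures,
    }


ok_raw = '{"took":9,"errors":false,"items":[{"index":{"_index":"bulktest","_id":"1","result":"created","status":201}}]}'
# Synthetic example of a rejected item (HTTP 429 on a single document).
bad_raw = ('{"took":5,"errors":true,"items":['
           '{"index":{"_index":"bulktest","_id":"1","status":201}},'
           '{"index":{"_index":"bulktest","_id":"2","status":429,'
           '"error":{"type":"es_rejected_execution_exception"}}}]}')
ok_summary = summarize_bulk_response(ok_raw, expected_docs=1)
bad_summary = summarize_bulk_response(bad_raw, expected_docs=2)
```

Running a check like this on every response, rather than on a sample, would show whether any file was rejected, partially indexed, or never answered at all.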

EsraKahraman · June 22, 2022, 12:55pm

Hi Mark, thank you so much

Yes, I am parsing the responses locally for a good amount of data, and I also opened them on the cluster during tests, but I didn't see any errors.

I get a response like this:

warkolm · June 23, 2022, 6:03am

Please don't post pictures of text, logs or code. They are difficult to read, impossible to search and replicate (if it's code), and some people may not even be able to see them.

EsraKahraman · June 23, 2022, 2:02pm

Sure, I am adding the text version below:

{"took":9,"errors":false,"items":[

{"index":{"_index":"bulktest","_id":"31719_1_2022-04-01","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":3,"_primary_term":1,"status":201}},

{"index":{"_index":"bulktest","_id":"31741_1_2022-04-01","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":4,"_primary_term":1,"status":201}},

{"index":{"_index":"bulktest","_id":"31753_1_2022-04-01","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":5,"_primary_term":1,"status":201}}]}

system · July 21, 2022, 2:02pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
