Hi there, I am using the Bulk API to load data into Elasticsearch. I need to load millions of documents into an index; my ETL job breaks the whole set down into 10K-record chunks, then splits each chunk into JSON files of 20 documents each. We stop at 20 documents per file because our document structure is large and the JSON files were exceeding the request size limit, so we took the extreme approach of keeping each file under 70 KB. The ETL job is built in Talend.
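For context, each file is in the standard bulk NDJSON format, one action line followed by one source line per document, 20 pairs per file, ending with a trailing newline as the Bulk API requires (the index name, IDs, and fields below are simplified placeholders; the real documents are much larger):

```json
{ "index": { "_index": "my-index", "_id": "1001" } }
{ "field_a": "value", "field_b": "value" }
{ "index": { "_index": "my-index", "_id": "1002" } }
{ "field_a": "value", "field_b": "value" }
```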
As a concrete example: if I want to insert 100K documents, I break them down into 10 chunks, and each chunk ends up with 500 JSON files (10K records / 20 documents per file).
After preparing the files, I do a bulk insert with each JSON file. The chunks run sequentially, and once a run completes, the job removes that iteration's files before moving on to the next one. The /_bulk API request is made with curl, using username/password basic authentication.
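The call is essentially this (host, port, index name, credentials, and file paths are placeholders):

```bash
# --data-binary preserves the newlines that the NDJSON body requires
curl -s -u "user:password" \
  -H "Content-Type: application/x-ndjson" \
  -X POST "https://my-es-host:9200/my-index/_bulk" \
  --data-binary "@chunk_01/docs_001.json"
```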
The ETL job works perfectly with a small data set. Intermittently, documents go missing from my index, especially when the load exceeds 50K records. The missing documents do exist in the JSON files that were generated for the bulk inserts.
I can load them by re-running the ETL job, either for everything or just for the missing records. Each time, the missing documents can come from different chunks, and sometimes the job loads all documents into Elasticsearch with nothing missing at all.
I have tried disabling replication and setting refresh_interval to 300s.
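Those settings are applied before the load with a call like this (again, host and index name are placeholders):

```bash
curl -s -u "user:password" \
  -H "Content-Type: application/json" \
  -X PUT "https://my-es-host:9200/my-index/_settings" \
  -d '{ "index": { "number_of_replicas": 0, "refresh_interval": "300s" } }'
```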
I would like to resolve this unpredictable behavior of the Bulk API. Any help is much appreciated.