I have been trying to index a large amount of data, about a million documents, into Elasticsearch 2.3.3.
I tried two approaches:
1) Asynchronous. This overloaded ES: the client was firing bulk requests of 100 documents each so rapidly that ES could not keep up. After a few seconds ES started answering every request with IndexRequestRejected; it indexed only the first 9-10K documents and then rejected everything after that. My own system also became very slow and used up all its memory in the process.
Then I changed the bulk queue size to 1000 in elasticsearch.yml, with no luck. I think this queue also filled up because the requests were fired so rapidly.
Then I added a Thread.sleep(500) between requests (half a second, which is a lot slower). All the data was indexed except for 10 documents, which failed because the string value of one of the fields was larger than 32766 bytes in UTF-8. I took that into account and am OK with it.
But putting a sleep in your API is never good practice, so I dropped that idea.
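For comparison with the fixed sleep, here is a minimal sketch of capping the number of bulk requests in flight, so the producer blocks only when the cap is reached. It is not tied to the Elasticsearch client: `sendBulkAsync` is a hypothetical stand-in for whatever asynchronous bulk call is used, and the class and method names are assumptions for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Sketch only: limits how many bulk requests are outstanding at once.
class BoundedBulkSender {
    private final Semaphore inFlight;
    // Hypothetical async sender: takes one batch, completes when ES has answered.
    private final Function<List<String>, CompletableFuture<Void>> sendBulkAsync;

    BoundedBulkSender(int maxInFlight,
                      Function<List<String>, CompletableFuture<Void>> sendBulkAsync) {
        this.inFlight = new Semaphore(maxInFlight);
        this.sendBulkAsync = sendBulkAsync;
    }

    // Blocks the caller while maxInFlight requests are still unanswered.
    CompletableFuture<Void> submit(List<String> batch) {
        inFlight.acquireUninterruptibly();
        CompletableFuture<Void> result;
        try {
            result = sendBulkAsync.apply(batch);
        } catch (RuntimeException e) {
            inFlight.release(); // the request never started, so free the slot
            throw e;
        }
        // Free the slot when the request finishes, whether it succeeded or failed.
        return result.whenComplete((ok, error) -> inFlight.release());
    }

    int freeSlots() {
        return inFlight.availablePermits();
    }
}
```

With a cap of, say, 2 to 4 requests, the client slows down only as much as ES forces it to, instead of always waiting a fixed 500 ms.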
2) Synchronous. This started out firing requests rapidly, then slowed down to wait for ES to index the documents, and stayed at that pace until the last request.
But then I noticed that the number of documents indexed was only around half of the total. I don't know what the issue is here.
Nothing was printed to the log file, and none of the requests failed except for a few, again because the string value of one of the fields was larger than 32766 bytes in UTF-8.
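Since the 32766 limit is counted in UTF-8 bytes rather than characters, a value can hit it with fewer than 32766 characters. A minimal sketch of checking a field value before sending it, with the class and method names being assumptions for illustration:

```java
import java.nio.charset.StandardCharsets;

// Sketch only: flags string values that would exceed the byte limit from the error.
class TermLengthCheck {
    // Limit taken from the error message: 32766 bytes of UTF-8.
    static final int MAX_UTF8_BYTES = 32766;

    // Length of the value once encoded as UTF-8.
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    static boolean exceedsLimit(String value) {
        return utf8Length(value) > MAX_UTF8_BYTES;
    }
}
```

Running this over the documents beforehand gives the exact number that should be rejected for this reason, which can then be subtracted from the expected total.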
I read a lot about this online and then tried a few things,
for example:
set the number of replicas to 0,
increased the number of shards to 20 (I don't know if the number of shards has anything to do with data loss; I think it only helps parallel indexing and should not affect data consistency),
set the refresh interval to -1, and then refreshed after indexing completed.
But it STILL only indexes half of the data.
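To narrow down where the missing half goes, the per-document outcomes of every bulk response can be tallied and compared with the count the index reports afterwards. A bulk request can succeed as a whole while individual items in it fail, so the tally has to be per item. The sketch below assumes the caller has already turned each bulk response into a list with one entry per document, `null` meaning success and a message meaning failure; that shape and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: reconciles documents sent with documents acknowledged.
class BulkTally {
    private long sent;
    private long indexed;
    private final List<String> failures = new ArrayList<>();

    // itemErrors: one entry per document in the bulk request, null if it succeeded.
    void record(List<String> itemErrors) {
        for (String error : itemErrors) {
            sent++;
            if (error == null) {
                indexed++;
            } else {
                failures.add(error);
            }
        }
    }

    long sent() { return sent; }
    long indexed() { return indexed; }
    List<String> failures() { return failures; }

    // Documents acknowledged as indexed but not present in the index's own count.
    long unaccounted(long countReportedByIndex) {
        return indexed - countReportedByIndex;
    }
}
```

If `sent` is already only half of the source data, the loss is on the client side; if `indexed` matches the total but the index reports fewer, the gap is inside ES.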
I'm totally confused now; I have been trying to figure out what to do about this for the past week.