We were trying to reindex so that we could increase the number of shards
in our index. The index currently has 20 shards, and we wanted to raise
that to 500.
So we used the Tire gem's reindex method for that (it basically runs a
scan search on the index, scrolls through the index, and then does a bulk
insert for each scroll). But we found that in our dev environment, which
has about 250 thousand documents:
When we reindex with the default size (which is 10, so 20 shards means
200 documents), we get data loss (only 50 to 60% of the documents were
indexed in the new index).
When we tried the scan API and inserted the documents one by one, they
were inserted without data loss, but this obviously takes more time.
When we tried with size 1 (20 documents at a time), there was no data
loss.
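To make the batch arithmetic above explicit: with a scan search, `size` applies per shard, so one scroll can return up to size times the number of shards. A minimal sketch (the helper name `max_docs_per_scroll` is made up for illustration and is not part of Tire):

```ruby
# Upper bound on documents returned by one scroll of a scan search,
# where `size` is applied per shard of the source index.
# `max_docs_per_scroll` is a hypothetical helper, not a Tire method.
def max_docs_per_scroll(size, shards)
  size * shards
end

puts max_docs_per_scroll(10, 20) # default size on our 20-shard index
puts max_docs_per_scroll(1, 20)  # size 1 on the same index
```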
So we went ahead and tried size 1 (20 documents at a time) in the
production environment (which has about 30 million records). We found that:
Even with size 1 (20 documents at a time), there was data loss (we had
sent around 220 thousand documents, and only 190 thousand of them were
indexed; 30 thousand were lost). It was also slow, so we had to stop
partway through.
Why is this data loss happening during bulk insert?
I was not able to see any errors in the logs. I don't think this is a case
of one bulk request being completely rejected or failing. I see bulk
requests completing in the logs, but it seems that only some of the
documents get indexed during one bulk request. I can see the count
increasing only up to around 150 during one bulk request of 200 documents.
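One way to check for partially applied bulk requests is to inspect the bulk response itself rather than the logs. The bulk API reports a result per item, and an item that failed carries an `error` field even when the request as a whole completes. Below is a minimal sketch, assuming the raw response body is available as a JSON string; the helper name `failed_bulk_items` and the sample body are made up for illustration and are not taken from Tire or from our logs:

```ruby
require 'json'

# Returns the items of a bulk response that carry an "error" field.
# Assumes the usual response shape: {"items" => [{"index" => {...}}, ...]}.
# `failed_bulk_items` is a hypothetical helper, not part of Tire.
def failed_bulk_items(response_body)
  items = JSON.parse(response_body).fetch('items', [])
  items.select do |item|
    # Each item is a one-key hash keyed by the action name (index, create, ...).
    result = item.values.first
    result.is_a?(Hash) && result.key?('error')
  end
end

# Hypothetical response body, for illustration only.
sample = <<~BODY
  {"items": [
    {"index": {"_index": "new_index", "_id": "1"}},
    {"index": {"_index": "new_index", "_id": "2", "error": "hypothetical error message"}}
  ]}
BODY

failed = failed_bulk_items(sample)
failed.each { |item| puts "failed: #{item.values.first['_id']}" }
```

If the list is non-empty for a request, those documents were not indexed even though the request itself completed.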
It seems that if we lower the number of shards of the index to 50, there
is no data loss. But with 100 or 500 shards, the bulk insertion loses
data.
On Thursday, March 27, 2014 6:12:41 PM UTC+5:30, Binh Ly wrote:
By any chance, did you see any errors in the logs? It might be possible
that some of the bulk calls were rejected.