Elasticsearch - Data loss while reindexing (scan and bulk insert)

Hi,

We were trying to reindex so that we could increase the number of shards in
our index from 20 to 500.

So we used the Tire gem's reindex method (it basically makes a scan search
on the index, scrolls through it, and then bulk inserts each scroll batch).
But we found that in our dev environment, which has about 250 thousand
documents:

  • When we reindex with the default size (which is 10 per shard, so 20
    shards means 200 documents per scroll), we get data loss: only 50 to 60%
    of the documents end up in the new index.
  • When we tried the scan API and then inserted documents one by one,
    there was no data loss, but this obviously takes more time.
  • When we tried size 1 (20 documents at a time), there was no data loss.
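For reference, the scan-and-bulk flow described above can be sketched roughly as follows. This is a minimal sketch against the Elasticsearch HTTP API of that era, not Tire's actual internals; the `build_bulk_body` helper and index names are made up for illustration:

```ruby
require 'json'

# Build the newline-delimited _bulk body for one scroll batch.
# `hits` is the "hits" array from a scroll response; `target` is the
# destination index name. (Hypothetical helper, not part of Tire.)
def build_bulk_body(hits, target)
  hits.flat_map do |hit|
    [
      { index: { _index: target, _type: hit['_type'], _id: hit['_id'] } }.to_json,
      hit['_source'].to_json
    ]
  end.join("\n") + "\n"
end

# The surrounding loop needs a live cluster, so it is only outlined here:
#   1. POST /old_index/_search?search_type=scan&scroll=5m with a per-shard size
#   2. Repeatedly POST the returned _scroll_id to /_search/scroll?scroll=5m
#   3. For each batch of hits, POST the body built above to /_bulk
```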

So we went ahead and tried size 1 (20 documents at a time) in the
production environment, which has about 30 million records. We found that:

  • Even with size 1 (20 documents at a time), there was data loss: of the
    roughly 220 thousand documents we tried to index, only 190 thousand were
    indexed, so about 30 thousand were lost. It was also slow, so we had to
    stop partway through.

Why is this data loss happening during bulk insert?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e991d917-be19-4fe5-8938-70df53cd3cde%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

By any chance, did you see any errors in the logs? It might be possible
that some of the bulk calls were rejected.


I was not able to see any errors in the logs. I don't think it is a case of
whole bulk requests being rejected or erroring out: I can see bulk requests
completing in the logs, but only some of the documents seem to get indexed
in each one. The count increases only by around 150 for a bulk request of
200 documents.
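One way to confirm this is to inspect the bulk response body itself rather than the server logs: each item in a `_bulk` response carries its own status, so per-item rejections (e.g. a full bulk queue) show up there even when the request as a whole succeeds with HTTP 200. A minimal sketch of that check; the response shape follows the Elasticsearch bulk API, while the helper name is made up:

```ruby
# Count per-item failures in a parsed _bulk response hash.
# Returns [ok_count, failed_items]. (Hypothetical helper for illustration.)
def bulk_failures(response)
  failed = response['items'].select do |item|
    op = item['index'] || item['create']
    # An item failed if it carries an "error" field or a non-2xx status.
    op['error'] || op['status'].to_i >= 300
  end
  [response['items'].length - failed.length, failed]
end
```

If the client library swallows the response, failures like this can go unnoticed, which would match the symptom of bulk requests "completing" while only part of each batch is indexed.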

It seems that if we decrease the number of shards in the new index to 50,
there is no data loss, but increasing it to 100 or 500 shards causes data
loss during bulk insertion.
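A possible explanation worth verifying against your cluster settings (this is an assumption, not something confirmed in this thread): a bulk request is split into one sub-request per target shard, and each shard-level sub-request takes a slot in the node's bulk thread pool queue. With many shards, a single batch fans out into many sub-requests and can overflow that queue, rejecting individual items. A back-of-the-envelope sketch, assuming a bulk queue_size of 50 (check your own configuration):

```ruby
# Rough fan-out estimate: a bulk of N docs routed across S shards is split
# into at most min(N, S) shard-level sub-requests. (Assumed model; verify
# against your Elasticsearch version's bulk execution behavior.)
def shard_fanout(docs_per_bulk, shards)
  [docs_per_bulk, shards].min
end

queue_size = 50  # assumed bulk thread pool queue_size; check your cluster

# 200 docs into 500 shards -> up to 200 sub-requests per bulk, so even a
# couple of concurrent batches can exceed a 50-slot queue.
# 200 docs into 50 shards -> at most 50 sub-requests, which fits.
```

This would be consistent with 50 shards working while 100 or 500 shards lose documents.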

On Thursday, March 27, 2014 6:12:41 PM UTC+5:30, Binh Ly wrote:

By any chance, did you see any errors in the logs? It might be possible
that some of the bulk calls were rejected.
