I am using the EsStorage class to store millions of documents from a 40-node Hadoop cluster to a 3-node Elasticsearch cluster. It is very convenient, but I found out that I am losing many documents during this process.
By default, 70 reducers are instantiated and about 5% of the documents are lost.
I manually reduced the number of reducers to 12, and I'm now losing less than 1% of the docs, but I need to reach 0%.
I tried changing the es.batch.size.bytes and es.batch.size.entries parameters, and although this changes the number of lost documents, I'm still far from 0%.
It seems that the Pig connector does not verify whether a batch of docs was successfully indexed. If a batch fails, it does not seem to be retried. Is there a setting that I'm missing?
Actually it does. For every batch written, ES-Hadoop parses the response and, if documents are rejected, retries them (just the rejected documents). This is configurable, but out of the box it retries up to 3 times with a 10s wait in between. If that fails, the job fails as well.
Does the job fail in your case or not? ES-Hadoop exposes metrics, so you can see the stats for the job once it finishes. In addition, you can double-check these in real time through Marvel.
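To make the retry behaviour described above concrete, here is a minimal Python sketch of the idea. It is not ES-Hadoop's actual code: `write_with_retries` and `send_bulk` are hypothetical names, and `send_bulk` is assumed to return the subset of documents that the cluster rejected. The defaults mirror the values mentioned above (3 retries, 10s wait), which in ES-Hadoop are controlled by the `es.batch.write.retry.count` and `es.batch.write.retry.wait` settings.

```python
import time


def write_with_retries(docs, send_bulk, retry_count=3, retry_wait=10.0,
                       sleep=time.sleep):
    """Send a batch and resend only the rejected documents.

    send_bulk is a hypothetical callable: it takes a list of documents and
    returns the ones that were rejected. Returns the number of retries used,
    or raises once retry_count retries have been exhausted.
    """
    pending = list(docs)
    attempts = 0
    while True:
        rejected = send_bulk(pending)
        if not rejected:
            return attempts
        if attempts >= retry_count:
            # Mirrors "if that fails, the job fails as well".
            raise RuntimeError(
                f"{len(rejected)} documents still rejected "
                f"after {retry_count} retries"
            )
        attempts += 1
        sleep(retry_wait)
        # Only the rejected documents are sent again.
        pending = rejected
```

The point of the sketch is that a rejected document is either eventually accepted or causes a failure; it is not silently dropped. So if the job succeeds, missing documents most likely have another cause.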
No, the job does not fail. I didn't know about the metrics, that's interesting. I found them and I can see that there are indeed some bulk retries. And the number of Documents Accepted corresponds to 100% of the documents that I want to index.
I ran a new indexing job and some new queries. I found a reason that explains most of the losses: some of my documents have the same key. In that case, only one document is kept in ES. Good!
I'd like to do another test with more reducers, but unfortunately I have no more time. It's good enough for now.
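For anyone hitting the same thing, here is a small Python sketch of why the accepted count can be 100% while the index ends up with fewer documents. It assumes the key is used as the document ID (for example through `es.mapping.id`), so a later document with the same key overwrites the earlier one. The function names and the `id` field are made up for the illustration, and the dict only stands in for the index.

```python
from collections import Counter


def index_documents(docs, id_field="id"):
    """Simulate indexing by ID: every write is accepted, but a repeated ID
    replaces the document already stored under it."""
    index = {}
    accepted = 0
    for doc in docs:
        index[doc[id_field]] = doc
        accepted += 1
    return accepted, index


def duplicated_ids(docs, id_field="id"):
    """Return the IDs that occur more than once, with their counts."""
    counts = Counter(doc[id_field] for doc in docs)
    return {key: n for key, n in counts.items() if n > 1}


docs = [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "a", "v": 3}]
accepted, index = index_documents(docs)
print(accepted, len(index))   # 3 accepted, 2 stored
print(duplicated_ids(docs))   # {'a': 2}
```

Comparing the number of distinct keys in the input with the document count in the index is a quick way to check whether duplicates account for the whole gap.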
For info:
ES version: 1.7.3
ES-Hadoop version: 2.2.0-beta1 (we recently upgraded it because we'll soon upgrade ES to 2.0)