Pig - Lost documents while storing with EsStorage


I am using the EsStorage class to store millions of documents from a 40-node Hadoop cluster into a 3-node Elasticsearch cluster. It is very convenient, but I found out that I am losing many documents during this process.

By default, 70 reducers are instantiated and about 5% of the documents are lost.
I manually reduced the number of reducers to 12, and I'm now losing less than 1% of docs, but I need to reach 0% :slight_smile:

I tried changing the es.batch.size.bytes and es.batch.size.entries parameters, and although this changes the number of lost documents, I'm still far from 0%.
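For reference, this is the kind of STORE statement I'm tuning (the paths, index name, and values here are just placeholders, not my actual job):

```
-- Sketch only: '/data/docs' and 'myindex/mytype' are placeholder names
A = LOAD '/data/docs' USING PigStorage('\t') AS (id:chararray, body:chararray);
STORE A INTO 'myindex/mytype' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.batch.size.entries = 1000',
    'es.batch.size.bytes = 1mb'
);
```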

It seems that the Pig connector does not verify whether each batch of documents was successfully indexed. If a batch fails, it does not seem to be retried. Is there a setting I'm missing?

Thanks for your help :slight_smile:


Actually it does. For every batch written, ES-Hadoop parses the response and, in case documents are rejected, retries them (just the rejected documents). This is configurable, but out of the box it retries up to 3 times with a 10s wait in between. If that still fails, the job fails as well.
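The retry behaviour is controlled through the es.batch.write.retry.count and es.batch.write.retry.wait settings, which can be passed to EsStorage like any other property (the values shown below are the defaults):

```
STORE A INTO 'index/type' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.batch.write.retry.count = 3',
    'es.batch.write.retry.wait = 10s'
);
```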

Does the job fail in your case or not? ES-Hadoop exposes metrics, so at the end of the job you can see its stats. In addition, through Marvel you can double-check these in real time.

What version of ES and ES-Hadoop are you using?

Thanks for your reply.

No, the job does not fail. I didn't know about the metrics; that's interesting. I found them and I can see there are indeed some bulk retries. And the number of Documents Accepted corresponds to 100% of the documents that I want to index.

I did a new indexing run and some new queries, and found a reason that explains most of the losses: some of my documents share the same key. In that case, only one document is kept in ES. Good!
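For anyone else hitting this: when a field is mapped to the document id, tuples sharing the same id value overwrite each other in ES instead of being indexed as separate documents, e.g. (index name and field name are illustrative):

```
-- If two tuples share the same 'id' value, the second write
-- overwrites the first document in ES rather than adding a new one.
STORE A INTO 'index/type' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.mapping.id = id'
);
```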

I'd like to do another test with more reducers, but unfortunately I have no more time. It's good enough for now :slight_smile:

For info:
ES version: 1.7.3
ES-Hadoop version: 2.2.0-beta1 (we recently upgraded it because we'll soon upgrade ES to 2.0)

Thanks !

Glad to hear it's sorted out.