Pig - Lost documents while storing with EsStorage


I am using the EsStorage class to store millions of documents from a 40-node Hadoop cluster into a 3-node Elasticsearch cluster. It is very convenient, but I found out that I am losing many documents during this process.

By default, 70 reducers are instantiated and about 5% of the documents are lost.
I manually reduced the number of reducers to 12, and I'm now losing less than 1% of docs, but I need to reach 0% :slight_smile:

I tried changing the es.batch.size.bytes and es.batch.size.entries parameters, and although this changes the number of lost documents, I'm still far from 0%.
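For reference, this is the kind of STORE statement I'm tuning (the paths, index name, and values here are just placeholders, not my actual job):

```
-- Sketch only: '/data/docs' and 'myindex/mytype' are placeholder names
A = LOAD '/data/docs' USING PigStorage('\t') AS (id:chararray, body:chararray);
STORE A INTO 'myindex/mytype' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.batch.size.entries = 1000',
    'es.batch.size.bytes = 1mb'
);
```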

It seems that the Pig connector does not verify whether each batch of documents was successfully indexed. If a batch fails, it does not seem to be retried. Is there a setting I'm missing?

Thanks for your help :slight_smile:


Actually it does. For every batch written, ES-Hadoop parses the response and, in case documents are rejected, retries them (just the rejected documents). This is configurable, but out of the box it retries up to 3 times with a 10s wait in between. If that still fails, the job fails as well.
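The retry behaviour is controlled through the es.batch.write.retry.count and es.batch.write.retry.wait settings, which can be passed to EsStorage like any other property (the values shown below are the defaults):

```
STORE A INTO 'index/type' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.batch.write.retry.count = 3',
    'es.batch.write.retry.wait = 10s'
);
```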

Does the job fail in your case or not? ES-Hadoop exposes metrics, so at the end of the job you can see its stats. In addition, through Marvel you can double-check these in real time.

What version of ES and ES-Hadoop are you using?

Thanks for your reply.

No, the job does not fail. I didn't know about the metrics; that's interesting. I found them and I can see there are indeed some bulk retries. And the number of Documents Accepted corresponds to 100% of the documents that I want to index.

I did a new indexing run and some new queries, and found a reason that explains most of the losses: some of my documents share the same key. In that case, only one document is kept in ES. Good!
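For anyone else hitting this: when a field is mapped to the document id, tuples sharing the same id value overwrite each other in ES instead of being indexed as separate documents, e.g. (index name and field name are illustrative):

```
-- If two tuples share the same 'id' value, the second write
-- overwrites the first document in ES rather than adding a new one.
STORE A INTO 'index/type' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.mapping.id = id'
);
```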

I'd like to do another test with more reducers, but unfortunately I have no more time. It's good enough for now :slight_smile:

For info:
ES version: 1.7.3
ES-Hadoop version: 2.2.0-beta1 (we recently upgraded it because we'll soon upgrade ES to 2.0)

Thanks !

Glad to hear it's sorted out.