[hadoop] Extra Documents in Elastic Search

Hi,

I'm trying to store a large number of documents into ES using Pig. The Pig
job ends successfully, but I end up with more documents in Elasticsearch
than the number of rows in my input.
My Pig script is 3 lines:

REGISTER 'local/path/to/m2.jar';
data = LOAD 'path/to/hdfs/file.tsv' AS (field1: chararray, field2: long,
    field3: long, field4: long);
STORE data INTO 'index/type' USING
    org.elasticsearch.hadoop.pig.EsStorage('es.nodes=node2.domain.com',
    'es.resource=index/type');

I have speculative execution disabled for both map and reduce tasks when
running this Pig script.
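For reference, on Hadoop 1.x that usually means setting the classic mapred.* properties (a sketch; adjust the names if your distribution uses different ones):

```properties
# Disable speculative execution for map and reduce tasks (Hadoop 1.x),
# so no duplicate task attempts write the same rows to ES
mapred.map.tasks.speculative.execution=false
mapred.reduce.tasks.speculative.execution=false
```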

Hadoop reports that 54,723,557 records were written (in both the console
output and the JobTracker UI), yet the ES head plugin shows 57,344,987
docs.

My environment:
Hadoop: 1.2.1, 6-node cluster
Elasticsearch: 1.0.0, 6-node cluster (separate machines from the Hadoop nodes)
elasticsearch-hadoop: M2
Pig: 0.12.0

Any idea what is going on here?

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a852acc3-c331-491b-817d-4386493aec90%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

I did not entirely solve this issue, but it looks like ES drops some
requests when it is overloaded. Since my Hadoop cluster can run 42 mappers,
I had 42 tasks sending write requests to a single ES node (I believe all
the requests go to only one node in ES). Most of the time many tasks fail
and the Hadoop job fails with them, but sometimes Hadoop reports success
even though not all the data was successfully written.
Reducing the number of mappers should have helped, but for some reason
running Pig with the property -Dmapred.tasktracker.map.tasks.maximum=1 did
not do the trick.
Limiting the number of mappers directly in the cluster conf files seems to
have solved the problem.
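For reference, limiting map slots "in the cluster conf" on Hadoop 1.x is typically done in mapred-site.xml on each TaskTracker node (a sketch; the value shown is illustrative, and the TaskTrackers must be restarted for it to take effect):

```xml
<!-- mapred-site.xml on each TaskTracker node (Hadoop 1.x) -->
<property>
  <!-- maximum number of map tasks this TaskTracker runs concurrently -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
```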

On Wednesday, April 23, 2014 4:15:19 PM UTC-5, Napoleon T. wrote:


Hi,
I strongly recommend using the latest release (es-hadoop 2.0 RC1), which
handles document rejections (they can and will happen when ES is
overloaded). Simply replace the jar and keep adding tasks until you get
the desired performance. Note that es-hadoop also records statistics about
the job (docs sent, accepted, rejected, etc.), displayed at the end of the
job, so you can use that information to double-check the number of docs
added.
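For instance, a minimal sketch of the same store with the RC1 jar, using the es.batch.write.retry.* settings from the es-hadoop configuration docs (the jar path and the values shown are illustrative):

```pig
REGISTER 'local/path/to/rc1.jar';
data = LOAD 'path/to/hdfs/file.tsv' AS (field1: chararray, field2: long,
    field3: long, field4: long);
-- retry.count: how many times rejected bulk documents are retried before
-- the task fails; retry.wait: how long to wait between retries
STORE data INTO 'index/type' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=node2.domain.com',
    'es.batch.write.retry.count=3',
    'es.batch.write.retry.wait=10s');
```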

Cheers,

On Thu, May 15, 2014 at 10:37 PM, Napoleon T. napoleon13e@gmail.com wrote:

