I'm trying to store a lot of documents into Elasticsearch using Pig. The Pig job ends
successfully, but I end up with more documents in Elasticsearch than the
number of rows in my input.
My Pig script is 3 lines:

REGISTER 'local/path/to/m2.jar';

data = LOAD 'path/to/hdfs/file.tsv'
       AS (field1: chararray, field2: long, field3: long, field4: long);

STORE data INTO 'index/type'
      USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes=node2.domain.com',
                                                   'es.resource=index/type');
I have speculative execution disabled for map and reduce tasks when running this
Pig script.
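For completeness, this is roughly how I turn it off from inside the script itself (assuming the stock Hadoop 1.x property names; adjust them if your distribution uses different keys):

-- make sure the framework never launches duplicate task attempts
SET mapred.map.tasks.speculative.execution 'false';
SET mapred.reduce.tasks.speculative.execution 'false';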
Hadoop reports that 54,723,557 records were written (both in the console output
and in the JobTracker UI).
The ES head plugin shows docs: 57,344,987 (57,344,987).
My environment:
Hadoop: 1.2.1, 6-node cluster
Elasticsearch: 1.0.0, 6-node cluster (separate from the Hadoop nodes)
elasticsearch-hadoop: M2
Pig: 0.12.0
I did not entirely solve this issue, but it looks like ES drops some
requests when it is overloaded. Since my Hadoop cluster can run 42 mappers,
I had 42 tasks sending write requests to only 1 ES node (I believe
all the requests go to a single node in ES). Most of the time many tasks
fail and the Hadoop job fails with them, but sometimes Hadoop reports
success even though not all the data was written correctly.
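If that single-node bottleneck is the cause, listing several nodes in es.nodes might spread the load; I have not verified this, so treat it as a guess (the node names below are placeholders):

STORE data INTO 'index/type' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=node1.domain.com,node2.domain.com,node3.domain.com',
    'es.resource=index/type');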
Reducing the number of mappers should have helped, but for some reason
running Pig with -Dmapred.tasktracker.map.tasks.maximum=1 did not do the
trick (probably because that property configures the TaskTracker daemons
themselves, so it cannot be overridden per job).
Limiting the number of mappers directly in the cluster conf files seems to
have solved the problem.
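Concretely, that means capping the map slots in mapred-site.xml on every TaskTracker node and then restarting the TaskTrackers. A sketch, assuming the standard Hadoop 1.x property (the value is just an example to tune):

<!-- mapred-site.xml on each TaskTracker node: cap the map slots per node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>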
Hi,
I strongly recommend using the latest release (es-hadoop 2.0 RC1), which
handles document rejections (these can and will happen when ES is
overloaded). Simply replace the jar and start adding more tasks until you
get the desired performance. Note that es-hadoop also records stats about
the job (docs sent, accepted, rejected, etc.) and displays them at the
end of the job, so you can use that information to double-check the
number of docs added.
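For reference, the upgraded script would look roughly like this; the jar name and the es.batch.size.* values below are assumptions on my part (see the es-hadoop configuration docs for the exact settings), so treat them as starting points rather than recommendations:

REGISTER 'local/path/to/elasticsearch-hadoop-2.0.0.RC1.jar';

data = LOAD 'path/to/hdfs/file.tsv'
       AS (field1: chararray, field2: long, field3: long, field4: long);

-- smaller bulk batches put less pressure on an overloaded cluster
STORE data INTO 'index/type' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=node2.domain.com',
    'es.resource=index/type',
    'es.batch.size.entries=500',
    'es.batch.size.bytes=1mb');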