ES - Amazon EMR - Pig

(Ayan Guha) #1


I am using Pig to connect to ES for both reading and writing data.

[hadoop mapreduce]$ hadoop version
Hadoop 2.4.0-amzn-4
Subversion -r 8aaaf366c7fe8ea6d8c37e76c2cc8caa10e11c06
Compiled by Elastic MapReduce on 2015-04-02T21:19Z
Compiled with protoc 2.5.0
From source with checksum 6c725ed23b3ecb95921fe461587fccf
This command was run using /home/hadoop/.versions/2.4.0-amzn-4/share/hadoop/common/hadoop-common-2.4.0-amzn-4.jar
[hadoop mapreduce]$ pig -i
Apache Pig version 0.12.0 (rexported)
compiled Jan 24 2015, 01:40:48
[hadoop mapreduce]$

I wrote this small script and am running it in the grunt shell:

register '/home/hadoop/ayan/elasticsearch-hadoop-pig-2.1.0.jar';
DEFINE EsStorage org.elasticsearch.hadoop.pig.EsStorage('es.nodes=xx.xx.xx.xx','es.port=9200');
a = load '/dem/getShipmentById/' using EsStorage();
b = limit a 3;
dump b;

It fails with the following error:

2015-07-24 05:59:38,525 [main] ERROR - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1433395069625_0537_m_000000_3 Info:Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
2015-07-24 05:59:38,525 [main] ERROR - 1 map reduce job(s) failed!

I understand that it is trying to invoke the old MapReduce API, but why, and what is the solution?

(Costin Leau) #2

Unfortunately, a bug sneaked into the Pig integration in the 2.1.0 release. It has been fixed in master, and we plan to release 2.1.1 shortly to address it. In the meantime, you can use the dev release, in particular 2.1.1.BUILD-SNAPSHOT.
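
For illustration, switching to the snapshot should only change the register line of the script in the first post. The jar file name and path below are assumptions modelled on the 2.1.0 file name; they are not stated in this thread, so adjust them to whatever the downloaded snapshot artifact is actually called.

```pig
-- Hypothetical jar name/path: only the version part differs from the original script.
register '/home/hadoop/ayan/elasticsearch-hadoop-pig-2.1.1.BUILD-SNAPSHOT.jar';

-- Unchanged from the original script.
DEFINE EsStorage org.elasticsearch.hadoop.pig.EsStorage('es.nodes=xx.xx.xx.xx','es.port=9200');
a = load '/dem/getShipmentById/' using EsStorage();
b = limit a 3;
dump b;
```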

(Rangan Roy) #3

My Pig script is:

REGISTER elasticsearch-hadoop-2.4.0.jar
REGISTER piggybank-0.15.0.jar
DEFINE EsStorage org.elasticsearch.hadoop.pig.EsStorage();
logs = load 'second_mapping_data.json' using JsonLoader('addr: chararray, logname: chararray, user: chararray, time: chararray, method: chararray, uri: chararray, proto: chararray, status: chararray, bytes: chararray');
STORE logs INTO 'test_index/logsdetails' USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes=endpoint_of_aws_elasticsearch_cluster','es.nodes.wan.only=true');

When I run this from the grunt shell, the data does not reach the AWS Elasticsearch Service. Can you please tell me what I'm missing? I'm uploading a snapshot of the error.

I have full access to AWS ES.

Error (the first part is attached in the picture):

2016-09-23 03:08:10,606 [Thread-11] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2016-09-23 03:08:10,606 [Thread-11] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-09-23 03:08:10,606 [Thread-11] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-09-23 03:08:10,606 [Thread-11] INFO org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter is org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter
2016-09-23 03:08:10,609 [Thread-11] INFO org.apache.hadoop.mapred.LocalJobRunner - Waiting for

2016-09-23 03:08:10,852 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases logs
2016-09-23 03:08:10,852 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: logs[2,7] C: R:
2016-09-23 03:08:10,854 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
java.lang.Exception: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(
at org.apache.hadoop.mapred.LocalJobRunner$
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.pig.EsStorage.putNext(
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(

Caused by: Connection error (check network and/or proxy settings)- all nodes failed; tried [[]]

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.6.0-cdh5.5.0 0.12.0-cdh5.5.0 root 2016-09-23 03:08:10 2016-09-23 03:12:23 UNKNOWN


Failed Jobs:
JobId Alias Feature Message Outputs
job_local685037827_0002 logs MAP_ONLY Message: Job failed! test_index/logsdtls,

Failed to read data from "file:///home/cloudera/Desktop/Satish/json/second_mapping_data.json"

Failed to produce result in "test_index/logsdtls"

Job DAG:

2016-09-23 03:12:23,607 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
grunt> 2016-09-23 03:12:28,648 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - map > map
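
One thing worth checking, though it is not confirmed anywhere in this thread: the STORE statement above sets no 'es.port', so the connector falls back to its default of 9200, while AWS Elasticsearch Service endpoints are normally reached over 80 (HTTP) or 443 (HTTPS). A minimal sketch of the same script with the port made explicit, assuming the endpoint answers on 443 and that the machine running Pig is allowed to reach it:

```pig
REGISTER elasticsearch-hadoop-2.4.0.jar;
REGISTER piggybank-0.15.0.jar;

logs = LOAD 'second_mapping_data.json' USING JsonLoader('addr: chararray, logname: chararray, user: chararray, time: chararray, method: chararray, uri: chararray, proto: chararray, status: chararray, bytes: chararray');

-- Assumption: HTTPS on 443. For plain HTTP, use 'es.port=80' and drop 'es.net.ssl'.
-- 'endpoint_of_aws_elasticsearch_cluster' is the placeholder from the original post.
STORE logs INTO 'test_index/logsdetails' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=endpoint_of_aws_elasticsearch_cluster',
    'es.port=443',
    'es.net.ssl=true',
    'es.nodes.wan.only=true');
```

If the same "Cannot detect ES version" error still appears, the cause is more likely network access or the domain's access policy than the connector settings.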
