Hive (HDP 2.3) and ES-Hadoop Integration Issue


(Vinod P) #1

Hi,

I'm trying to integrate Hive running on Hortonworks HDP 2.3 with Elasticsearch 2.0, Kibana 4.2.0, and the ES-Hadoop connector (2.2.0 beta1), and I ran into the following issues:

  1. A map-reduce job using the Tez execution engine fails completely when writing to an Elasticsearch index, but switching the execution engine to MR makes the same job complete successfully. Is this a bug, a config change required in the Hive settings on the Hortonworks side, or a bug in the es-hadoop JAR while communicating with Hive? If it's a settings issue, what would those settings be, or is there another fix available?
  2. Any idea how to add a jar in an HDP 2.3 cluster? I tried the documented approach but couldn't get it to work; I ended up doing it in the script for now.
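
For reference, the "doing it in the script" workaround from point 2 amounts to a session-scoped ADD JAR at the top of the Hive script; the path and file name below are illustrative and depend on where the connector was downloaded:

```sql
-- Register the ES-Hadoop connector for this Hive session only;
-- the path and file name are illustrative.
ADD JAR /tmp/elasticsearch-hadoop-2.2.0-beta1.jar;
```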

Thanks,
Vinod


(Vinod P) #2

Hi,

Does anyone have any clue on this, please?

Thanks


(Costin Leau) #3
  1. What kind of error/problem do you encounter when running under Tez? Can you post your configuration and the complete error (potentially as a gist)?

  2. I'm not sure what you mean. In step 1 you indicate you were able to run the job - how is that possible without adding the jar to the classpath? If you are asking how to add the jar by default across the cluster so you don't have to declare the jar each time, that's typically a distro-specific question, and it usually means copying the jar into the hive/hadoop lib folder and making sure it always gets added to each job's classpath.
    This may or may not be what you want, as it will affect all jobs.
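
A rough sketch of that cluster-wide approach, assuming passwordless SSH to the worker nodes — host names, the jar name, and the target path below are all illustrative:

```shell
# Print the scp commands that would push the connector jar into
# Hive's lib directory on every node (pipe the output to sh to run).
# Host names, jar name, and target path are illustrative.
JAR=elasticsearch-hadoop-2.2.0-beta1.jar
for host in node1 node2 node3; do
  echo scp "$JAR" "$host:/usr/hdp/current/hive-client/lib/"
done
```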


(Vinod P) #4

Hi Costin,

Thanks for the response.

Below are the error logs for the Tez execution engine issue while performing an INSERT OVERWRITE operation. Once I set hive.execution.engine=mr at the Hive shell prompt, the same insert operation works just fine.

Thanks

Vertex failed, vertexName=Map 1, vertexId=vertex_1447001416210_0046_1_00, diagnostics=[Task failed, taskId=task_1447001416210_0046_1_00_000030, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:345)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:229)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
... 14 more
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/httpclient/URIException
at org.elasticsearch.hadoop.hive.HiveUtils.structObjectInspector(HiveUtils.java:57)
at org.elasticsearch.hadoop.hive.EsSerDe.initialize(EsSerDe.java:82)
at org.elasticsearch.hadoop.hive.EsSerDe.initialize(EsSerDe.java:97)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:356)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:362)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:481)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:438)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:481)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:438)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:481)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:438)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.MapOperator.initializeMapOperator(MapOperator.java:442)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:198)
... 15 more
Caused by: java.lang.ClassNotFoundException: org.apache.commons.httpclient.URIException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 31 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:184, Vertex vertex_1447001416210_0046_1_00 [Map 1] killed/failed due to:null]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0


(Costin Leau) #5

The issue is that Tez removes commons-httpclient from the classpath (typically it is available), which causes the ClassNotFoundException. The immediate workaround is to add this jar back. Do you know what your job classpath is?
It's ironic that jetty & co are included but this jar in particular is not.


(Vinod P) #6

Would you know the jar name for this? I can look it up and let you know. I'm fairly new to HDP and not quite sure how to get a dump of my classpath; any pointers on getting the current values of the environment variables would be helpful.
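
For what it's worth, on a cluster node the `hadoop classpath` command prints the classpath the Hadoop launcher scripts compute, and splitting it on `:` makes it easy to scan for a particular jar. The sample string below is illustrative and stands in for real `hadoop classpath` output:

```shell
# Split a classpath string on ':' and look for the jar in question.
# The sample value is illustrative; on a real node you would use
#   classpath="$(hadoop classpath)"
classpath="/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hive/lib/commons-httpclient-3.0.1.jar"
echo "$classpath" | tr ':' '\n' | grep commons-httpclient
```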

Thanks


(Vinod P) #7

OK, I finally got it working by adding the jar :smile:

add jar /usr/hdp/2.3.0.0-2557/hive/lib/commons-httpclient-3.0.1.jar

Now, any pointers on how to make sure all these jars are available at the cluster level in HDP 2.3?

Thanks


(Vinod P) #8

Here's the value of the classpath variable. I don't see the Hive path in here:

/usr/hdp/2.3.0.0-2557/hadoop/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java-5.1.17.jar:/usr/share/java/mysql-connector-java.jar:/usr/hdp/2.3.0.0-2557/tez/*:/usr/hdp/2.3.0.0-2557/tez/lib/*:/usr/hdp/2.3.0.0-2557/tez/conf


(Costin Leau) #9

Hive is not required - as long as commons-httpclient is in there, you should be fine.
I'm not aware of an easy way to propagate a change like this across an HDP cluster, but I'd expect one to exist, since distros typically expose this kind of setting in their management UI.

Do note that you should be able to add commons-httpclient as part of your job (whether it runs on MapReduce or Tez) by adding it in your Hive script (see the docs). This approach is Hive-specific and should work across various distros/versions.
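
Combined with the earlier workaround, a distro-agnostic Hive script could start by putting both jars on the job classpath. The commons-httpclient path is the sandbox one from above; the connector path is illustrative:

```sql
-- Add both jars before any ES-backed statement, so the job works
-- under either execution engine (MR or Tez).
ADD JAR /usr/hdp/2.3.0.0-2557/hive/lib/commons-httpclient-3.0.1.jar;
ADD JAR /tmp/elasticsearch-hadoop-2.2.0-beta1.jar;  -- illustrative path
```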

