Hive (HDP 2.3) and ES-Hadoop Integration Issue


(Vinod P) #1

Hi,

I'm trying to integrate Hive running on Hortonworks HDP 2.3 with Elasticsearch 2.0, Kibana 4.2.0, and the ES-Hadoop connector (2.2.0 beta1), and I ran into the following issues:

  1. A map-reduce job using the Tez execution engine fails completely when writing to an Elasticsearch index, but switching the execution engine to MR makes the same job complete successfully. Is this a bug, a config change required in the Hive settings on the Hortonworks side, or a bug in the es-hadoop JAR while communicating with Hive? If it's a settings issue, what would those settings be, or is there another fix available?
  2. Any idea how to add a jar in an HDP 2.3 cluster? I tried the documented approach but couldn't get it to work; I ended up doing it in the script for now.
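
For reference, the "doing it in the script" workaround from point 2 amounts to a session-scoped ADD JAR at the top of the Hive script; the path and file name below are illustrative and depend on where the connector was downloaded:

```sql
-- Register the ES-Hadoop connector for this Hive session only;
-- the path and file name are illustrative.
ADD JAR /tmp/elasticsearch-hadoop-2.2.0-beta1.jar;
```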

Thanks,
Vinod


(Vinod P) #2

Hi,

Does anyone have any clue on this, please?

Thanks


(Costin Leau) #3
  1. What kind of error/problem do you encounter when running under Tez? Can you post your configuration and the complete error (potentially as a gist)?

  2. I'm not sure what you mean. In step 1 you indicate you were able to run the job - how is that possible without adding the jar to the classpath? If you are asking how to add the jar by default across the cluster so you don't have to declare the jar each time, that's typically a distro-specific question, and it usually means copying the jar into the hive/hadoop lib folder and making sure it always gets added to each job's classpath.
    This may or may not be what you want, as it will affect all jobs.
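
A rough sketch of that cluster-wide approach, assuming passwordless SSH to the worker nodes — host names, the jar name, and the target path below are all illustrative:

```shell
# Print the scp commands that would push the connector jar into
# Hive's lib directory on every node (pipe the output to sh to run).
# Host names, jar name, and target path are illustrative.
JAR=elasticsearch-hadoop-2.2.0-beta1.jar
for host in node1 node2 node3; do
  echo scp "$JAR" "$host:/usr/hdp/current/hive-client/lib/"
done
```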


(Vinod P) #4

Hi Costin,

Thanks for the response.

Below are the error logs for the Tez execution engine issue while performing an INSERT OVERWRITE operation. Once I set hive.execution.engine=mr at the Hive shell prompt, the same insert operation works just fine.

Thanks

Vertex failed, vertexName=Map 1, vertexId=vertex_1447001416210_0046_1_00, diagnostics=[Task failed, taskId=task_1447001416210_0046_1_00_000030, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:345)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:229)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
... 14 more
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/httpclient/URIException
at org.elasticsearch.hadoop.hive.HiveUtils.structObjectInspector(HiveUtils.java:57)
at org.elasticsearch.hadoop.hive.EsSerDe.initialize(EsSerDe.java:82)
at org.elasticsearch.hadoop.hive.EsSerDe.initialize(EsSerDe.java:97)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:356)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:362)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:481)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:438)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:481)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:438)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:481)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:438)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.MapOperator.initializeMapOperator(MapOperator.java:442)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:198)
... 15 more
Caused by: java.lang.ClassNotFoundException: org.apache.commons.httpclient.URIException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 31 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:184, Vertex vertex_1447001416210_0046_1_00 [Map 1] killed/failed due to:null]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0


(Costin Leau) #5

The issue is that Tez removes commons-httpclient from the classpath (typically it is available), which causes the ClassNotFoundException. The immediate workaround is to add this jar back. Do you know what your job classpath is?
It's ironic that jetty & co are included but this jar in particular is not.


(Vinod P) #6

Would you know the jar name for this? I can look it up and let you know. I'm fairly new to HDP and not quite sure how to get a dump of my classpath; any pointers on getting the current values of the environment variables would be helpful.
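
For what it's worth, on a cluster node the `hadoop classpath` command prints the classpath the Hadoop launcher scripts compute, and splitting it on `:` makes it easy to scan for a particular jar. The sample string below is illustrative and stands in for real `hadoop classpath` output:

```shell
# Split a classpath string on ':' and look for the jar in question.
# The sample value is illustrative; on a real node you would use
#   classpath="$(hadoop classpath)"
classpath="/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hive/lib/commons-httpclient-3.0.1.jar"
echo "$classpath" | tr ':' '\n' | grep commons-httpclient
```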

Thanks


(Vinod P) #7

OK, I finally got it working by adding the jar :smile:

add jar /usr/hdp/2.3.0.0-2557/hive/lib/commons-httpclient-3.0.1.jar

Now, any pointers on how to make sure all these jars are available at the cluster level in HDP 2.3?

Thanks


(Vinod P) #8

Here's the value of the classpath variable. I don't see the Hive path in here:

/usr/hdp/2.3.0.0-2557/hadoop/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java-5.1.17.jar:/usr/share/java/mysql-connector-java.jar:/usr/hdp/2.3.0.0-2557/tez/*:/usr/hdp/2.3.0.0-2557/tez/lib/*:/usr/hdp/2.3.0.0-2557/tez/conf


(Costin Leau) #9

Hive is not required - as long as commons-httpclient is in there, you should be fine.
I'm not aware of an easy way to propagate a change like this across an HDP cluster, but I'd expect one to exist, since distros typically expose this kind of setting in their management UI.

Do note that you should be able to add commons-httpclient as part of your job (whether it runs on MapReduce or Tez) by adding it in your Hive script (see the docs). This approach is Hive-specific and should work across various distros/versions.
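
Combined with the earlier workaround, a distro-agnostic Hive script could start by putting both jars on the job classpath. The commons-httpclient path is the sandbox one from above; the connector path is illustrative:

```sql
-- Add both jars before any ES-backed statement, so the job works
-- under either execution engine (MR or Tez).
ADD JAR /usr/hdp/2.3.0.0-2557/hive/lib/commons-httpclient-3.0.1.jar;
ADD JAR /tmp/elasticsearch-hadoop-2.2.0-beta1.jar;  -- illustrative path
```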

