Running INSERT into Elasticsearch from Hive

I have an Apache Hadoop 2.6.0 instance and Hive 0.13.1-cdh5.3.2.

I have installed Elasticsearch 1.5.2 from the tar file, and it starts fine.

I have created a table in Hive for ES:
CREATE EXTERNAL TABLE bicon.artists1 (
name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists1');

I have a Hive table (source) with one row in it.

I'm trying to insert a row into Elasticsearch:
INSERT OVERWRITE TABLE artists1
SELECT s.name
FROM source s;

I have added elasticsearch-hadoop-2.1.0.BUILD-20150510.023705-396.jar in Hadoop and I'm referring to it from hive-site.xml.

I also added this jar file to the hive/lib folder.
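For reference, pointing Hive at auxiliary jars in hive-site.xml is typically done with the hive.aux.jars.path property; a sketch (the jar path below is illustrative, not the actual path used here):

```xml
<!-- hive-site.xml: put extra jars on Hive's classpath -->
<!-- the exact path is an assumption -->
<property>
  <name>hive.aux.jars.path</name>
  <value>/usr/lib/hive/lib/elasticsearch-hadoop-2.1.0.jar</value>
</property>
```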

It doesn't work for me. I get the following message:
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

When I dig deeper in Hadoop logs I get the following error (warn):
stderr:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

What am I missing? Thanks in advance!

I have now cloned the elasticsearch-hadoop GitHub repository and ran Gradle to create a snapshot jar file. I tested with it, with no success. Same error.

Does anyone know if log4j.properties must be in the jar file or is the problem somewhere else?

@marko, sorry to hear you are having problems. There haven't been any significant changes between 2.1.0.Beta4 and the snapshot, so I would use the Beta.
Placing the jar in the Hive classpath (potentially hive/lib), instead of using ADD JAR, should be enough for it to be picked up. The exception you posted doesn't indicate that the jar is missing, but rather that something failed while executing the task.
What is your complete script?

One thing that stands out is that your external/Elasticsearch table is called bicon.artists1, while in your insert you are using artists1. Try using the same name, and make sure that the source table source actually has a column named name.
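Concretely, the fully qualified insert would look something like this (assuming the table names from earlier in the thread):

```sql
-- verify the source table actually has a 'name' column
DESCRIBE source;

-- use the fully qualified name of the Elasticsearch-backed table
INSERT OVERWRITE TABLE bicon.artists1
SELECT s.name
FROM source s;
```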

I believe the problem is in my Hadoop installation; I'm missing something there. Thank you for your time.

You could try one of the VMs available - it's a fairly quick way of checking the sanity of a given Hadoop setup.

I added container-log4j.properties in HADOOP_HOME with the following lines:
log4j.logger.org.apache.hadoop.mapreduce=WARN
log4j.logger.org.apache.hadoop.mapred=WARN
log4j.logger.org.apache.hadoop.yarn=WARN
log4j.logger.org.apache.hadoop.hive=WARN
log4j.security.logger=WARN

I moved on to the next error:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.impl.MetricsSystemImpl).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
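For what it's worth, this warning just means log4j 1.2 found no appender configured for the logger in question; a minimal log4j.properties along these lines (a sketch, assuming console output is acceptable) makes it go away:

```properties
# minimal log4j 1.2 setup: send everything at WARN and above to the console
log4j.rootLogger=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c: %m%n
```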

Can you copy/paste your container-log4j.properties here, Costin?

I installed Elasticsearch on the Oracle Big Data Lite VM, added the elasticsearch-hadoop jar file to $HIVE_HOME/lib, restarted, and got the same issue.

I will now try with Cloudera's VM.

Or did you have any particular VM in mind, Costin?

Typically log4j expects a log4j.properties, while Hive has various options depending on the version used.
Regarding the VM, I was thinking of a popular one such as Cloudera or Hortonworks, simply because there's a lot of material about it and you can more easily debug it later on.

I tried with Cloudera's VM as well:
I installed Elasticsearch 1.5.2 and started it.

copied the elasticsearch-hadoop jar to these two locations:
/usr/lib/hive/lib/elasticsearch-hadoop-2.0.2.jar
/usr/lib/hadoop/lib/elasticsearch-hadoop-2.0.2.jar

rebooted the VM

ran this in Hive:
ADD JAR /home/cloudera/dl/elasticsearch-hadoop-2.0.2/dist/elasticsearch-hadoop-2.0.2.jar;

created an elasticsearch directory under /user/cloudera

created a table in HUE:
CREATE EXTERNAL TABLE test_es ( id BIGINT, tekst STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '/user/cloudera/elasticsearch/test_es');

I'm running the insert:
INSERT OVERWRITE TABLE test_es SELECT id, tekst FROM test;

and get the following:

15/05/17 11:06:12 ERROR operation.Operation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:147)
at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69)
at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Resource Manager gives me this under Diagnostics:

Application application_1431812993284_0005 failed 2 times due to AM Container for appattempt_1431812993284_0005_000002 exited with exitCode: 1 due to: Exception from container-launch.
Container id: container_1431812993284_0005_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
.Failing this attempt.. Failing the application.

Hope you can help me out of this one.

Can you please try using 2.1.0.Beta4 instead? The recent versions of Hive (included in Cloudera) have some major changes which have been addressed in 2.1 but not (yet) in 2.0.x.

Thanks,

I solved the issue.
I generated the jar file on my own instance, found an example on the Rittman blog, used that one, and it worked.
This is the part I didn't have in my CREATE TABLE:
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
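For completeness, the working DDL presumably looked something along these lines (reconstructed from the snippets in this thread; the index/type name is an assumption):

```sql
CREATE EXTERNAL TABLE test_es (id BIGINT, tekst STRING)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test_es/doc');
```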

I really appreciate you taking the time to help me out, Costin. I'm working on a POC for a customer and I'm really eager to push Elasticsearch and Kibana; your quick response was definitely a plus!
Thanks again.

Glad to assist. The ROW FORMAT SERDE is not really needed if you use the EsStorageHandler, as the SerDe is specified automatically.

Not sure what you mean by generating your own jar - did that, or the example from the Rittman blog, have something that Beta4 lacked?

I cloned from GitHub and ran gradlew; the jar that was built is the one I used.

When it comes to the example from Rittman's blog, I'm beginning to suspect that es.resource might have been the problem. Looking at my second-to-last post, I wrote my HDFS path in es.resource, not realizing it expects an index and mapping.
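In other words, es.resource in elasticsearch-hadoop names an Elasticsearch index/type pair rather than an HDFS path; for example (the names below are illustrative):

```sql
-- wrong: an HDFS path
--   'es.resource' = '/user/cloudera/elasticsearch/test_es'
-- right: <index>/<type>
CREATE EXTERNAL TABLE test_es (id BIGINT, tekst STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test_es/doc');
```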

Hope I answered your question.

You did. The jar is built nightly and pushed to Maven, so one can use that instead of compiling things by hand.

As for using the wrong es.resource - you should get a proper error message; however, Hive makes things difficult as it swallows the underlying exception, and many times the logs need to be checked to understand the underlying issue.