I searched the topics, posts, users, and categories, and no results came up for HDInsight. Running Elasticsearch inside a Windows-based Hadoop cluster may seem like an exotic concern, but I'm curious whether it's possible, because I noticed that Solr can be installed on an HDInsight cluster.
Well, unfortunately, it seems there isn't much interest in this challenge, so I thought I should give it a try myself. I created an HDInsight cluster with the following parameters:
- cluster type: HBase
- operating system: Windows Server 2012 R2 Datacenter
- HDInsight version: 3.2 (HDP 2.2, Hadoop 2.6, HBase 0.98.4)
- 4 data nodes
I downloaded elasticsearch-hadoop-2.1.0.zip on the head node, extracted its contents, and tried to follow the guidelines in the Elasticsearch on YARN usage section, but came across a showstopper early on: I got a java.lang.IllegalStateException while trying to provision Elasticsearch into HDFS using the command "hadoop jar elasticsearch-yarn-2.1.0.jar -install-es". I also tried to install an older version of Elasticsearch, but es.version is rejected with an IllegalArgumentException when used on Windows.
I noticed that the Java version in the HDInsight cluster is 1.7.0_55.
The same unsuccessful behavior occurs locally with the HDInsight emulator (running a Hadoop cluster type); Java version: 1.7.0_65.
It works, though, in HDP (Hortonworks Data Platform sandbox) 2.3 Tech Preview; Java version 1.7.0_79 (running on CentOS).
It works locally as well (in the HDInsight emulator running on Windows 8.1 Pro) after installing the latest version of Java (1.8.0_45) and changing JAVA_PATH accordingly (to point to the new JRE).
So the primary conclusion on this topic would be: not yet. It looks like the latest es-hadoop jars have some version dependencies which cannot be resolved in the current HDInsight environment, and at the same time Elasticsearch on Hadoop is not that friendly with Windows either (the es.version param is recognized on CentOS but rejected on Windows).
As a postscript, I would like to mention that I am a newbie in this field, and I would really appreciate some competent opinions on this concern of mine: running Elasticsearch on YARN under a Windows-based HDInsight cluster, along with HBase, hopefully sharing the same cluster.
The es-hadoop connector allows you to pass data between the two for processing, but you cannot install ES onto Hadoop.
I did manage to install it in two Hadoop environments, the HDP 2.3 Tech Preview and the HDInsight emulator (after installing the latest version of Java), but I couldn't make it work in the cloud, due to some version discrepancies, I presume. I would be interested to see it working in the cloud, because the local test environments I evaluated are single-node clusters and cannot be extended to multiple nodes, as would be the case in production.
Thanks for the write-up.
Running ES on hosting services typically requires a dedicated integration to take full advantage of the particularities of each environment. The es-yarn integration relies only on the public YARN API; different implementations or extensions may extend the API where possible.
Can you expand on what the error was? Potentially raise an issue; it's unclear whether it was a parsing issue or something related to the HDInsight environment. Even if the exception still occurs, the message should be clear enough that the user understands what is going on.
It sounds like it is related to the Java version used; this might be triggered by Elasticsearch: if it detects a known JVM version that leads to corruption, it will do a hard stop, as otherwise there are no guarantees regarding data safety.
es-yarn should work on Windows just as it does on Linux; in fact, I'm developing on Windows myself. As for the dependencies, es-yarn relies on the Hadoop libraries to interact with YARN; if a certain version is used, the same Hadoop jars are required by es-yarn as well, simply because the RPC calls might fail otherwise.
Without seeing the actual exceptions/error messages, I cannot tell for sure what is going on...
Thank you for your involvement and patience.
Don't take offence at my innuendo, but should I understand from your first sentence that you do not intend to provide a dedicated integration with HDInsight? Please note that Solr is already present in that environment through an action script which can be applied during the provisioning phase for any type of HDInsight cluster.
Regarding the versions, I have already mentioned them in my second post, but just to summarize, here are the values for HDInsight on Azure:
- Hadoop: 2.6
- Java: 1.7.0_55
Thank you again for your support; hoping to hear some good news from you soon, I'll just paste the error log here (as there is no way to attach anything):
C:\apps\dist>hadoop jar elasticsearch-hadoop-2.1.0\dist\elasticsearch-yarn-2.1.0.jar -install-es
Abnormal execution:Cannot upload C:\apps\dist.\downloads\elasticsearch-1.6.0.zip in HDFS at apps/elasticsearch/elasticsearch-1.6.0.zip
java.lang.IllegalStateException: Cannot upload C:\apps\dist.\downloads\elasticsearch-1.6.0.zip in HDFS at /apps/elasticsearch/elasticsearch-1.6.0.zip
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///
... 11 more
If there's going to be such an integration, it will not be part of the es-hadoop connector, simply because it is outside its scope. It will more likely be something that is either provided by HDInsight (Microsoft) or a plugin similar to cloud-aws, for example.
If you're looking at it in terms of provisioning, then maybe the Puppet module would be useful there.
As for the exception, if you look at the root cause, you'll notice it's the Hadoop API that complains about the location, in particular the fact that it is not a valid HDFS resource:
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:142)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2619)
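To make the complaint concrete: hdfs:/// has a scheme but an empty authority, so there is no namenode host for the client to connect to. A quick illustration in plain Python of what a generic URI parser sees (the namenode:8020 host below is a made-up placeholder, not a value from this thread):

```python
from urllib.parse import urlsplit

# "hdfs:///" has a scheme but an empty authority (no host:port), which is
# what Hadoop's DistributedFileSystem.initialize is objecting to.
broken = urlsplit("hdfs:///apps/elasticsearch/elasticsearch-1.6.0.zip")
print(broken.scheme, repr(broken.netloc))  # hdfs ''

# A fully-qualified URI carries the namenode host, so it can be resolved.
ok = urlsplit("hdfs://namenode:8020/apps/elasticsearch/elasticsearch-1.6.0.zip")
print(ok.scheme, ok.netloc)                # hdfs namenode:8020
```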
It looks like a bug (though we do pass the URI as a whole). Can you try specifying the URI manually, using one of the following commands?
hadoop jar elasticsearch-yarn-2.1.0.jar -install-es hdfs.upload.dir=file:///
hadoop jar elasticsearch-yarn-2.1.0.jar -install-es hdfs.upload.dir=hdfs://myhost/
Basically, it looks like the local HDFS file system is not properly configured to deal with the local file-system path used.
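If that is the case, the fix would most likely live in the cluster's core-site.xml: a bare hdfs:/// URI is completed from the default file system, which must name a host. As a hedged sketch (the host and port below are placeholders, not values from this thread); note also that HDInsight clusters typically point fs.defaultFS at an Azure Blob (wasb://) endpoint rather than HDFS, which may be why the bare hdfs:/// URI cannot be resolved:

```xml
<!-- core-site.xml sketch; host/port are placeholders for illustration only -->
<property>
  <name>fs.defaultFS</name>
  <!-- a bare hdfs:/// URI is completed from this value, so it must include a host -->
  <value>hdfs://namenode-host:8020</value>
</property>
```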
I had already tried to configure Elasticsearch on YARN, as mentioned above, when I presumed that there was a version-related issue: "I also tried to install an older version of elasticsearch but es.version is rejected with an IllegalArgumentException when using it on Windows."
Now trying your suggested way to provision Elasticsearch into HDFS I get the following response:
C:\apps\dist\elasticsearch-hadoop-2.1.0\dist>hadoop jar elasticsearch-yarn-2.1.0.jar -install-es hdfs.upload.dir=file:///apps/elasticsearch/
Abnormal execution:Invalid argument hdfs.upload.dir
java.lang.IllegalArgumentException: Invalid argument hdfs.upload.dir
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
That's because you are not specifying the arguments properly; you should use = between the key and the value, not a space, as indicated here.
I'm sorry, but I don't really understand this advice, since I did use =:
"dist>hadoop jar elasticsearch-yarn-2.1.0.jar -install-es hdfs.upload.dir=file:///apps/elasticsearch/"
@Sampaio sorry, I was reading your post by email and did not see the =. Looks like a bug, or at least incorrect exception reporting.
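Purely as an illustration of how such a bug could arise (this is a hypothetical sketch, not es-yarn's actual code): if the option parser validates keys against an allowlist, a well-formed key=value pair whose key is missing from the list can get reported with the same "Invalid argument" message as a malformed one, which matches the behavior seen above:

```python
# Hypothetical sketch of allowlist-based key=value option parsing; the
# allowlist contents and function name are illustrative, not from es-yarn.
KNOWN_KEYS = {"es.version"}  # suppose this build's allowlist omits hdfs.upload.dir

def parse_option(arg: str) -> tuple[str, str]:
    key, sep, value = arg.partition("=")
    if not sep or key not in KNOWN_KEYS:
        # Same message whether '=' was missing or the key is simply unknown,
        # which makes the real cause hard to diagnose from the error alone.
        raise ValueError(f"Invalid argument {key}")
    return key, value

try:
    parse_option("hdfs.upload.dir=file:///apps/elasticsearch/")
except ValueError as e:
    print(e)  # Invalid argument hdfs.upload.dir
```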
Is there any advancement on the subject?