Found - Hadoop integration issue

Hi there,

I've just signed up for a trial account with Found and am trying to ingest data through HiveQL using the elasticsearch-hadoop library, like so:

ADD JAR some/url/elasticsearch-hadoop-2.2.0.jar;

CREATE EXTERNAL TABLE es_event (
requestId STRING,
ipaddress STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test',
'es.index.auto.create' = 'true',
'es.nodes' = 'found-host:9200');

Now every time I attempt to connect to Found, I get an EsHadoopInvalidRequest: null error. Looking a bit deeper into the code, the error seems to be caused by a 404 Not Found response. I assume this is because the library hits the ES API using the IP address instead of the provided host URL.

Manually hitting the API using the ip address through the browser returns this: {"ok":false,"message":"Unknown cluster."}

Can anyone help me with this issue please?

Hi @d3nyable,

Can you try setting the es.nodes property to your cluster URL? You need to replace found-host with it.

Thanks,
Igor

I did set the es.nodes value correctly according to my cluster URL like so:

'es.nodes' = 'xxxxxxxx.us-east-1.aws.found.io:9200'

I didn't write it like that initially because I thought it was obvious enough and didn't see the point of exposing the real URL. Any other suggestions?

Hi,

The URL indeed looks ok.

Do you use Shield in your cluster? If so, you'll also need to provide an Elasticsearch user and password, as described in the Security section of the Elasticsearch for Apache Hadoop documentation:

Username/Password

Set these through es.net.http.auth.user and es.net.http.auth.pass properties.
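
For reference, here is a minimal sketch of how those two properties might be added to the table definition from the first post. The user name and password are placeholder values, not real credentials, and found-host still stands in for the actual cluster URL.

```sql
-- Sketch only: the table from the original post with HTTP basic auth added.
-- 'es_user' and 'es_password' are hypothetical placeholders.
CREATE EXTERNAL TABLE es_event (
  requestId STRING,
  ipaddress STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test',
  'es.index.auto.create' = 'true',
  'es.nodes' = 'found-host:9200',
  'es.net.http.auth.user' = 'es_user',
  'es.net.http.auth.pass' = 'es_password');
```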

If this does not help, can you please provide the first 6 characters of your cluster ID? We will check if there is anything wrong with the cluster.

Thanks,
Igor

Hi @igor_k

Hitting the cluster host URL (e.g. http://xxxx.us-east-1.aws.found.io:9200/) brings back a result like so:

{ "status" : 200, "name" : "instance-0000000000", "cluster_name" : "xxxxx", "version" : { "number" : "1.7.5", "build_hash" : "xxxx", "build_timestamp" : "2016-02-02T09:55:30Z", "build_snapshot" : false, "lucene_version" : "4.10.4" }, "tagline" : "You Know, for Search" }

but when I hit the remote IP address instead (e.g. x.x.x.x:9200), it returns:
{"ok":false,"message":"Unknown cluster."}

I think that is the cause of the issue: the elasticsearch-hadoop library resolves the URL in the es.nodes parameter to its remote IP address and uses that to make the HTTP request.

For cloud/wan environments, es-hadoop 2.2 introduced the wan option. It works with ES 1.x and 2.x. Please see the es-hadoop docs for more info.
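
As a rough sketch, assuming the option meant here is es.nodes.wan.only, it would be set in the table properties along these lines (table and host names are reused from the first post):

```sql
-- Sketch only: enable WAN-only mode so the connector talks solely to the
-- declared es.nodes instead of discovering other cluster nodes.
CREATE EXTERNAL TABLE es_event (
  requestId STRING,
  ipaddress STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test',
  'es.index.auto.create' = 'true',
  'es.nodes' = 'found-host:9200',
  'es.nodes.wan.only' = 'true');
```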

Just for the record, here is the link

Hi @costin

Yeah, I know about that setting and had also tried turning it on, with no luck. The library is still making HTTP requests using the remote IP address instead of the host URL.

It is weird that Found does not like it and returns 404 instead. FYI, I also tried AWS's new ES service and it works fine. Hitting either the remote IP or the URL returns 200 with the expected JSON body.

I actually found the cause of this issue: SettingsUtils.resolveHostToIpIfNecessary(), which attempts to resolve the host URL to an IP address.

Making an HTTP request using the remote IP address upsets Found, which returns 404 because the request is missing the expected Host header.

Is there a reason for this logic?

See https://github.com/elastic/elasticsearch-hadoop/issues/640
Basically, hostnames and IPs represent the same thing: hosts. So when dealing with lookups, one needs to resolve them to a single form that can be used for comparison. Hostnames are notoriously problematic since there can be multiple hostnames pointing to the same IP, there can be aliases, etc.
So using the IP instead is a great way of making things simpler.

@d3nyable A solution to your case would be making resolution optional and maybe even disabling it by default for wan cases.

FTR, linking the github issue to this post.

Should I make a pull request to disable the logic when the wan setting is set to true?

Thanks for your help BTW.

PRs are always welcome. I'll probably get around to it by next week (currently travelling).

Cheers,

Hi igor_k,
I am new to the ELK stack and need your help regarding native authentication.

While performing an ES-Hive integration, per your response, we need to define the es.net.http.auth.user and es.net.http.auth.pass properties as part of the table DDL. The question is how to secure the credentials, since anyone with read access to the table metadata, or permission to run "show create table", can see them.

Any help/guidance in this regard is appreciated.