I'm using version 2.2.0 of the elasticsearch-hadoop library. My ES cluster is hosted on Amazon Elasticsearch Service (https://aws.amazon.com/elasticsearch-service/).
There is only one exposed URL to hit the cluster (https://myurl.com).
I've modified the access policy on AWS to allow requests from specific IPs, and I can confirm this works because I am able to send an HTTP request from those IPs and get a response back.
After this setup, however, an Apache Pig script is unable to connect to the instance through the given URL. The error is:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot resolve ip for hostname: myurl.com
at org.elasticsearch.hadoop.util.SettingsUtils.resolveHostToIpIfNecessary(SettingsUtils.java:78)
at org.elasticsearch.hadoop.util.SettingsUtils.qualifyNodes(SettingsUtils.java:40)
at org.elasticsearch.hadoop.util.SettingsUtils.declaredNodes(SettingsUtils.java:118)
at org.elasticsearch.hadoop.util.SettingsUtils.discoveredOrDeclaredNodes(SettingsUtils.java:124)
at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:58)
at org.elasticsearch.hadoop.rest.RestClient.<init>(RestClient.java:84)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:174)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:378)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
at org.elasticsearch.hadoop.pig.EsStorage.putNext(EsStorage.java:192)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:285)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
I am able to send HTTP requests to the URL, but I am unable to ping the hostname itself, which explains why the library cannot resolve the IP. At this point, I'm not sure how I'm supposed to connect to the cluster on AWS. Any help would be greatly appreciated.
@shaunak Formatted your post (again) to make it readable. Please do so yourself in the future - the easier it is to read, the higher the chances for a proper response.
Your post doesn't contain any information about the setup - the stacktrace indicates your assessment is correct but that's about it.
What does your Pig configuration look like (the setup for the ES cluster)? Try turning on logging for the REST package as explained here to see what connections are being made.
Outside of the official testing, there are several users running the integration against an ES cluster on AWS.
Is your Pig job running locally or not? If you can send HTTP requests to the URL, do you have a proxy set up, by any chance?
What error does ping give? Have you tried traceroute/tracert?
Apologies! I'll try to be more verbose in this post.
I am running Pig locally (pig -x local). I'm not sure what you mean by Pig configuration (the setup for the ES cluster)? In the script, I connect using the library with the command below; I don't have any other configuration as such. I am just trying to connect to the ES cluster from within Pig with this call:
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes=https://url.com');
Turning on logging in debug mode does not give any useful information, as far as I can tell. It logs the version of the library in use (2.2.0) and:
DEBUG pig.EsStorage: Using pre-defined writer serializer [org.elasticsearch.hadoop.pig.PigValueWriter] as default
The error ping gives is 'cannot resolve https://url.com: Unknown host'. Traceroute gives an error saying the hostname is too long. Yes, there is a proxy, but I don't believe it blocks outgoing ping requests.
I tried using the direct IP behind the URL on AWS; in that case the library says 'Connection timed out, retrying request', and ping also gets no response from the IP. I can't really use the IP anyway, because AWS is probably using a load balancer to direct requests to different hosts.
It looks like you have a network misconfiguration. Likely your proxy configuration makes the request work in the browser (since the browser is the one using the proxy), while the rest of the tools fail since they don't go through the proxy and thus can't reach the host.
Getting the IP should not be a problem - AWS does use an external IP for the given host and changing it on each request is not really a good idea (though of course, it can happen).
Investigate what is your proxy configuration in the browser and configure ES-Hadoop accordingly.
Do note that you need to set the cloud/WAN configuration as well (es.nodes.wan.only) - it doesn't seem to be in your script. See the sketch below.
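For example, a rough sketch of what the storage call could look like with both the proxy and WAN settings applied (the proxy host and port here are placeholders for whatever your browser is actually configured to use):
-- Sketch: es.nodes.wan.only restricts the connector to the declared node
-- (no node discovery), and the es.net.proxy.https.* settings mirror the
-- browser's proxy configuration. Replace the proxy values with your own.
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=https://url.com',
    'es.nodes.wan.only=true',
    'es.net.proxy.https.host=proxy.example.com',
    'es.net.proxy.https.port=8080');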
Also, if it helps, I am running this on a RHEL VM on a desktop and then SSH-ing into it to run my scripts. Both machines are on the same network, however (stupid setup, don't ask why).
So, I am able to curl the host from the command line as well and I get a response from ES. Would it still be a network misconfiguration?
Likely your VM doesn't properly route requests outside the host or its DNS is misconfigured. This is typically the case with Hadoop VMs (many times one can't even download new packages onto it).
Try to poke around and fix its DNS (or rather its nameserver) so it can reach the outside world - it should be able to do so without SSH or anything like that.
Your distro documentation likely has some pointers. Any other solution would be far more complicated; based on what you describe, fixing the DNS is probably the easier route for you.
Ok, so there was a problem with my URL: I had to remove the trailing slash at the end for it to resolve the IP (i.e. es.nodes=https://url.com instead of es.nodes=https://url.com/).
The host command correctly identifies the IP, and so does elasticsearch-hadoop. But now the problem is that the library is timing out on the request:
16/01/15 11:03:08 INFO httpclient.HttpMethodDirector: I/O exception (java.net.ConnectException) caught when processing request: Connection timed out
16/01/15 11:03:08 INFO httpclient.HttpMethodDirector: Retrying request
Is this because the data I'm writing may be too big to send in one request? Or does the library break it down into chunks?
EDIT: It's not because of the size of the data. I get the error even with an extremely small dataset.
Your VM is misconfigured and doesn't allow access to the outside world. Forget about Pig, ES-Hadoop, etc. for the moment - just try to connect from the VM to Google or any other public site. Once that works, you can resume.
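To answer the batching question: yes, the connector does break writes into chunks - documents are buffered and flushed as bulk requests, capped by the es.batch.size.* settings. A sketch with those settings spelled out explicitly (the values shown are the defaults, so this is equivalent to not setting them at all):
-- Sketch: each bulk request is flushed once it reaches es.batch.size.bytes
-- of data or es.batch.size.entries documents, whichever comes first.
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=https://url.com',
    'es.batch.size.bytes=1mb',
    'es.batch.size.entries=1000');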
Sorry for the late reply. Yes, I was able to connect by explicitly specifying the port for HTTP (80) or HTTPS (443); I had not caught the fact that the port defaults to 9200. I also had to enable the WAN-only mode (es.nodes.wan.only). Thanks for all the help.
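For anyone landing here later, the working storage call presumably looked something along these lines (the URL is a placeholder; use es.port=443 for HTTPS or es.port=80 for plain HTTP instead of the 9200 default):
-- Sketch of the working setup: point es.nodes at the AWS endpoint, override
-- the default port (9200) with 443 for HTTPS, and enable WAN-only mode so the
-- connector talks only to the declared endpoint instead of discovering nodes.
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=https://url.com',
    'es.port=443',
    'es.nodes.wan.only=true');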