Hi,
We are testing pushing some data from our on-premises big data cluster to the Elasticsearch service running on AWS. The Elasticsearch version is 5.5.2. We want to do a POC project using Spark + Elasticsearch for Apache Hadoop (ES-Hadoop). Our Spark cluster is running Spark 1.6.2.
First, I followed the examples and documentation on my laptop, using the following code:
./spark-1.6.3-bin-hadoop2.6/bin/spark-shell --master local[2] --jars repository/org/elasticsearch/elasticsearch-spark-13_2.10/5.5.2/elasticsearch-spark-13_2.10-5.5.2.jar --conf spark.es.nodes=aws-public-hostname:80 --conf spark.es.nodes.wan.only=true
scala> import org.elasticsearch.spark._
scala> val test = Map("campaignId" -> 1, "subject" -> "subject test", "body" -> "body test")
scala> sc.makeRDD(Seq(test)).saveToEs("testIndex/testType")
It works, and here are the lessons I learned:
- The AWS Elasticsearch service listens on port 80, so I have to set "es.nodes=aws-public-hostname:80"
- Since the cluster is in the cloud, I have to set "es.nodes.wan.only=true"
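For reference, these two settings can also be passed per write instead of through --conf flags. A minimal sketch of the same spark-shell session (the hostname is the same placeholder as above, not a real endpoint):

```scala
import org.elasticsearch.spark._

// Same settings as the --conf flags above, passed per write.
// "aws-public-hostname" is a placeholder for the real AWS endpoint.
val cfg = Map(
  "es.nodes"          -> "aws-public-hostname:80",
  "es.nodes.wan.only" -> "true"
)

val test = Map("campaignId" -> 1, "subject" -> "subject test", "body" -> "body test")
sc.makeRDD(Seq(test)).saveToEs("testIndex/testType", cfg)
```

This uses the saveToEs(resource, cfg) overload from the elasticsearch-spark connector, which merges the per-write map over whatever was set on the SparkConf.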
So far so good. The only remaining thing to test for our production environment is the proxy, as the nodes of our big data cluster need a proxy to connect to the internet.
So I logged into one node and started the Spark shell in local mode, but I cannot make it work. Below are the steps I have tried so far:
- Test 1: without any proxy setting. I know it won't work; this is the same command I ran on my laptop:
spark-shell --master local[2] --jars ~/elasticsearch-spark-13_2.10-5.5.2.jar --conf spark.es.nodes=aws-public-hostname:80 --conf spark.es.nodes.wan.only=true
I get the following error message:
17/10/31 16:50:35 ERROR NetworkClient: Node [aws-public-hostname:80] failed (Connection timed out); no other nodes left - aborting...
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:294)
- Test 2: with the HTTP proxy set:
spark-shell --master local[2] --jars ~/elasticsearch-spark-13_2.10-5.5.2.jar --conf spark.es.nodes=aws-public-hostname:80 --conf spark.es.nodes.wan.only=true --conf spark.es.net.proxy.http.host=proxy-host --conf spark.es.net.proxy.http.port=3128
But I get the following error message:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
.....................
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest:
null
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:505)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:463)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
This is the error I saw on my laptop when "es.nodes.wan.only=true" was not set. But in this case I still get it even with "es.nodes.wan.only=true" set, once I add "es.net.proxy.http.host" and "es.net.proxy.http.port".
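To see what the proxy actually returns for that failing request, it may help to raise ES-Hadoop's REST-layer logging before rerunning Test 2. A sketch, assuming the default log4j.properties under Spark's conf directory:

```
# Add to conf/log4j.properties to trace ES-Hadoop's HTTP requests/responses
log4j.logger.org.elasticsearch.hadoop.rest=TRACE
```

With this enabled, the log should show the request ES-Hadoop sends through the proxy and the raw response that triggers the EsHadoopInvalidRequest above.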
- Test 3: with the SOCKS proxy set:
spark-shell --master local[2] --jars ~/elasticsearch-spark-13_2.10-5.5.2.jar --conf spark.es.nodes=aws-public-hostname:80 --conf spark.es.nodes.wan.only=true --conf spark.es.net.proxy.socks.host=proxy-host --conf spark.es.net.proxy.socks.port=3128
Running the same simple Spark code as on my laptop and waiting for more than 15 minutes (much longer than the previous test case), I got the following error message:
17/10/31 16:54:29 WARN CommonsHttpTransport: SOCKS proxy user specified but no/empty password defined - double check the [es.net.proxy.socks.pass] property
17/10/31 17:14:40 ERROR NetworkClient: Node [aws-public-hostname:80] failed (Connection reset); no other nodes left - aborting...
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[aws-public-hostname:80]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)
Here is what I observed:
- Test 3 failed with exactly the same error message as Test 1, but only after almost 15 minutes.
- In Test 3, I can see the proxy is being used, as the following log line appears in Test 3 only:
17/10/31 16:54:29 WARN CommonsHttpTransport: SOCKS proxy user specified but no/empty password defined - double check the [es.net.proxy.socks.pass] property
- I know the proxy works, as I use it in other projects, and while these test cases were failing I could check the proxy with the following command without any issue:
curl -x proxy-host:3128 www.google.com
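Along the same lines, the proxy check can be pointed at the Elasticsearch endpoint itself rather than a general site. A diagnostic sketch using the same placeholder hostnames as above:

```shell
# Hit the ES root endpoint through the proxy; this is the same
# GET / request ES-Hadoop issues when detecting the cluster version.
curl -v -x http://proxy-host:3128 http://aws-public-hostname:80/
```

If this returns the cluster-info JSON, the proxy can reach the service and the problem is likely in the ES-Hadoop proxy settings; if it fails, the proxy itself cannot reach the endpoint.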
So what is the correct way to configure the proxy settings in ES-Hadoop to work with the AWS Elasticsearch service?
Thanks