AWS Elasticsearch Service + Spark + proxy

Hi,

We are testing pushing some of the data from our on-premises big data cluster to the Elasticsearch service running on AWS. The Elasticsearch version is 5.5.2. We want to do a POC using Spark + elasticsearch-hadoop. Our Spark cluster is running Spark 1.6.2.

First, I followed the examples and documentation on my laptop, using the following code:

./spark-1.6.3-bin-hadoop2.6/bin/spark-shell --master local[2] --jars repository/org/elasticsearch/elasticsearch-spark-13_2.10/5.5.2/elasticsearch-spark-13_2.10-5.5.2.jar --conf spark.es.nodes=aws-public-hostname:80 --conf spark.es.nodes.wan.only=true

scala>import org.elasticsearch.spark._
scala>val test = Map("campaignId" -> 1, "subject" -> "subject test", "body" -> "body test")
scala>sc.makeRDD(Seq(test)).saveToEs("testIndex/testType")

It works, and here are the lessons I learned:

  1. AWS Elasticsearch is using port 80, so I have to set "es.nodes=aws-public-hostname:80"
  2. Since it is in the cloud, we have to set "es.nodes.wan.only=true" (a programmatic version of these settings is sketched right after this list)
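
For reference, the same two settings can also be set programmatically instead of on the spark-shell command line. Below is a minimal sketch of a standalone job doing the same write as above; the object name and app name are placeholders of mine, and the endpoint is the same placeholder hostname as above:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object EsWritePoc {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-write-poc")
      .set("es.nodes", "aws-public-hostname:80")  // AWS Elasticsearch endpoint listens on port 80
      .set("es.nodes.wan.only", "true")           // cloud/WAN: talk only to the declared endpoint
    val sc = new SparkContext(conf)
    val doc = Map("campaignId" -> 1, "subject" -> "subject test", "body" -> "body test")
    sc.makeRDD(Seq(doc)).saveToEs("testIndex/testType")
    sc.stop()
  }
}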

So far so good. The only thing left to test for our production environment is the proxy, as the nodes of our big data cluster need a proxy to connect to the internet.

So I logged into one node and started the Spark shell in local mode, but I have had problems making it work. Below are the steps I have tried so far:

  1. Without any proxy setting. I know it won't work; same command as I ran on my laptop:
    spark-shell --master local[2] --jars ~/elasticsearch-spark-13_2.10-5.5.2.jar --conf spark.es.nodes=aws-public-hostname:80 --conf spark.es.nodes.wan.only=true

I get the following error message:
17/10/31 16:50:35 ERROR NetworkClient: Node [aws-public-hostname:80] failed (Connection timed out); no other nodes left - aborting...
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:294)

  2. Set the HTTP proxy:
    spark-shell --master local[2] --jars ~/elasticsearch-spark-13_2.10-5.5.2.jar --conf spark.es.nodes=aws-public-hostname:80 --conf spark.es.nodes.wan.only=true --conf spark.es.net.proxy.http.host=proxy-host --conf spark.es.net.proxy.http.port=3128
    But I got the following error message:
    org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
    .....................
    Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest:
    null
    at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:505)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:463)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)

This is the error I saw on my laptop when "es.nodes.wan.only=true" was not set. But in this case, even though "es.nodes.wan.only=true" is set, I still get this error once I add "es.net.proxy.http.host" and "es.net.proxy.http.port".
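
One side note on Test 2: if I read the elasticsearch-hadoop configuration reference correctly, there are matching credential properties next to the proxy host/port ones, in case the proxy requires authentication. A sketch of the same command with them added (proxy-user and proxy-pass are placeholders; our proxy may not need them at all):

spark-shell --master local[2] --jars ~/elasticsearch-spark-13_2.10-5.5.2.jar \
  --conf spark.es.nodes=aws-public-hostname:80 \
  --conf spark.es.nodes.wan.only=true \
  --conf spark.es.net.proxy.http.host=proxy-host \
  --conf spark.es.net.proxy.http.port=3128 \
  --conf spark.es.net.proxy.http.user=proxy-user \
  --conf spark.es.net.proxy.http.pass=proxy-pass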

  3. Set the SOCKS proxy:
    spark-shell --master local[2] --jars ~/elasticsearch-spark-13_2.10-5.5.2.jar --conf spark.es.nodes=aws-public-hostname:80 --conf spark.es.nodes.wan.only=true --conf spark.es.net.proxy.socks.host=proxy-host --conf spark.es.net.proxy.socks.port=3128
    Running the same simple Spark code as on my laptop, and waiting for more than 15 minutes (much longer than the previous test cases), I got the following error message:
    17/10/31 16:54:29 WARN CommonsHttpTransport: SOCKS proxy user specified but no/empty password defined - double check the [es.net.proxy.socks.pass] property
    17/10/31 17:14:40 ERROR NetworkClient: Node [aws-public-hostname:80] failed (Connection reset); no other nodes left - aborting...
    org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
    Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[aws-public-hostname:80]]
    at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)

Here is what I observed:

  1. Test 3 failed with exactly the same error message as Test 1, but only after almost 15 minutes.
  2. In Test 3, I can see the proxy is being used, as shown by the following log message, which appears in Test 3 only:
    17/10/31 16:54:29 WARN CommonsHttpTransport: SOCKS proxy user specified but no/empty password defined - double check the [es.net.proxy.socks.pass] property
  3. I know the proxy works, as I use it in other projects, and when these test cases failed, I could still check the proxy with the following command without any issue (an additional check against the Elasticsearch endpoint itself is sketched after this list):
    curl -x proxy-host:3128 www.google.com
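
One extra check that might narrow this down, using the same placeholder proxy and endpoint as above: hit the Elasticsearch endpoint itself through the proxy, which should be roughly the same GET the connector issues when it tries to detect the ES version, and look at what comes back:

curl -x proxy-host:3128 http://aws-public-hostname:80/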

So what is the correct way to set the proxy settings in elasticsearch-hadoop so that it works with the AWS Elasticsearch service?

Thanks

OK, I think we may have found the root cause.

The error message in the stack trace is the following:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest:
null
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:507)

In this case, the elasticsearch-hadoop code failed to extract the root message from the AWS response, so I had to hack into the code here: https://github.com/elastic/elasticsearch-hadoop/blob/v5.5.2/mr/src/main/java/org/elasticsearch/hadoop/rest/RestClient.java#L482

to get the real message in the response body, which is the following:
{"Message":"User: anonymous is not authorized to perform: es:ESHttpGet on resource: xxx"}

My guess is that the above body from AWS doesn't match what the elasticsearch-hadoop library expects, so it fails to surface the root message.
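
As an aside, it might be possible to see that raw body without patching the jar by turning up logging on the connector's REST package (the package name is visible in the stack traces above) in Spark's log4j configuration. I have not verified that this actually prints the response body, so treat it as an assumption:

# in Spark's conf/log4j.properties (log4j 1.x)
# assumption: TRACE on the connector's REST package logs the HTTP exchange, including the response body
log4j.logger.org.elasticsearch.hadoop.rest=TRACE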

