How to disable auto discovery in elasticsearch-hadoop


#1

Hi, I'm running an Elasticsearch client in a Scala project and doing some tests in a local environment behind a proxy (set up in $httpProxy; my local machines are listed in the -Dhttp.nonProxyHosts Java option since they are local).
I have configured my connector with this:

val client = ElasticClient.remote(settings, uri)

and only set the clusterName.
But when I run my integration test, I get:

ERROR [11-20-2015 14:03:41,478] [XCI=] org.apache.spark.executor.Executor - Exception in task 3.0 in stage 1.0 (TID 7)
 org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [_nodes/transport] failed; server[null] returned [407|Proxy Authentication Required: [... this is a 407 proxy html page ...]
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:335) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:300) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:304) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:118) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.elasticsearch.hadoop.rest.RestClient.discoverNodes(RestClient.java:100) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverNodesIfNeeded(InitializationUtils.java:58) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:371) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:38) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEsWithMeta$1.apply(EsSpark.scala:87) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEsWithMeta$1.apply(EsSpark.scala:87) ~[elasticsearch-spark_2.10-2.1.0.Beta4.jar:2.1.0.Beta4]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) ~[spark-core_2.10-1.4.0.jar:1.4.0]
    at org.apache.spark.scheduler.Task.run(Task.scala:70) ~[spark-core_2.10-1.4.0.jar:1.4.0]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) [spark-core_2.10-1.4.0.jar:1.4.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_80]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_80]
    at java.lang.Thread.run(Thread.java:745) [?:1.7.0_80]
TRACE [11-20-2015 14:03:41,480] [XCI=] org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport - Rx [...]
TRACE [11-20-2015 14:03:41,480] [XCI=] org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport - Closing HTTP transport to localhost:9200
TRACE [11-20-2015 14:03:41,480] [XCI=] org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport - Closing HTTP transport to localhost:9200

I'm pretty sure it's because the client wants to discover other nodes but cannot reach them because it is behind a proxy.
What options am I missing? Where should I add them?
Thanks.


(Costin Leau) #2

Looks like a case for es.nodes.wan.only, which is available in 2.2.
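
For illustration, a minimal sketch of how that setting might be expressed as plain key/value pairs in Scala. The host names are hypothetical placeholders, and the object name `WanOnlySettings` is only for this example. It assumes ES-Hadoop 2.2 or later, where `es.nodes.wan.only` makes the connector skip node discovery and connect only through the hosts declared in `es.nodes`.

```scala
object WanOnlySettings {
  // Hypothetical host names; replace with addresses the Spark executors can reach.
  val esNodes: Seq[String] = Seq("es-host-1", "es-host-2")

  // Plain connector settings, kept free of Spark types so the sketch stands alone.
  val settings: Map[String, String] = Map(
    "es.nodes"          -> esNodes.mkString(","), // comma-separated list of declared nodes
    "es.port"           -> "9200",                // REST (HTTP) port, not the transport port
    "es.nodes.wan.only" -> "true"                 // use only the declared nodes, no discovery
  )
}
```

These pairs could then be applied to a `SparkConf` with `set(key, value)`, or passed as the configuration map of the `EsSpark` save methods.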


#3

The problem is that it was working before when running Spark locally from IntelliJ, so I should be able to solve it without upgrading (perhaps...).
I'm now seeing what could be wrong:
I have set up one http client (ElasticClient.remote from sksamuel.elastic4s) to run some queries and other code, and one Spark client (EsSpark.saveToEsWithMeta).
Each has its own configuration. Here is the one for org.elasticsearch.spark:
es.net.proxy.http.use.system.props to false
es.nodes.discovery to false
es.nodes.client.only to true
es.nodes to the client (I need to put a list; I haven't had time to look that up yet).
Before the error, when it was working with a local Spark, I only used the http client from sksamuel.elastic4s. So it might just be a configuration error, or maybe a conflict between the libraries? What I find strange is that the org.elasticsearch.spark configuration doesn't set the cluster name at all. I didn't find anything about it on either https://github.com/elastic/elasticsearch-hadoop or https://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html
Is that normal?
I'm trying to solve the configuration problem for the moment.


(Costin Leau) #4

ES-Hadoop/Spark relies on REST and thus it doesn't need the cluster name, that is required for clients using the transport/java/node client (which effectively means the client becomes part of the cluster in one form or the other).
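
As a small sketch of what "relies on REST" means in practice: the failing call in the stack trace above is a plain HTTP GET on `_nodes/transport`, so only a host and an HTTP port are involved and no cluster name appears anywhere. The helper name `RestEndpoint.url` is hypothetical and only builds the URL; it does not perform a request.

```scala
import java.net.URI

object RestEndpoint {
  // Builds the HTTP URL for a REST path on a single node.
  // 9200 is the default Elasticsearch HTTP port.
  def url(host: String, path: String, port: Int = 9200): URI =
    new URI("http", null, host, port, "/" + path.stripPrefix("/"), null, null)
}
```

For example, `RestEndpoint.url("localhost", "_nodes/transport")` corresponds to the request that the log shows being sent to localhost:9200.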


#5

Thanks to your answer, I understood that the port was wrong: since the connection goes through REST, I put back 9200.
Then I found someone with the same problem: https://groups.google.com/forum/#!topic/elasticsearch/RQGz3DdmY24
and got the same result: setting node discovery back to true fixed my problem.
I have no idea why, and what I wanted was to disable it, since we will have another Elasticsearch cluster on the same network for logging the other application.
Should I be worried, or is this standard and it will not affect the two different clusters, since I specified all the nodes under the es.nodes setting?


(Costin Leau) #6

I'm not sure what you are asking. You are confusing client routing with cluster discovery.
I recommend stepping back a bit and picking up the reference documentation, in particular the getting started chapter.
Don't rush going through it - it will not take too long and will explain the concepts behind ES architecture.

