Getting a "No data nodes with HTTP-enabled available" error when writing from Spark to elasticsearch on Google Dataproc


(Ben Weisburd) #1

I'm trying to export data from Spark to an Elasticsearch cluster running on Google Container Engine (GKE). I've deployed an ES cluster using the configs from https://github.com/pires/kubernetes-elasticsearch-cluster/tree/master/stateful, which create a couple of each node type: master, client, data.

I'm able to insert data into ES through the Spark connector if I point it at one of the client nodes and set

es.nodes.client.only=true
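For context, this is roughly how that write looks on my side (a minimal PySpark sketch; the service name `es-client-svc` and index `myindex/doc` are placeholders, not from the actual deployment):

```python
# Connector options for writing through client nodes only.
# "es.nodes.client.only" tells es-hadoop to route all traffic
# through client (coordinating) nodes instead of data nodes.
es_write_conf = {
    "es.nodes": "es-client-svc",      # Kubernetes Service for the client nodes (assumed name)
    "es.port": "9200",
    "es.nodes.client.only": "true",   # the setting discussed in this thread
    "es.resource": "myindex/doc",     # target index/type (placeholder)
}

# With a DataFrame `df`, the write itself would be:
# df.write.format("org.elasticsearch.spark.sql").options(**es_write_conf).save()
```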

After reading
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html#_network and
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/cloud.html
though, I'd like to have Spark write directly to the data nodes. However, if I switch back to the default es.nodes.client.only=false

I get this error:

	org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: No data nodes with HTTP-enabled available
	at org.elasticsearch.hadoop.rest.InitializationUtils.filterNonDataNodesIfNeeded(InitializationUtils.java:157)
	at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:576)
	at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
	at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:91)
	at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:91)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Error summary: EsHadoopIllegalArgumentException: No data nodes with HTTP-enabled available

(James Baiera) #2

I'm not too well versed in Docker, but looking through the configuration for the linked deployment, I found the lines that disable HTTP on the data nodes, which are probably responsible for the lack of HTTP availability.

I also only see network configuration for transport-level traffic (port 9300, non-HTTP). I don't think this deployment is meant to be used in the manner you're describing.
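If that's the case, the data-node manifest would need something along these lines to serve HTTP for the connector (an assumed excerpt; the exact variable name and manifest layout should be checked against the linked repo):

```yaml
# Hypothetical data-node StatefulSet fragment: enable HTTP and expose port 9200
env:
  - name: HTTP_ENABLE
    value: "true"          # was "false"; data nodes must serve HTTP for es-hadoop
ports:
  - containerPort: 9200    # HTTP, used by the Spark connector
  - containerPort: 9300    # transport, node-to-node traffic
```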


(Ben Weisburd) #3

Thanks, toggling that to "true" does fix it.


(Ben Weisburd) #4

I'm not sure how much of a difference it makes, but isn't it suboptimal that the Elasticsearch Spark connector communicates with Elasticsearch over HTTP rather than the binary transport protocol on port 9300?


(James Baiera) #5

The transport protocol is not backwards compatible between versions of Elasticsearch the way HTTP is. There have also been a fair number of benchmarks comparing HTTP with the binary transport, and they found the two have comparable performance characteristics.


(Ben Weisburd) #6

Got it. Thanks.



(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.