Getting a "No data nodes with HTTP-enabled available" error when writing from Spark to Elasticsearch on Google Dataproc

I'm trying to export data from Spark to an Elasticsearch cluster running on Google Container Engine (GKE). I've deployed the ES cluster using the configs from https://github.com/pires/kubernetes-elasticsearch-cluster/tree/master/stateful, which create a couple of pods for each node type: master, client, and data.

I'm able to insert data into ES through the Spark connector if I have it connect to one of the client nodes and set

es.nodes.client.only=true
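For reference, the connector settings look roughly like this (a sketch; the service name and index are placeholders I've filled in, not from my actual job):

```properties
# elasticsearch-hadoop connector options (host/index names are placeholders)
es.nodes=elasticsearch-client    # Kubernetes Service fronting the ES client nodes (assumed name)
es.port=9200
es.nodes.client.only=true        # route all requests through client nodes only
es.resource=myindex/mytype
```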

After reading
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html#_network and
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/cloud.html
though, I'd like to have Spark write directly to the data nodes. However, if I switch back to the default

es.nodes.client.only=false

I get this error:

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: No data nodes with HTTP-enabled available
	at org.elasticsearch.hadoop.rest.InitializationUtils.filterNonDataNodesIfNeeded(InitializationUtils.java:157)
	at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:576)
	at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
	at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:91)
	at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:91)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Error summary: EsHadoopIllegalArgumentException: No data nodes with HTTP-enabled available

I'm not too well versed in Docker, but looking through the configurations for the linked deployment, I found these lines, which are probably responsible for the lack of HTTP availability:
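The relevant bit appears to be an environment variable in the data-node spec, something along these lines (reproduced from memory of the repo, so the exact names may differ):

```yaml
# from the es-data StatefulSet spec (approximate; check the repo for the exact wording)
env:
- name: HTTP_ENABLE
  value: "false"   # data nodes do not serve HTTP (port 9200) by default
```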

I also only see network configuration for transport-level traffic (port 9300, non-HTTP). I don't think this deployment is meant to be used in the way you're describing.

Thanks, toggling that to "true" does fix it.
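For anyone else trying this: besides flipping that environment variable, the data pods also need port 9200 reachable. A minimal sketch of the additions to the data-node spec, assuming the naming conventions of the linked repo (not verified against it):

```yaml
# additions to the es-data spec (a sketch; names are assumptions)
env:
- name: HTTP_ENABLE
  value: "true"          # enable the HTTP (REST) endpoint on data nodes
ports:
- containerPort: 9200    # expose HTTP alongside the transport port (9300)
  name: http
```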

I'm not sure how much of a difference this makes, but isn't it suboptimal that the Elasticsearch Spark connector communicates with Elasticsearch over HTTP rather than the transport protocol on port 9300?

The transport protocol is not backwards compatible across Elasticsearch versions the way HTTP is. There have also been a fair number of benchmarks comparing HTTP and RPC, and they have found that the two have comparable performance characteristics.

Got it. Thanks.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.