Read ES Index from Spark Executors

NiravVira · May 16, 2016, 4:09am

Infrastructure
ES: 2.2.0
Spark: 1.6
Scala: 2.10
ElasticHadoop: elasticsearch-hadoop 2.2.0 (I can use elasticsearch-spark-2.10 if needed)
Kafka: 0.9.0
Host: CDH5.7 VM
Spark Streaming Job Desc: Read data from a Kafka topic. For each unique Id, read the correlated data from an ES index/type, aggregate it, and store the result in another index/type for search.

The SparkContext is created on the driver along with the ES settings.

I am able to read from ES fine as long as I call .collect and move the execution to the Spark driver.

Question: How can I make the read happen on the Spark executors so I can leverage the parallelism? Googling suggests using a connection pool at the RDD partition level, as is done for connections to the likes of Cassandra.

How do I get a connection pool going for ES? Is there any concrete implementation or documentation for a connection pool that can be used on the executors?

code snippet

val stream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, offsets, messageHandler)

stream
  .foreachRDD { rdd =>
    if (!rdd.isEmpty) {
      ...
      /* This works fine, so connectivity from the driver to ES is good:
      val myrdd = sc.esRDD("someindex/sometype", "?q=EntityId:993b0000-e516-6c3b-79a9-08d349e3fd34")
      myrdd.foreach(x => println("item details" + x._2))
      */
      rdd.collect().foreach { item =>
        ... some operation
        val entityId = kafkaMessage.entityId // kafkaMessage is resolved through code not shown here
        // sc is only available on the driver, hence the collect() above
        val esDataRDD = sc.esRDD("someindex/sometype", "?q=EntityId:" + entityId)
        ...
      }
    }
  }
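
One way to approach the question above without a connection pool is to skip `sc.esRDD` inside the loop and issue plain HTTP searches from within each partition, since those do not need the SparkContext. The following is only a minimal sketch under stated assumptions: `EsLookup`, `searchUrl`, `fetch` and `lookupPartition` are hypothetical names (not part of elasticsearch-hadoop), ES is assumed to be reachable over HTTP from every executor, and the response is returned as raw JSON.

```scala
import java.net.{HttpURLConnection, URL, URLEncoder}
import scala.io.Source

// Hypothetical helper, not part of elasticsearch-hadoop: plain HTTP lookups
// that can run inside an executor task because they never touch the SparkContext.
object EsLookup {

  // Builds a URI-search URL for one entity id; the query value is URL-encoded.
  def searchUrl(hostPort: String, resource: String, entityId: String): String = {
    val query = URLEncoder.encode("EntityId:\"" + entityId + "\"", "UTF-8")
    "http://" + hostPort + "/" + resource + "/_search?q=" + query
  }

  // Runs one GET request and returns the raw JSON response body.
  def fetch(url: String): String = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    conn.setConnectTimeout(5000)
    conn.setReadTimeout(30000)
    val source = Source.fromInputStream(conn.getInputStream, "UTF-8")
    // Reading the body fully and closing the stream lets the JVM reuse the
    // underlying socket through HTTP keep-alive.
    try source.mkString finally source.close()
  }

  // Has the shape expected by rdd.mapPartitions once the first argument list
  // is applied: one lazy lookup per entity id in the partition.
  def lookupPartition(hostPort: String, resource: String)(ids: Iterator[String]): Iterator[(String, String)] =
    ids.map(id => id -> fetch(searchUrl(hostPort, resource, id)))
}
```

On an RDD of entity ids this could be wired in as `ids.mapPartitions(EsLookup.lookupPartition("eshost:9200", "someindex/sometype"))`, where the host name is a placeholder. The sketch sends one request per id, so it only suits batches with a modest number of distinct ids.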

Nicolas_Phung · May 20, 2016, 4:45pm

Hello,

I'm also interested in whether these best practices can be applied with elasticsearch-spark. We are running into trouble with several Spark Streaming jobs writing to Elasticsearch with elasticsearch-spark 2.3.1:

code snippet

16/05/19 14:59:19 WARN TaskSetManager: Lost task 2.0 in stage 1122.0 (TID 11389, slave04.local): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:190)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:379)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopTransportException: java.net.BindException: Adresse déjà utilisée (Address already in use)
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:121)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:434)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:414)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:418)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:122)
at org.elasticsearch.hadoop.rest.RestClient.esVersion(RestClient.java:564)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:178)
... 10 more
Caused by: java.net.BindException: Adresse déjà utilisée (Address already in use)
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:382)
at java.net.Socket.bind(Socket.java:644)
at java.net.Socket.<init>(Socket.java:433)
at java.net.Socket.<init>(Socket.java:286)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport.execute(CommonsHttpTransport.java:468)
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:104)
... 16 more

This happens when we run many Spark Streaming jobs, and it seems to impact existing jobs that use elasticsearch-spark to write to Elasticsearch. Does anyone have any idea?

Regards,

costin · May 24, 2016, 10:28am

Streaming is a special case: conceptually it's a long-running process, but Spark is actually batching, so the micro-batching strategy means a series of small batch processes/tasks that keep writing to ES.
As there's no proper API for that, the Spark docs suggest keeping the connections alive and passing them to each batch in order to avoid re-creating them.
This works with a programmatic approach (open a connection, write some data, check, close) but fails with a declarative one (take this data and save it).
Spark 2 looks to be introducing/changing some APIs, and once they are finalized we'll look into providing hooks for reusing connections.
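
The "keep the connections alive and pass them to the batch" pattern mentioned in the reply above can be sketched roughly as follows. This is an illustration only, not something elasticsearch-hadoop provides: `ConnectionPool`, `RecordingConnection` and `PartitionWriter` are hypothetical names, and `RecordingConnection` is a stand-in for a real client. In a Spark job the pool would typically live in a Scala `object` so that each executor JVM creates it lazily and reuses it across micro-batches.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.mutable.ArrayBuffer

// Minimal pool: hands out an idle connection when one exists and creates
// a new one only when the idle queue is empty.
class ConnectionPool[C <: AnyRef](create: () => C) {
  private val idle = new ConcurrentLinkedQueue[C]()

  def borrow(): C = Option(idle.poll()).getOrElse(create())

  def giveBack(conn: C): Unit = { idle.offer(conn); () }

  def idleCount: Int = idle.size()
}

// Hypothetical stand-in for a real client; it only records what was sent.
class RecordingConnection {
  val sent = ArrayBuffer[String]()
  def send(record: String): Unit = { sent += record; () }
}

object PartitionWriter {
  // Meant to be called from rdd.foreachPartition, so it runs on an executor:
  // borrow one connection, push every record of the partition, return it.
  def writePartition[C <: AnyRef](pool: ConnectionPool[C], records: Iterator[String])(send: (C, String) => Unit): Unit = {
    val conn = pool.borrow()
    try records.foreach(record => send(conn, record))
    finally pool.giveBack(conn) // a real pool would discard a connection that failed
  }
}
```

Because the connection goes back to the pool instead of being closed, a second partition handled by the same JVM reuses it rather than opening a new socket.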

NiravVira · May 25, 2016, 5:04am

Costin, thanks for your reply. My use case involves reading from ES as well. I am wondering if there are currently any options (with or without a connection pool) to call .esRDD without collect, thereby running the code on the Spark executors. Any samples are greatly appreciated.
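
For the read side, one option that avoids both `collect` and a hand-rolled connection pool is to treat the ES data as just another RDD and join it with each micro-batch. `sc.esRDD` is called on the driver but is lazy, so the actual reads happen on the executors when the join runs. This is a sketch under assumptions: `EsJoin`, `keyByEntityId` and `enrich` are hypothetical names, the batch is assumed to be keyed by entity id, and the documents are assumed to carry an `EntityId` field as in the query from the original post.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._ // adds esRDD to SparkContext

object EsJoin {
  type Doc = scala.collection.Map[String, AnyRef]

  // Re-keys a (documentId, document) pair by its EntityId field.
  // Documents without that field are dropped.
  def keyByEntityId(hit: (String, Doc)): Option[(String, Doc)] =
    hit._2.get("EntityId").map(id => (id.toString, hit._2))

  // batch: (entityId, kafkaMessage) pairs from one micro-batch.
  // The result pairs every message with each matching ES document.
  def enrich(sc: SparkContext, batch: RDD[(String, String)]): RDD[(String, (String, Doc))] = {
    // Lazy: nothing is read here; partitions are fetched on executors during the join.
    val esByEntity: RDD[(String, Doc)] =
      sc.esRDD("someindex/sometype").flatMap(hit => keyByEntityId(hit).toList)
    batch.join(esByEntity)
  }
}
```

In a streaming job this could be invoked as `stream.transform(rdd => EsJoin.enrich(rdd.sparkContext, rdd))`. Note that without a query argument to `esRDD` every batch scans the whole index/type, so this only makes sense when the index is small or the query can be narrowed.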
