Reading from Elasticsearch to Spark is very slow

Hi, I am using Spark 2.4.0, Elasticsearch 6.6.2, and elasticsearch-spark-20_2.11-6.8.1.jar as the connector.

I am running Spark in local mode with the memory configured to 8 GB.
I have an Elasticsearch index with 14 million documents.

I want to load the whole index to a Spark DataFrame, so I am doing:

import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._

// Build an SQLContext from the existing SparkContext, load the whole
// index as a DataFrame, and cache it; count() materializes the read.
val sql = new SQLContext(sc)
val myDF = sql.esDF("my-index/my-type").cache()

println(myDF.count())

I can see the memory filling up bit by bit, which is expected because of the cache(), and the memory is large enough to hold the entire dataset, but the process is extremely slow (over 2 hours).
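In case it's relevant: as far as I understand from the elasticsearch-hadoop configuration docs, these are the settings that affect read throughput and parallelism. I am currently on the defaults; the values below are only illustrative guesses, not something I have verified:

```
# elasticsearch-hadoop read settings (names from the connector docs;
# values here are illustrative guesses, not tested)
es.scroll.size                    1000     # hits fetched per scroll request
es.input.max.docs.per.partition   100000   # split large shards into more input partitions
```

I also wonder whether the local master matters here, since plain "local" runs a single worker thread while "local[*]" uses all cores.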

Any hints would be highly appreciated.

Thanks
