Hi, I am using Spark 2.4.0, Elasticsearch 6.6.2, and elasticsearch-spark-20_2.11-6.8.1.jar as the connector.
I am running Spark in local mode and have configured the memory to 8 GB.
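In case the setup matters: sc is created roughly as in the sketch below; the es.nodes/es.port values are only placeholders for my actual cluster address, and the 8 GB itself is passed at launch time (--driver-memory 8g) rather than set in code.

import org.apache.spark.{SparkConf, SparkContext}

// Roughly how sc is configured (es.nodes/es.port are placeholders);
// the 8 GB driver memory is given when launching, e.g. --driver-memory 8g.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("es-index-to-dataframe")
  .set("es.nodes", "localhost") // placeholder Elasticsearch host
  .set("es.port", "9200")       // placeholder Elasticsearch port
val sc = new SparkContext(conf)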
I have an Elasticsearch index with 14 million documents.
I want to load the whole index into a Spark DataFrame, so I am doing:
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._ // brings in the esDF method

val sql = new SQLContext(sc)
val myDF = sql.esDF("my-index/my-type").cache() // read the whole index and cache it
println(myDF.count())                           // action that forces the full read
I can see the memory filling up bit by bit, which is expected because of the cache(), and the memory is large enough to hold the entire dataset. However, the process is extremely slow (over 2 hours).
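In case it helps with diagnosing, I can also check how many partitions the connector split the read into (reusing myDF from the snippet above); as far as I understand, it creates one Spark partition per index shard:

// Quick check of the read parallelism; the connector should create
// one Spark partition per index shard, as far as I know.
println(myDF.rdd.getNumPartitions)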
Any hint is highly appreciated.
Thanks