Hi, I am using Spark 2.4.0, Elasticsearch 6.6.2, and elasticsearch-spark-20_2.11-6.8.1.jar as the connector.
I am running Spark in local mode and have configured the memory to 8 GB.
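For reference, the setup is equivalent to launching the shell along these lines (assuming the 8 GB goes to the driver, since local mode runs everything in a single JVM):

spark-shell --master "local[*]" --driver-memory 8g --jars elasticsearch-spark-20_2.11-6.8.1.jar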
I have an Elasticsearch index with 14 million documents.
I want to load the whole index to a Spark DataFrame, so I am doing:
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._

val sql = new SQLContext(sc)
// Load the whole index as a DataFrame; count() triggers the actual read and materializes the cache.
val myDF = sql.esDF("my-index/my-type").cache()
println(myDF.count())
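For completeness, with the connector's connection settings passed explicitly the call would look like this (the es.nodes/es.port values are placeholders, not my real cluster):

// Placeholder connection settings; the real host and port go here.
val esCfg = Map(
  "es.nodes" -> "localhost",
  "es.port"  -> "9200"
)
val myDF = sql.esDF("my-index/my-type", esCfg).cache()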
I can see the memory being filled bit by bit, which is expected because of the cache(), and the memory is large enough to hold the entire dataset, but the process is extremely slow (the count() takes over 2 hours).
Any hint is highly appreciated.
Thanks