The es4Hadoop plugin maps one Elasticsearch shard to one Spark partition, but this analogy does not work well.
Most Spark users advise keeping partitions "small", around 512 MB; for instance, Spark cannot cache a block larger than 2 GB on disk (a known issue, still not fixed), whereas Elasticsearch recommends large shards (around 10 GB). The two recommendations are not compatible.
We should be able to control how many partitions are created when reading from Elasticsearch. And by default, it would make sense to use the same value as HDFS (64 MB/128 MB == 1 partition); I think that is a good rule to follow.
(repartition can help, but it is very resource-intensive and can cause OOM errors; see the sketch below.)
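For reference, a minimal sketch of the repartition workaround mentioned above, using the elasticsearch-spark RDD API; the node address, index/type name and partition count are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Sketch only: by default each ES shard maps to one Spark partition,
// so a few ~10 GB shards become a few huge partitions.
object RepartitionWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-repartition-workaround")
      .set("es.nodes", "localhost:9200") // assumed ES address

    val sc = new SparkContext(conf)

    // Index/type name is illustrative.
    val docs = sc.esRDD("logs-2017/doc")

    // repartition() splits the data into smaller pieces, but it shuffles
    // every record over the network and can itself run out of memory.
    val resized = docs.repartition(200)
    println(s"Partitions after repartition: ${resized.getNumPartitions}")

    sc.stop()
  }
}
```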
@ebuildy in 5.0 we introduced the parallel reader functionality in the connector. When reading from Elasticsearch clusters running 5.0 and up, the connector subdivides the shards into slices based on the number of documents in them. This is configurable via the es.input.max.docs.per.partition setting (default 100,000). This should let you split the larger ES partitions into smaller, Spark-friendly partitions.
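To illustrate, a minimal sketch of setting that option through the Spark SQL data source; the index name and the 50,000 value are example assumptions only:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: tune the 5.0+ parallel reader so large shards are sliced into
// smaller Spark partitions. All values below are illustrative.
object EsSlicedRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-sliced-read")
      .config("es.nodes", "localhost:9200") // assumed ES address
      .getOrCreate()

    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      // Cap the number of documents per input partition (default 100,000).
      .option("es.input.max.docs.per.partition", "50000")
      .load("logs-2017") // assumed index name

    println(s"Input partitions: ${df.rdd.getNumPartitions}")
    spark.stop()
  }
}
```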