Spark: the analogy between an ES shard and a Spark partition is wrong

The es4Hadoop plugin draws an analogy between an ES shard and a Spark partition, but it's not a good one.

Most Spark users advise keeping partitions "small", around 512 MB; for instance, Spark cannot cache more than 2 GB per partition to disk (a known issue, not fixed yet). Elasticsearch, on the other hand, advises having big shards (around 10 GB). These two recommendations are not compatible.

We should be able to control how many partitions are created when reading from Elasticsearch. By default, it should use the same value HDFS uses (64 MB/128 MB == 1 partition); I think it's a good idea to follow that.

(repartition can help, but it's very resource-intensive and can cause OOMs; see the sketch below.)
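
A minimal spark-shell sketch of the repartition workaround, assuming the es-hadoop connector is on the classpath; the node address, index name, and target partition count are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Placeholder node address and index name.
val conf = new SparkConf()
  .setAppName("es-repartition-workaround")
  .set("es.nodes", "localhost:9200")

val sc = new SparkContext(conf)

// By default the connector creates one Spark partition per ES shard,
// so a 10 GB shard becomes a single 10 GB Spark partition.
val rdd = sc.esRDD("my-index/my-type")

// Redistribute into more, smaller partitions. The target count (200) is
// illustrative; the full shuffle this triggers is what makes the workaround
// resource-intensive and able to OOM on very large shards.
val smaller = rdd.repartition(200)
```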

@ebuildy in 5.0 we introduced the parallel reader functionality in the connector. When reading from Elasticsearch on clusters running 5.0 and up, the connector will subdivide the shards into slices based on the number of documents in them. This is configurable via es.input.max.docs.per.partition (default: 100,000). This should let you split the larger ES shards into smaller, Spark-friendly partitions.
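
A minimal spark-shell sketch of using that setting, assuming an ES 5.0+ cluster; the node address, index name, and the 50,000 value are placeholders chosen for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Cap each Spark partition at roughly 50,000 documents instead of the
// default 100,000, so large shards are sliced into several partitions.
val conf = new SparkConf()
  .setAppName("es-parallel-reader")
  .set("es.nodes", "localhost:9200")
  .set("es.input.max.docs.per.partition", "50000")

val sc = new SparkContext(conf)

val rdd = sc.esRDD("my-index/my-type")

// With the parallel reader, this should be larger than the shard count
// whenever shards hold more documents than the configured maximum.
println(s"Spark partitions created: ${rdd.getNumPartitions}")
```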

Ha, fantastic!

(Cannot believe that Spark still has the max 2GB bug :confused: :confused: )
