The es4Hadoop plugin maps one Elasticsearch shard to one Spark partition, but this analogy does not work well.
Most Spark users advise keeping partitions "small", around 512 MB; for instance, Spark cannot cache a block larger than 2 GB on disk (a known issue, still not fixed), whereas Elasticsearch recommends large shards (around 10 GB). The two recommendations are not compatible.
We should be able to control how many partitions are created when reading from Elasticsearch. And by default, it would make sense to use the same value as HDFS (64 MB/128 MB == 1 partition); I think that is a good rule to follow.
(repartition can help, but it is very resource-intensive and can cause OOM errors; see the sketch below.)
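For reference, a minimal sketch of the repartition workaround mentioned above, using the elasticsearch-spark RDD API; the node address, index/type name and partition count are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Sketch only: by default each ES shard maps to one Spark partition,
// so a few ~10 GB shards become a few huge partitions.
object RepartitionWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-repartition-workaround")
      .set("es.nodes", "localhost:9200") // assumed ES address

    val sc = new SparkContext(conf)

    // Index/type name is illustrative.
    val docs = sc.esRDD("logs-2017/doc")

    // repartition() splits the data into smaller pieces, but it shuffles
    // every record over the network and can itself run out of memory.
    val resized = docs.repartition(200)
    println(s"Partitions after repartition: ${resized.getNumPartitions}")

    sc.stop()
  }
}
```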
@ebuildy in 5.0 we introduced the parallel reader functionality in the connector. When reading from Elasticsearch clusters running 5.0 and up, the connector subdivides the shards into slices based on the number of documents in them. This is configurable via the es.input.max.docs.per.partition setting (default 100,000). This should let you split the larger ES partitions into smaller, Spark-friendly partitions.
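To illustrate, a minimal sketch of setting that option through the Spark SQL data source; the index name and the 50,000 value are example assumptions only:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: tune the 5.0+ parallel reader so large shards are sliced into
// smaller Spark partitions. All values below are illustrative.
object EsSlicedRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-sliced-read")
      .config("es.nodes", "localhost:9200") // assumed ES address
      .getOrCreate()

    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      // Cap the number of documents per input partition (default 100,000).
      .option("es.input.max.docs.per.partition", "50000")
      .load("logs-2017") // assumed index name

    println(s"Input partitions: ${df.rdd.getNumPartitions}")
    spark.stop()
  }
}
```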