We're using the es-hadoop connector to dump data from an ES index into Spark (via pyspark and newAPIHadoopRDD). The issue we're seeing is that the returned RDD is skewed across partitions: it always comes back with 9 partitions, only 3 of which are populated.
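For context, the read path looks roughly like this (hosts and index name below are placeholders; the mapPartitions count is just how we measured the skew):

```python
from pyspark import SparkContext

sc = SparkContext(appName="es-read")

# Placeholder connection settings -- adjust for your cluster.
es_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "my_index/my_type",  # placeholder index/type
}

rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)

# Count documents per partition to see the distribution.
sizes = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(sizes)  # in our case: 9 partitions, only 3 of them non-empty
```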
Is there any way to have newAPIHadoopRDD return more evenly distributed partitions from the ES data? I realize that I can repartition them after the fact, but the initial set of partitions is causing some performance headaches.
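The after-the-fact workaround mentioned above is just a repartition, which forces a full shuffle of the data (hence the performance headache):

```python
# Rebalance after the read -- works, but shuffles everything.
balanced = rdd.repartition(sc.defaultParallelism)
```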