ES/Spark RDD Partitioning Strangeness

We're using the es-hadoop connector to pull data from an ES index into Spark (with PySpark's newAPIHadoopRDD). The issue we're seeing is that the returned RDD is skewed across partitions: it always comes back with 9 partitions, and only 3 of them are populated.
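For context, here's a trimmed-down sketch of how we're loading the index and measuring the skew (the node address and index name are placeholders, not our real values):

```python
from pyspark import SparkContext

sc = SparkContext(appName="es-partition-test")

# Read the ES index through the es-hadoop MapReduce input format.
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.nodes": "localhost:9200",    # placeholder
        "es.resource": "my-index/my-type",  # placeholder
    },
)

print(rdd.getNumPartitions())           # always reports 9 for us
print(rdd.glom().map(len).collect())    # only 3 partitions hold any documents
```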

Is there any way to have newAPIHadoopRDD return more evenly distributed partitions from the ES data? I realize that I can repartition them after the fact, but the initial set of partitions is causing some performance headaches.

Welcome, Ivan.
You can control the number of documents per partition by setting the es.input.max.docs.per.partition value.
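Building on your snippet, it goes into the same conf dict you pass to newAPIHadoopRDD. Note that Hadoop conf values are strings, and the cap of 100000 below is just an illustrative value; tune it to your shard sizes:

```python
conf = {
    "es.nodes": "localhost:9200",       # placeholder
    "es.resource": "my-index/my-type",  # placeholder
    # Cap the number of documents per input split so that large
    # shards are broken into multiple Spark partitions.
    "es.input.max.docs.per.partition": "100000",  # example value
}

rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf,
)
```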

Thanks, Wang! That sounds like exactly what I need. I'll give it a shot.

I can confirm that es.input.max.docs.per.partition works well and solved my issue. Thanks again!
