We're using the es-hadoop connector to dump data from an ES index into Spark (via pyspark and newAPIHadoopRDD). The issue we're seeing is that the returned RDD is skewed across partitions: it always comes back with 9 partitions, only 3 of which are populated.
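For context, the read path looks roughly like this (hosts and index name below are placeholders; the mapPartitions count is just how we measured the skew):

```python
from pyspark import SparkContext

sc = SparkContext(appName="es-read")

# Placeholder connection settings -- adjust for your cluster.
es_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "my_index/my_type",  # placeholder index/type
}

rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)

# Count documents per partition to see the distribution.
sizes = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(sizes)  # in our case: 9 partitions, only 3 of them non-empty
```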
Is there any way to have newAPIHadoopRDD return more evenly distributed partitions from the ES data? I realize that I can repartition them after the fact, but the initial set of partitions is causing some performance headaches.
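The after-the-fact workaround mentioned above is just a repartition, which forces a full shuffle of the data (hence the performance headache):

```python
# Rebalance after the read -- works, but shuffles everything.
balanced = rdd.repartition(sc.defaultParallelism)
```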