I am using elasticsearch-hadoop 5.6.2 with Spark 2.3 to read from elasticsearch 5.6.1 indices and aliases. The ES indices are configured with 5 primary shards and 1 replica each.
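For context, the relevant connector settings look roughly like this (the host and index names are placeholders, and es.input.max.docs.per.partition is shown explicitly at what I believe is its default):

```
# es-hadoop connector settings (sketch; host and index are placeholders)
es.nodes = es-host:9200
es.resource = my-index/my-type
es.input.max.docs.per.partition = 100000
```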
In my case, the RDD returned by JavaEsSpark.esRDD() is heavily imbalanced in terms of the data contained in each partition. Below is an example of the number of documents per RDD partition:
partition number: 0, document count: 0
partition number: 1, document count: 0
partition number: 2, document count: 0
partition number: 3, document count: 355326
partition number: 4, document count: 355518
partition number: 5, document count: 0
partition number: 6, document count: 0
partition number: 7, document count: 0
partition number: 8, document count: 0
partition number: 9, document count: 355436
partition number: 10, document count: 356470
partition number: 11, document count: 0
partition number: 12, document count: 356520
partition number: 13, document count: 0
partition number: 14, document count: 0
Only 5 partitions (which seem to correspond to the 5 ES primary shards) contain data; the other partitions are empty.
I am using the default value (100000) for the es.input.max.docs.per.partition setting. My understanding of this setting was that, with the default value, no RDD partition should contain more than 100000 documents. Is my understanding of this setting incorrect, or is this a bug?
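To make my expectation concrete: based on the setting's name, I assumed each shard would be split into ceil(shardDocs / maxDocsPerPartition) input partitions (this formula is my assumption, not confirmed connector behavior). With ~355k documents per shard, that would give 4 slices per shard and 20 partitions total, none over 100000 documents:

```java
// Sketch of my expectation for how es.input.max.docs.per.partition
// should split shards into Spark partitions. The ceil-division formula
// is my assumption based on the setting's name, not documented behavior.
public class SliceExpectation {
    static int expectedSlices(long docsInShard, long maxDocsPerPartition) {
        // ceil division: a shard holding more docs than the limit should
        // be split into multiple input partitions (slices)
        return (int) ((docsInShard + maxDocsPerPartition - 1) / maxDocsPerPartition);
    }

    public static void main(String[] args) {
        long maxDocs = 100_000L; // default es.input.max.docs.per.partition
        // per-shard document counts taken from the output above
        long[] shardDocCounts = {355326, 355518, 355436, 356470, 356520};

        int totalPartitions = 0;
        for (long docs : shardDocCounts) {
            totalPartitions += expectedSlices(docs, maxDocs);
        }
        System.out.println("expected partitions: " + totalPartitions); // prints 20
    }
}
```

Instead, I observe 15 partitions, of which only 5 (one per shard, each with ~355k documents) are non-empty.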