I would like to know how the shards are balanced when reading from Spark. I have checked the documentation where I can see Spark can read from both primary shards or replica but it's unclear how the mapping is done and if shards are well balanced against Spark tasks. Can you give some explanation on how the connector ensure multiple shards index are correctly balanced in Spark partitions ?
For instance, if I have a 4 shards index with 1 replication level on 4 ES nodes, how can I be sure that colocated Spark worker will get each one local shard even if 2 primary shards are located on the same ES node ?
You can't. Neither Hadoop nor Spark offer any guarantees on where a certain task will execute. In fact any information from an RDD regarding a shard/partition location is provided through a method which is called preferredLocation
. That is, that information is a hint - it might be or not be used.
Further more, ES-Hadoop/Spark has no information (nor should it have) on what nodes Spark is running - it only knows on what nodes the data in ES resides and it returns the data based on the search_shard API which load balances across the various nodes in a merry go around fashion.
As far as ES is concerned, for reading purposes a primary and a replica are identical. Trying to lock down certain operations to a dedicated host is error prone since the activity against a cluster is not static.