Spark read from elasticsearch and primary shards

vincent_gromakowski · March 23, 2016, 1:24pm

I would like to know how shards are balanced when reading from Spark. I have checked the documentation, where I can see that Spark can read from either primary shards or replicas, but it's unclear how the mapping is done and whether shards are well balanced across Spark tasks. Can you explain how the connector ensures that the shards of a multi-shard index are correctly balanced across Spark partitions?
For instance, if I have a 4-shard index with 1 replica on 4 ES nodes, how can I be sure that each colocated Spark worker will get one local shard, even if 2 primary shards are located on the same ES node?

costin · April 5, 2016, 2:31pm

You can't. Neither Hadoop nor Spark offers any guarantee about where a certain task will execute. In fact, any information from an RDD regarding a shard/partition location is provided through a method called preferredLocations. That is, the information is a hint: it may or may not be used.
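To make the "hint" idea concrete, here is a minimal toy sketch in plain Python. It is not Spark's scheduler; the function assign_tasks and the host names are hypothetical, and the fallback rule (take any host with a free slot) is an assumption chosen only for illustration.

```python
# Toy model (NOT Spark's scheduler): a preferred location is honored only
# when that host has a free slot; otherwise the task runs somewhere else.

def assign_tasks(preferred, free_slots):
    """preferred: task -> preferred host; free_slots: host -> free slot count."""
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    placement = {}
    for task, host in preferred.items():
        if slots.get(host, 0) > 0:
            chosen = host  # locality hint honored
        else:
            # Hint ignored: fall back to any host that still has capacity.
            chosen = next(h for h, n in sorted(slots.items()) if n > 0)
        slots[chosen] -= 1
        placement[task] = chosen
    return placement


# Two shards that both prefer es-node-1, with one free slot per host.
placement = assign_tasks(
    {"shard-0": "es-node-1", "shard-1": "es-node-1"},
    {"es-node-1": 1, "es-node-2": 1},
)
print(placement)
```

In this model the second task ends up on es-node-2 even though its data is on es-node-1, which is the situation the question asks about.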

Furthermore, ES-Hadoop/Spark has no information (nor should it have) about which nodes Spark is running on - it only knows which nodes the data in ES resides on, and it returns the data based on the _search_shards API, which load-balances across the various nodes in a round-robin fashion.
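The rotation across nodes can be pictured with a small toy model. This is only a sketch of the idea, not how Elasticsearch implements it; the function rotate_copies, the node names, and the shard layout (4 shards, one replica each, with shards 0 and 1 having their first copy on the same node) are all assumptions made for illustration.

```python
# Toy model (NOT Elasticsearch internals): each shard has several copies
# (primary or replica, treated alike), and successive requests rotate
# through them.

def rotate_copies(shard_copies, request_no):
    """Return, per shard, the node chosen for request number request_no."""
    return {
        shard: nodes[request_no % len(nodes)]
        for shard, nodes in shard_copies.items()
    }


# shard id -> nodes holding a copy of that shard
copies = {
    0: ["node-a", "node-b"],
    1: ["node-a", "node-c"],
    2: ["node-c", "node-d"],
    3: ["node-d", "node-b"],
}

first = rotate_copies(copies, 0)
second = rotate_copies(copies, 1)
print(first)
print(second)
```

The two requests get different node sets for the same shards, so which node serves a given shard is not something the reader can pin down ahead of time.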

As far as ES is concerned, for reading purposes a primary and a replica are identical. Trying to lock down certain operations to a dedicated host is error-prone, since the activity against a cluster is not static.
