I want to use Spark to query ES, but I have a question about how data is transferred back and forth between the ES nodes and the Spark nodes on a large cluster setup, and how node affinity between the two worlds works.
We are setting up a large cluster of machines where every node will run both a Spark instance and an ES instance.
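For context, here is roughly how we plan to configure the connector on each node, a minimal sketch using documented es-hadoop settings (the application name is just an example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: each Spark worker is seeded with the ES instance on
// the same machine via "es.nodes"; with discovery enabled the connector
// learns about the rest of the ES cluster on its own.
val conf = new SparkConf()
  .setAppName("es-locality-test")   // example name
  .set("es.nodes", "localhost")     // seed node: the co-located ES instance
  .set("es.port", "9200")
  .set("es.nodes.discovery", "true")

val sc = new SparkContext(conf)
```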
When we launch a big query from Spark, is the es-hadoop connector able to understand the topology (shard distribution) of both clusters?
Our most important concern here is to avoid a situation where a query from the Spark cluster to the ES cluster first requires fetching the whole ES dataset to a single node, only for that data to then be redistributed to the very same machines for Spark processing.
For that, we need to be sure that a Spark job running a simple-yet-big query (no aggregation!) will have its RDD distribution across the cluster driven by the way ES has distributed the corresponding index shards on the cluster.
In other words, every Spark node should receive only the partial data that is co-located on the same machine in the ES node. This way, the ES data is transformed into a Spark RDD without anything needing to be transferred over the network.
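To make this concrete, here is a minimal sketch of the kind of job we have in mind, using the elasticsearch-spark API (the index name and query are hypothetical). We would inspect each partition's preferred locations to check whether they match the ES shard placement:

```scala
import org.elasticsearch.spark._

// Hypothetical index and query: a plain filter, no aggregation.
val rdd = sc.esRDD("logs/event", "?q=status:error")

// What we hope for: one Spark partition per ES shard, with preferred
// locations pointing at the machines actually hosting those shards.
println(s"partitions = ${rdd.partitions.length}")
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}
```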
Every ES node should know about and communicate with its co-located Spark node (we call this, perhaps wrongly, "node affinity" between the two worlds).
Can you confirm that such a mechanism exists? And if so, where can we find information about the constraints and conditions needed to make it work?
We understand that such an "affinity" system exists between ES and Hadoop, for instance when using Hive to query ES data.