How does elasticsearch-spark choose which ES node a Spark worker will connect to? Is it random? What if we run Elasticsearch and Spark on the same node (not on all nodes, but on some "aggregation" nodes; could that cause memory issues)? Will elasticsearch-spark detect that a worker is running on the same node as an ES instance and let that worker read from localhost instead?
I'm concerned that a lot of data will travel from ES to Spark across the network for every aggregation. We currently use ES as our primary store, which is why I'm asking.
The indexes I plan to read from (one index per aggregation) contain roughly 100 to 155 million documents, and the total index size is between 33 GB and 55 GB. We will probably not load a whole index at once, but split it up into logical parts based on our domain (which reduces the number of documents per aggregation to roughly 2.7 million or fewer).
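For reference, here's a rough sketch of how we'd read one of those logical parts with elasticsearch-spark (Scala). The index name, type, field, and query value are placeholders for our actual domain split, and the node addresses are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

// Hypothetical addresses; these would be the "aggregation" nodes
// that also run Spark workers.
val conf = new SparkConf()
  .setAppName("es-aggregation")
  .set("es.nodes", "aggnode1:9200,aggnode2:9200")

val sc = new SparkContext(conf)

// Read only one logical part of the index instead of the whole 33-55 GB,
// by pushing a query down to ES. "domainPart" is a placeholder field.
val part = sc.esRDD(
  "myindex/mytype",
  """{"query": {"term": {"domainPart": "part-42"}}}"""
)

println(part.count())
```

This requires the elasticsearch-spark connector on the classpath and a running cluster, so it's a configuration sketch rather than something runnable in isolation; the question is essentially whether `esRDD` here will route each partition's read to a co-located shard.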