Does elasticsearch-spark read from localhost if ES and Spark are running on the same node?


(Andreas Baakind) #1

How does elasticsearch-spark choose which ES node a Spark worker is going to access? Is it random? What if we are running Elasticsearch and Spark on the same node (not on all nodes, but on some "aggregation" nodes; memory issues?)? Will elasticsearch-spark find out that a worker is running on the same node as the ES instance and let the worker read from localhost instead?

I'm concerned that we will have a lot of data going from ES to Spark across the network for each aggregation. We are currently using ES as our primary store, which is why I'm asking.

The indexes I plan to read from (one index per aggregation) have approx. 100,000,000 to 155,000,000 documents, and the indexes are between 33 GB and 55 GB in size. We will probably not end up loading a whole index at once, but will split it up into logical parts based on our domain (which will reduce the number of documents for each aggregation to approx. 2,700,000 or fewer).


(Costin Leau) #2

See the architecture section. es-spark does not get to choose its workers; it simply tells Spark where the data (the ES shard) is located. Spark might or might not make use of that: it's a hint, not a requirement.
Also, typically, and especially in jobs with multiple aggregate/reduce operations, once the data has been read and processed the first time, its location is not relevant any more (as the subsequent reads happen from memory).
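
To make the two points above concrete, here is a minimal PySpark sketch, not a definitive setup. It assumes the elasticsearch-hadoop jar is on the Spark classpath and reads through the connector's `EsInputFormat`. The index name `events/doc`, the field `customer_id`, and the helper names `build_es_conf` and `read_slice` are hypothetical and only serve as illustration. The `es.query` setting pushes the domain filter down to Elasticsearch, so only the matching documents cross the network, and `cache()` asks Spark to keep the slice in memory for the aggregations that follow.

```python
import json


def build_es_conf(resource, domain_field, domain_value, nodes="localhost:9200"):
    """Build es-hadoop settings for reading one logical slice of an index.

    `resource`, `domain_field` and `domain_value` are placeholders for your
    own index and domain key (assumptions, not taken from the thread).
    """
    # Query DSL filter that is executed on the ES side, not in Spark.
    query = {"query": {"term": {domain_field: domain_value}}}
    return {
        "es.nodes": nodes,             # initial ES node(s) to connect to
        "es.resource": resource,       # index (and type) to read from
        "es.query": json.dumps(query), # pushed-down filter
    }


def read_slice(sc, conf):
    """Read the slice as an RDD of (doc_id, document) pairs and cache it."""
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )
    # After the first action, later aggregations can read from memory
    # instead of going back to Elasticsearch.
    return rdd.cache()
```

With an existing `SparkContext` named `sc`, usage would look like `docs = read_slice(sc, build_es_conf("events/doc", "customer_id", "acme"))`. Whether a given task actually runs next to the shard it reads is still up to Spark's scheduler.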

