Is elasticsearch-spark reading from localhost if ES and Spark are running on the same node?

baakind · December 2, 2015, 1:27pm

How does elasticsearch-spark choose which ES node a Spark worker is going to access? Is it random? What if we are running Elasticsearch and Spark on the same node (not on all nodes, but on some "aggregation" nodes; memory issues?). Will elasticsearch-spark find out that a worker is running on the same node as the ES instance, and let the worker read from localhost instead?

I'm concerned that we will have a lot of data going from ES to Spark across the network for each aggregation. We are currently using ES as our primary store, which is why I'm asking.

The indexes I plan to read from (one index per aggregation) have approximately 100,000,000 to 155,000,000 documents, and the indexes are between 33 GB and 55 GB in size. We will probably not end up loading a whole index at once, but will split it into logical parts based on our domain (which will reduce the number of documents for each aggregation to approximately 2,700,000 or fewer).

costin · December 8, 2015, 2:16pm

See the architecture section. es-spark does not get to choose its workers; it simply tells Spark where the data (the ES shard) is located. Spark might or might not make use of that information: it's a hint, not a requirement.
Also, typically, and especially in jobs with multiple aggregate/reduce operations, once the data has been read and processed the first time, its location is no longer relevant (as subsequent reads happen from memory).
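
To make the "hint, not a requirement" point concrete, here is a minimal toy model in plain Python. It is not es-spark or Spark code: `Partition`, `assign_tasks`, `count_local` and the host names are all hypothetical, and the fallback rule (pick the host with the most free slots) is an assumption made for illustration. Spark's real scheduler is more elaborate; for example, it can wait a while for a local slot before giving up on locality.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Partition:
    """One input partition; in this toy model it maps to one ES shard."""
    shard_id: int
    preferred_host: str  # host holding the shard, i.e. the locality hint


def assign_tasks(partitions: List[Partition],
                 free_slots: Dict[str, int]) -> Dict[int, str]:
    """Assign every partition to a host that still has a free slot.

    The preferred host is tried first. If it is full, the task runs on
    the host with the most free slots and would have to read its shard
    over the network. (Hypothetical rule, not Spark's actual algorithm.)
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment: Dict[int, str] = {}
    for p in partitions:
        if slots.get(p.preferred_host, 0) > 0:
            host = p.preferred_host  # hint honoured: local read
        else:
            host = max(slots, key=slots.get)  # hint ignored: remote read
            if slots[host] == 0:
                raise RuntimeError("no free executor slots")
        slots[host] -= 1
        assignment[p.shard_id] = host
    return assignment


def count_local(partitions: List[Partition],
                assignment: Dict[int, str]) -> int:
    """Count the tasks that ended up on the host holding their shard."""
    return sum(1 for p in partitions
               if assignment[p.shard_id] == p.preferred_host)


if __name__ == "__main__":
    parts = [Partition(0, "node-a"), Partition(1, "node-a"), Partition(2, "node-b")]
    result = assign_tasks(parts, {"node-a": 1, "node-b": 1, "node-c": 2})
    print(result, count_local(parts, result))
```

With one free slot on `node-a`, shard 0 is read locally, while shard 1 (also on `node-a`) is sent to `node-c` and read over the network. Two of the three tasks are local here. Co-locating ES and Spark therefore makes local reads possible, but does not guarantee them.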
