Does elasticsearch-hadoop on Spark have data locality on write?

Suppose we have a Spark cluster with Elasticsearch running on each node.

We've created an RDD of documents somehow and want to write them to ES.

Will the documents be written locally? That is, if a partition is on node1, will all documents from it be sent to node1, with no documents transferred between hosts?

Reading the code (RestService):

// currentSplit is the partition id
// check invalid splits (applicable when running in non-MR environments) - in this case fall back to Random..
int selectedNode = (currentSplit < 0) ? new Random().nextInt(nodes.size()) : currentSplit % nodes.size();

If I understand this correctly, this is just round-robin, and the data will be tossed around the network.
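To make the quoted selection logic concrete, here is a minimal, self-contained sketch of how `currentSplit % nodes.size()` assigns partitions to nodes (the node names and partition count are invented for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class NodeSelection {
    // Mirrors the logic quoted above: a negative split id falls back to a
    // random node; otherwise partitions are assigned round-robin by modulo.
    static String selectNode(int currentSplit, List<String> nodes) {
        int selected = (currentSplit < 0)
                ? new Random().nextInt(nodes.size())
                : currentSplit % nodes.size();
        return nodes.get(selected);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3");
        for (int split = 0; split < 5; split++) {
            System.out.println("partition " + split + " -> " + selectNode(split, nodes));
        }
        // partition 0 -> node1, 1 -> node2, 2 -> node3, 3 -> node1, 4 -> node2
    }
}
```

Note that the assignment depends only on the partition number, not on where the partition physically lives, which is why locality is not preserved.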

That is correct. Most write patterns do not end up with all the data in a request going to a single node, and it's impossible for us to determine which shard a document will be hashed to at write time without sending the data to the cluster and letting it decide. Even if an Elasticsearch node is local to the write process, the chance that that node will need to send data over the network depends on how many shards you have and how many nodes they're spread across.
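The reason the target node is unknowable up front is Elasticsearch's routing formula: a document goes to shard `hash(routing) % number_of_primary_shards`, where the routing value defaults to the document `_id` and the real hash is Murmur3. A simplified sketch, using `String.hashCode()` as a stand-in for Murmur3 just to show the mechanics:

```java
import java.util.Arrays;

public class ShardRouting {
    // Simplified stand-in for Elasticsearch's routing. The real cluster hashes
    // the routing value (the _id by default) with Murmur3; any deterministic
    // hash suffices here to show that the target shard -- and hence the target
    // node -- depends on the document id, not on where the writer runs.
    static int shardFor(String docId, int numPrimaryShards) {
        return Math.floorMod(docId.hashCode(), numPrimaryShards);
    }

    public static void main(String[] args) {
        int shards = 5;
        for (String id : Arrays.asList("doc-1", "doc-2", "doc-3")) {
            System.out.println(id + " -> shard " + shardFor(id, shards));
        }
    }
}
```

Since different ids in one partition hash to different shards, a single bulk request from one partition is fanned out across the cluster regardless of which node first receives it.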

As such, for write operations we filter the node list down to only the nodes hosting shards of the index we're writing to. Since there's no way to know which node the data will end up on, we round-robin the partitions across the target nodes to spread the load evenly. If we preferred local-node writes, it would upset the balance of traffic to the target nodes, which can cause slowdowns in processing.

We could potentially add a property that skips the filtering process and prefers talking to a local node for all writes, letting the local node handle all the traffic control of the data. Could you describe your current setup and whether it might take advantage of this?

That feature would be very nice!

I have two setups:

  1. A cluster where each node has 1 ES instance running. Spark is launched in standalone mode, 1 worker per node. The Spark master is located on one of the nodes.
  2. Same, but cluster has 2 instances of ES on each node.

One use case for me is reading documents from Kafka and writing them into ES. The Kafka brokers are also located one per node. So, with this feature enabled, all writes would be local. Currently, round-robining produces network load.

What additional information could I provide you about my setup?

What is your distribution of shards over the cluster for Elasticsearch, and are you using dynamic resource writing in the connector?

The number of shards is greater than the number of nodes, and no shard allocation filtering is enabled, so I think it is safe to assume that each node holds at least one shard of each index.

The whole cluster is homogeneous, i.e. all nodes are essentially identical.

I'm not using dynamic resources in es-hadoop, but I have a feature request for a related issue:
Currently I create a REST client for each partition and perform bulk requests with it.
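For context, a per-partition bulk write typically builds an NDJSON `_bulk` body inside `foreachPartition` and POSTs it to a node. A minimal sketch of the payload construction only (the index name and documents are invented; the HTTP call and Spark wiring are omitted):

```java
import java.util.Arrays;
import java.util.List;

public class BulkPayload {
    // Builds an Elasticsearch _bulk request body: each document is preceded
    // by an action metadata line, and every line -- including the last one --
    // must end with '\n', as the _bulk API requires.
    static String bulkBody(String index, List<String> jsonDocs) {
        StringBuilder sb = new StringBuilder();
        for (String doc : jsonDocs) {
            sb.append("{\"index\":{\"_index\":\"").append(index).append("\"}}\n");
            sb.append(doc).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("{\"msg\":\"a\"}", "{\"msg\":\"b\"}");
        System.out.print(bulkBody("logs", docs));
        // {"index":{"_index":"logs"}}
        // {"msg":"a"}
        // {"index":{"_index":"logs"}}
        // {"msg":"b"}
    }
}
```

Whichever node receives this request acts as coordinator and forwards each document to the shard it hashes to, which is why even a "local" bulk POST still generates inter-node traffic.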

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.