Hi, I have 2 related questions regarding routing write requests. Thanks in
advance for answering!
Question 1:
I saw this line in the EsOutputFormat class and I was wondering why:
(https://github.com/elasticsearch/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/mr/EsOutputFormat.java#L221)
(https://github.com/elasticsearch/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/rest/RestRepository.java#L239)
Map<Shard, Node> targetShards = repository.getWriteTargetPrimaryShards();
repository.close();
List<Shard> orderedShards = new ArrayList<Shard>(targetShards.keySet());
// make sure the order is strict
Collections.sort(orderedShards);
// if there's no task info, just pick a random bucket
if (currentInstance <= 0) {
currentInstance = new Random().nextInt(targetShards.size()) + 1;
}
int bucket = currentInstance % targetShards.size();
Shard chosenShard = orderedShards.get(bucket);
Node targetNode = targetShards.get(chosenShard);
// override the global settings to communicate directly with the target node
settings.setHosts(targetNode.getIpAddress()).setPort(targetNode.getHttpPort());
repository = new RestRepository(settings);
uri = SettingsUtils.nodes(settings).get(0);
I was trying to understand why this is being done. My understanding of ES
writes was that:
- You can send a write request to any node in ES
- That node will receive the request, use the document id, hash it and
based on which node is supposed to be the primary - The write will be re-routed to the node meant to be the primary for
that document id/hashcode -
If this is the case (like in any data grid like Hazelcast, Coherence
(http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/distrib-write.html)
or NoSQL stores Hbase, Cassandra etc), why the special logic to find the
primaries?- This primary may not even be the one that will handle the write for
my document
- This primary may not even be the one that will handle the write for
-
Is this being done simply to load balance writes uniformly across the
(https://github.com/elasticsearch/elasticsearch-hadoop/edit/1.3/docs/src/reference/asciidoc/core/arch.adoc)
cluster? In that case why not do a simple round robin?
*Question 2:*Is there a way to hook into the routing logic of the cluster?
- Instead of picking up documents in bulk, sending them to a node in ES
- Which will then again resend the documents to the primaries for
different documents - Is there a way to find out which node is the primary for a document
- Collect all documents meant to go to a specific primary and send them
directly to that node?- (Like Hazelcast or Coherence's Partition awareness:
http://www.hazelcast.org/docs/3.0/manual/html-single/#DataAffinity) - The problem with this might be that if the primary dies, then the
sender has to be aware of the topology change and then resend the data to
the new primary
- (Like Hazelcast or Coherence's Partition awareness:
- So, we're not talking simple clients but smart-client libraries.
Perhaps I should be asking, does the Java client library already do this?
(I didn't think so) - Overall, wouldn't this save a lot of data re-routing over the network?
Thanks.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6996e823-0b41-42f8-b473-1afe0a8a4c1b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.