Routing - Massive injection with bulk API

Hi,

I was wondering how the bulk API works for a massive indexing job on a multi-node cluster.
As I see it, there are two possibilities:

  • The Java client sends documents on a round-robin basis to each node of the cluster. Each node then checks the ID of each document and reroutes it, if necessary, to the correct shard.
  • The Java client computes the shard ID from the document ID for each document and sends the document directly to the correct node.

From looking at the source code, I think the first approach is the one implemented, but that seems odd to me because I naively think the second approach would be more efficient...

Thank you for your answer

The first approach is the one implemented, but see "Routing a Document to a Shard" in Elasticsearch: The Definitive Guide [2.x] for the exact way the shard ID is calculated.
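
For reference, the guide gives the rule as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document's _id. Here is a minimal sketch of that rule in Java. The class and method names are made up for illustration, and String.hashCode() is only a stand-in assumption for the hash function Elasticsearch uses internally, so the shard numbers it prints will not match a real cluster.

```java
// Illustrative sketch only, not Elasticsearch code.
class ShardRoutingSketch {

    // shard = hash(routing) % number_of_primary_shards
    // ASSUMPTION: String.hashCode() stands in for the real internal hash.
    static int shardFor(String routing, int numberOfPrimaryShards) {
        // floorMod keeps the result in [0, numberOfPrimaryShards)
        // even when the hash is negative.
        return Math.floorMod(routing.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        int primaryShards = 5; // hypothetical index with 5 primary shards
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, primaryShards));
        }
    }
}
```

The modulo also shows why the number of primary shards is fixed when the index is created: changing it would send existing IDs to different shards.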

You can also run a client node next to your application. It keeps track of where shards are located and does the routing for you, avoiding unnecessary network hops.

This is how bulk requests work:

  1. The Java client sends bulk requests in a round-robin manner to the connected cluster nodes.
  2. A connected cluster node receives the bulk request.
  3. The locations of the primary shards for all bulk items are calculated from the cluster state.
  4. The bulk request is partitioned into new bulk requests, one for each node that holds the relevant primary shards.
  5. The connected cluster node forwards these partial bulk requests to those nodes.
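
Steps 3 and 4 above can be sketched roughly as follows. This is a simplified model, not the actual Elasticsearch implementation: the class name, the shardToNode array (playing the role of the cluster state), and the String.hashCode() hash are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not Elasticsearch code.
class BulkPartitionSketch {

    // ASSUMPTION: String.hashCode() stands in for the real routing hash.
    static int shardFor(String docId, int numberOfPrimaryShards) {
        return Math.floorMod(docId.hashCode(), numberOfPrimaryShards);
    }

    // Groups document IDs by the node that holds their primary shard.
    // shardToNode is a hypothetical stand-in for the cluster state:
    // index = primary shard number, value = name of the node holding it.
    static Map<String, List<String>> partition(List<String> docIds, String[] shardToNode) {
        Map<String, List<String>> perNode = new LinkedHashMap<>();
        for (String id : docIds) {
            String node = shardToNode[shardFor(id, shardToNode.length)];
            perNode.computeIfAbsent(node, k -> new ArrayList<>()).add(id);
        }
        return perNode;
    }

    public static void main(String[] args) {
        // Four primary shards spread over three hypothetical nodes.
        String[] shardToNode = {"node-a", "node-b", "node-a", "node-c"};
        List<String> bulk = Arrays.asList("1", "2", "3", "4", "5", "6");

        // Each map entry corresponds to one partial bulk request (step 5
        // would forward it to that node).
        partition(bulk, shardToNode).forEach((node, ids) ->
                System.out.println(node + " <- " + ids));
    }
}
```

Whichever node receives the original bulk request does this grouping, which is why it needs access to the cluster state.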

+1

We just modified the Java documentation to describe this a bit: https://www.elastic.co/guide/en/elasticsearch/client/java-api/2.x/client-connected-to-client-node.html

Hi,

Thank you for all your answers.
If I understood correctly (please tell me if I am wrong :-) ):

  • If I use a "simple" bulk request from the Java client, the documents are sent from my Java process on a round-robin basis to data nodes, and each receiving node then creates sub-bulk requests to forward the documents to the nodes they belong to (based on the routing rules: https://www.elastic.co/guide/en/elasticsearch/guide/current/routing-value.html).
  • If I instantiate a client node in my ingesting process, that client node runs locally on my ingestion server and creates the sub-bulk requests for the correct data nodes itself, so documents are not sent over the network more than once.

Thank you

Correct.

TransportClient does not hold the cluster state, so it needs one extra hop: it sends bulk requests to a cluster node that does hold the cluster state, and that node acts as a proxy.

NodeClient already holds the cluster state, so it does not need a proxy node and can use that state to find the nodes where documents should be indexed.

Ok thanks :-). It is clear now.