I was wondering how the bulk API works for a massive indexing job on a multi-node cluster.
For me there are two possibilities:
1. The Java client sends documents on a round-robin basis to each node of the cluster. Each node then checks the ID of each document and reroutes it, if necessary, to the correct shard.
2. The Java client computes the shard ID from the document ID for each document and sends the document directly to the correct node.
If I have a look at the source code, I think the first approach is the one implemented, but that seems odd to me because I naively think that the second approach would be more efficient...
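For context, this is roughly what the bulk call looks like from the Java client side, whichever routing approach is used internally. This is only a sketch: the index name, type, and document IDs are made up, and obtaining the `Client` (transport client or node client) is assumed to have happened elsewhere.

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class BulkExample {
    // "client" may be a TransportClient or a node client; the cluster
    // connection setup is omitted and assumed to exist already.
    public static void indexBatch(Client client) {
        BulkRequestBuilder bulk = client.prepareBulk();

        // Add a batch of index operations; index name, type and IDs are
        // placeholders for this example.
        for (int i = 0; i < 1000; i++) {
            bulk.add(client.prepareIndex("my-index", "my-type", String.valueOf(i))
                    .setSource("{\"field\":\"value " + i + "\"}"));
        }

        // One bulk request goes out over the network; how it is split into
        // per-shard sub-requests is exactly the question discussed above.
        BulkResponse response = bulk.get();
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
    }
}
```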
You can also place a client node next to the client, and this will keep track of where shards are located and do the routing for you, avoiding unnecessary network hops.
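A client node can be started inside the ingesting JVM with the older `NodeBuilder` API (pre-5.x, which matches the clients discussed in this thread). The cluster name below is hypothetical; this is only a sketch of the idea.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class ClientNodeExample {
    public static void main(String[] args) {
        // Start an in-process "client" node: it joins the cluster and receives
        // the cluster state, but holds no data and is not master-eligible.
        Node node = nodeBuilder()
                .clusterName("my-cluster")   // hypothetical cluster name
                .client(true)                // no data, not master-eligible
                .node();

        Client client = node.client();

        // ... send bulk requests through "client"; because this node knows the
        // shard locations, sub-requests go straight to the right data nodes.

        node.close();
    }
}
```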
Thank you for all your answers.
If I understood correctly (please tell me if I am wrong):
If I use a "simple" bulk request from the Java client, the documents will be sent from my Java process on a round-robin basis to the data nodes, and then each node will create sub bulk requests to forward the documents to the nodes they belong to (based on the routing rules described at https://www.elastic.co/guide/en/elasticsearch/guide/current/routing-value.html; see the sketch after this list).
If I instantiate a client node in my ingesting process, then a client node will run locally on my ingesting server and will itself create the sub bulk requests to the correct data nodes, avoiding transmitting the documents over the network more than once.
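The routing rule linked above boils down to `shard_num = hash(_routing) % number_of_primary_shards`, where the routing value defaults to the document ID. Below is a simplified sketch of that rule; Elasticsearch actually uses a Murmur3 hash, and `String.hashCode()` here is only a stand-in for illustration.

```java
public class RoutingSketch {
    // Simplified version of the documented routing rule:
    //   shard_num = hash(_routing) % number_of_primary_shards
    // Elasticsearch hashes the routing value (the document ID by default)
    // with Murmur3; String.hashCode() is just a placeholder here.
    static int shardFor(String routingValue, int numberOfPrimaryShards) {
        int hash = routingValue.hashCode();
        return Math.floorMod(hash, numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        // e.g. a document with ID "42" on an index with 5 primary shards
        System.out.println(shardFor("42", 5));
    }
}
```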
The TransportClient does not hold the cluster state, so it uses one extra hop to transmit bulk requests to a cluster node that does hold the cluster state. That node acts like a proxy.
The NodeClient already holds the cluster state, so it does not need a proxy node and can use the state to find the nodes where the documents should be indexed.
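For completeness, this is roughly how a TransportClient is built in the 2.x-era API (version, cluster name, and host are assumptions for the example):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class TransportClientExample {
    public static TransportClient build() throws UnknownHostException {
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "my-cluster")   // hypothetical cluster name
                .build();

        // The TransportClient does not join the cluster and holds no cluster
        // state; bulk requests are sent round robin to the listed nodes, and
        // the receiving node forwards the documents to the correct shards
        // (the extra hop described above).
        return TransportClient.builder()
                .settings(settings)
                .build()
                .addTransportAddress(
                        new InetSocketTransportAddress(InetAddress.getByName("es-node1"), 9300));
    }
}
```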