I was wondering how the bulk API works for a massive indexing job on a multi-node cluster.
As I see it, there are two possibilities:
The Java client sends documents on a round-robin basis to each node of the cluster. Each node then checks the ID of each document and reroutes it, if necessary, to the node holding the correct shard.
The Java client computes the shard ID from the document ID for each document and sends the document directly to the correct node.
From looking at the source code, I think the first approach is the one implemented, which seems odd to me because I would naively expect the second approach to be more efficient...
If I instantiate a client node in my ingesting process, that client node will run locally on my ingesting server and will itself split each bulk request into sub-requests for the correct data nodes, so no document is sent over the network more than once.
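
To make the second approach concrete, here is a minimal sketch of the client-side splitting I have in mind. It assumes the shard is derived from a hash of the document ID modulo the number of primary shards. The class and method names (BulkRoutingSketch, shardFor, groupByShard) are hypothetical, and String.hashCode() is only a stand-in for whatever hash the cluster really uses, so the shard numbers it produces would not match a real cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BulkRoutingSketch {

    // Hypothetical helper: maps a document ID to a shard number.
    // String.hashCode() is a stand-in, NOT the hash Elasticsearch uses.
    static int shardFor(String documentId, int numberOfPrimaryShards) {
        // floorMod keeps the result in [0, numberOfPrimaryShards) even for negative hashes.
        return Math.floorMod(documentId.hashCode(), numberOfPrimaryShards);
    }

    // Splits one bulk request (reduced here to a list of document IDs)
    // into one sub-request per target shard.
    static Map<Integer, List<String>> groupByShard(List<String> documentIds,
                                                   int numberOfPrimaryShards) {
        Map<Integer, List<String>> subRequests = new TreeMap<>();
        for (String id : documentIds) {
            int shard = shardFor(id, numberOfPrimaryShards);
            subRequests.computeIfAbsent(shard, k -> new ArrayList<>()).add(id);
        }
        return subRequests;
    }

    public static void main(String[] args) {
        List<String> bulk = Arrays.asList("a", "b", "c", "d");
        // prints {0=[c], 1=[a, d], 2=[b]}
        System.out.println(groupByShard(bulk, 3));
    }
}
```

Each entry of the resulting map would then be sent once to the node holding that shard, instead of going through an intermediate node first.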