I was wondering how the bulk API works for a massive indexing job on a multi-node cluster.
For me there are two possibilities:
1. The Java client sends documents on a round-robin basis to each node of the cluster. Each node then checks the ID of each document and reroutes it, if necessary, to the correct shard.
2. The Java client computes the shard ID from the document ID for each document and sends the document directly to the correct node.
If I have a look at the source code, I think the first approach is the one implemented, but that seems odd to me because I naively think that the second approach would be more efficient...
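For context, this is roughly what the bulk call looks like from the Java client side, whichever routing approach is used internally. This is only a sketch: the index name, type, and document IDs are made up, and obtaining the `Client` (transport client or node client) is assumed to have happened elsewhere.

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class BulkExample {
    // "client" may be a TransportClient or a node client; the cluster
    // connection setup is omitted and assumed to exist already.
    public static void indexBatch(Client client) {
        BulkRequestBuilder bulk = client.prepareBulk();

        // Add a batch of index operations; index name, type and IDs are
        // placeholders for this example.
        for (int i = 0; i < 1000; i++) {
            bulk.add(client.prepareIndex("my-index", "my-type", String.valueOf(i))
                    .setSource("{\"field\":\"value " + i + "\"}"));
        }

        // One bulk request goes out over the network; how it is split into
        // per-shard sub-requests is exactly the question discussed above.
        BulkResponse response = bulk.get();
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
    }
}
```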
You can also place a client node next to the client, and this will keep track of where shards are located and do the routing for you, avoiding unnecessary network hops.
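A client node can be started inside the ingesting JVM with the older `NodeBuilder` API (pre-5.x, which matches the clients discussed in this thread). The cluster name below is hypothetical; this is only a sketch of the idea.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class ClientNodeExample {
    public static void main(String[] args) {
        // Start an in-process "client" node: it joins the cluster and receives
        // the cluster state, but holds no data and is not master-eligible.
        Node node = nodeBuilder()
                .clusterName("my-cluster")   // hypothetical cluster name
                .client(true)                // no data, not master-eligible
                .node();

        Client client = node.client();

        // ... send bulk requests through "client"; because this node knows the
        // shard locations, sub-requests go straight to the right data nodes.

        node.close();
    }
}
```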
Thank you for all your answers.
If I understood correctly (please tell me if I am wrong):
If I use a "simple" bulk request from the Java client, the documents will be sent from my Java process on a round-robin basis to the data nodes, and then each node will create sub bulk requests to forward the documents to the nodes they belong to (based on the routing rules described at https://www.elastic.co/guide/en/elasticsearch/guide/current/routing-value.html; see the sketch after this list).
If I instantiate a client node in my ingesting process, then a client node will run locally on my ingesting server and will itself create the sub bulk requests to the correct data nodes, avoiding transmitting the documents over the network more than once.
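The routing rule linked above boils down to `shard_num = hash(_routing) % number_of_primary_shards`, where the routing value defaults to the document ID. Below is a simplified sketch of that rule; Elasticsearch actually uses a Murmur3 hash, and `String.hashCode()` here is only a stand-in for illustration.

```java
public class RoutingSketch {
    // Simplified version of the documented routing rule:
    //   shard_num = hash(_routing) % number_of_primary_shards
    // Elasticsearch hashes the routing value (the document ID by default)
    // with Murmur3; String.hashCode() is just a placeholder here.
    static int shardFor(String routingValue, int numberOfPrimaryShards) {
        int hash = routingValue.hashCode();
        return Math.floorMod(hash, numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        // e.g. a document with ID "42" on an index with 5 primary shards
        System.out.println(shardFor("42", 5));
    }
}
```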
The TransportClient does not hold the cluster state, so it uses one extra hop to transmit bulk requests to a cluster node that does hold the cluster state. That node acts like a proxy.
The NodeClient already holds the cluster state, so it does not need a proxy node and can use the state to find the nodes where the documents should be indexed.
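For completeness, this is roughly how a TransportClient is built in the 2.x-era API (version, cluster name, and host are assumptions for the example):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class TransportClientExample {
    public static TransportClient build() throws UnknownHostException {
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "my-cluster")   // hypothetical cluster name
                .build();

        // The TransportClient does not join the cluster and holds no cluster
        // state; bulk requests are sent round robin to the listed nodes, and
        // the receiving node forwards the documents to the correct shards
        // (the extra hop described above).
        return TransportClient.builder()
                .settings(settings)
                .build()
                .addTransportAddress(
                        new InetSocketTransportAddress(InetAddress.getByName("es-node1"), 9300));
    }
}
```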