I have been running some performance tests with bulk indexing requests. One thing that puzzles me is that I see very similar latency and cluster CPU usage with the node client and the transport client (the client is on the same network, cluster of 4 nodes). I am not that familiar with the client code, but as far as I can tell, the node client's bulk request divides the documents up per shard and routes them in multiple requests. I would expect this to lower cluster CPU usage and latency, since the documents are sent to the right node immediately rather than going through a potential two-hop operation as with the transport client. My reasoning is that on the initial client -> node request, the receiving node may need to parse the request and then re-route the portion of it that belongs to other nodes, and those re-routed requests would then have to be re-parsed.
Bulk requests are not parsed before they reach the destination node. They are split on a stream separator character (the line feed), so the destination node can be determined merely by looking at each bulk item's header, without parsing the document itself.
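To illustrate the point, here is a minimal sketch of header-only routing. It is not Elasticsearch's actual code: the `shard_for` hash is a stand-in (Elasticsearch really uses murmur3 of the routing value), and the bulk body and field names are made up. The key property it demonstrates is that only the small action header line of each item is JSON-parsed; the document source line is passed along as an opaque string.

```python
import json
import zlib

def shard_for(routing_value, num_shards):
    # Stand-in hash; Elasticsearch actually applies murmur3 to the routing value.
    return zlib.crc32(routing_value.encode()) % num_shards

def route_bulk(body, num_shards):
    """Pick a target shard for each bulk item by reading only its action
    header line; document source lines are never parsed."""
    lines = [l for l in body.split("\n") if l.strip()]
    routed = []
    i = 0
    while i < len(lines):
        header = json.loads(lines[i])            # small header, e.g. {"index": {"_id": "a"}}
        action, meta = next(iter(header.items()))
        routing = meta.get("routing", meta["_id"])
        if action == "delete":                   # delete items carry no source line
            routed.append((shard_for(routing, num_shards), lines[i]))
            i += 1
        else:                                    # index/create/update: header + source line
            routed.append((shard_for(routing, num_shards), lines[i], lines[i + 1]))
            i += 2
    return routed

body = (
    '{"index": {"_id": "a"}}\n{"f": 1}\n'
    '{"delete": {"_id": "b"}}\n'
    '{"index": {"_id": "c"}}\n{"f": 2}\n'
)
routed = route_bulk(body, num_shards=4)
```

Each entry in `routed` pairs a shard number with the raw lines of one bulk item, ready to be forwarded verbatim.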
The bulk action sorts the bulk items so that all docs for a shard are sent in a single network transmission.
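The per-shard grouping can be sketched in a few lines. The item tuples and field names below are hypothetical; the point is simply that items destined for the same shard are concatenated into one payload, so each shard receives one sub-request rather than one request per document.

```python
from collections import defaultdict

# Hypothetical routed bulk items: (target_shard, raw_bulk_lines) pairs.
items = [
    (2, '{"index": {"_id": "a"}}\n{"f": 1}'),
    (0, '{"index": {"_id": "b"}}\n{"f": 2}'),
    (2, '{"delete": {"_id": "c"}}'),
]

groups = defaultdict(list)
for shard, raw in items:
    groups[shard].append(raw)

# One newline-delimited payload, hence one network transmission, per shard:
per_shard_payloads = {shard: "\n".join(raws) + "\n" for shard, raws in groups.items()}
```

Here shard 2 gets a single payload carrying both of its items, and shard 0 gets its own.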
The extra hop with TransportClient compared to NodeClient is real, but it takes enormous load and network latency before you see an effect. And there are other tradeoff factors. For example, if expensive code runs on the TransportClient machine to build the JSON docs, that takes CPU load off the ES cluster, because otherwise the work would have to be performed on the cluster side. This CPU win has to be balanced against the slightly slower network transmission.