It means that ES-Hadoop is a REST client and does not use the ES Java client, so there is no concept of a node or transport client.
In other words, no. And this is on purpose: if one creates a Spark job with 20 tasks, the cluster suddenly has 20 node clients for no real reason. Add another node and one ends up with 40.
Hmm... I was hoping to use the node-client feature along with shard-specific routing to make sure our updates go directly to a primary shard. Our use case involves heavy batches being sent to ES for indexing, and a node-client option would have saved us a network hop on the way to the primary.
If node-client was present, we could do:
storm-bolt ----directly-to------> ES primary
But without node-client, we have to:
storm-bolt ---------> Any ES node (could be replica) --------redirect-to-----> ES primary
We are guessing that a node client would have been more efficient (although we don't yet have data to prove it).
As a last resort, we plan to find the primary ourselves (by running a background thread that tracks the cluster state) and always try to route to the primary.
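For what it's worth, the "find the primary ourselves" idea could look roughly like the sketch below. It assumes the background thread polls the `_cat/shards` REST endpoint with `h=index,shard,prirep,state,ip,node`, so each row has exactly those columns; the function name `primary_locations` and the sample text are made up for illustration and not taken from a real cluster.

```python
from typing import Dict, Tuple


def primary_locations(cat_shards_text: str) -> Dict[Tuple[str, int], str]:
    """Map (index, shard number) to the IP of the node holding the started primary."""
    primaries: Dict[Tuple[str, int], str] = {}
    for line in cat_shards_text.splitlines():
        parts = line.split()
        # Assumed columns: index shard prirep state ip node.
        # Unassigned shards have no ip/node columns, so short rows are skipped.
        if len(parts) < 6:
            continue
        index, shard, prirep, state, ip = parts[:5]
        # Keep only primaries ("p") that are fully started; relocating or
        # initializing copies are ignored in this sketch.
        if prirep == "p" and state == "STARTED":
            primaries[(index, int(shard))] = ip
    return primaries


# Hypothetical sample response, written by hand for this example.
SAMPLE = """\
logs 0 p STARTED 10.0.0.1 node-a
logs 0 r STARTED 10.0.0.2 node-b
logs 1 p STARTED 10.0.0.2 node-b
logs 1 r STARTED 10.0.0.1 node-a
logs 2 p UNASSIGNED
"""

if __name__ == "__main__":
    for key, ip in sorted(primary_locations(SAMPLE).items()):
        print(key, "->", ip)
```

Primaries can move between polls, so a map like this is only a hint: a request sent to a stale address should still be forwarded by the cluster, it just pays the extra hop again.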
Note that we also plan to use the routing option so that bolts do not have to send to many primary shards. (We never want to search by ID and are sure that we can balance the documents appropriately among shards.)
A node client might or might not improve performance. As I've mentioned, a node client is actually part of the cluster and thus needs to be aware of its metadata, etc.
By simply talking to one of the cluster nodes, one delegates everything to the existing cluster; with the node client one gains the ability to do the document splitting (deciding which shard each document belongs to) on the client, at the expense of cluster state synchronization.
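To make the "document splitting" part concrete, here is a minimal sketch of what a client would have to compute. Elasticsearch's documented routing is roughly `hash(routing) % number_of_primary_shards`; the sketch uses `zlib.crc32` purely as a stand-in for the real hash (Murmur3), so the shard numbers it produces will not match an actual cluster. The names `shard_for` and `group_by_shard` are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(routing: str, num_primary_shards: int) -> int:
    # ASSUMPTION: crc32 is a stand-in for Elasticsearch's real routing hash
    # (Murmur3); only the "hash modulo shard count" shape is illustrated.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def group_by_shard(routing_keys: Iterable[str],
                   num_primary_shards: int) -> Dict[int, List[str]]:
    """Bucket routing keys by the shard they would land on."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for key in routing_keys:
        groups[shard_for(key, num_primary_shards)].append(key)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical routing keys; the same key always maps to the same shard.
    keys = ["customer-1", "customer-2", "customer-1", "customer-3"]
    for shard, members in sorted(group_by_shard(keys, 5).items()):
        print(shard, members)
```

The modulo depends on the number of primary shards, which is the piece of cluster metadata a client has to keep in sync before it can do this split on its own.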
At a higher level, the work hasn't really disappeared; it has just moved from one machine to another. If your cluster is not changing a lot and you do a lot of document inserts, it might pay off; if not, you are not gaining much.
Going forward, ES-Hadoop might expand the support it has for routing and apply it across all its documents (a global setting as opposed to per document).
But we're not there yet.