Can we create a node client with ES-Hadoop, or is the Transport client the only option?


(ted) #1

Hi,

I have gone through the documentation for https://www.elastic.co/guide/en/elasticsearch/hadoop/current/reference.html but could not see how to create a node-client.

Is TransportClient the only client available for use in Hadoop/Storm?

Thanks
Ted


(Costin Leau) #2

ES-Hadoop only relies on REST and does not use the Transport client.


(ted) #3

Thanks Costin.

Does that mean ES-Hadoop cannot be used to create a node-client?

Sorry, I am very new to this and appreciate your help.


(Costin Leau) #4

It means that ES-Hadoop is a REST client and does not use the ES Java client, thus there is no concept of a node or transport client.
In other words, no. And this is on purpose: if one creates a Spark job with 20 tasks, the cluster suddenly has 20 node clients joined to it for no real reason. Add another node and one ends up with 40.


(ted) #5

Hmm... I was hoping to use the node-client feature together with shard-specific routing so that our updates go directly to a primary shard. Our use case involves heavy batches being sent to ES for indexing, and a node-client option would have saved a network hop on the way to the primary.

If node-client was present, we could do:
storm-bolt ----directly-to------> ES primary
But without node-client, we have to:
storm-bolt ---------> Any ES node (could be replica) --------redirect-to-----> ES primary

We are guessing that a node-client would have been more efficient (although we don't yet have data to prove it).

As a last resort, we plan to find the primary ourselves (by running a background thread that tracks the cluster state) and always try to route to it.
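A minimal sketch of that "find the primary ourselves" idea, assuming the background thread polls the cluster's `_cat/shards` REST endpoint with `h=index,shard,prirep,state,ip,node`. The sample response, index name, IPs, and node names below are invented for illustration, and the function only parses text; it does no network I/O.

```python
from typing import Dict, Tuple

# Hypothetical response body of
#   GET /_cat/shards?h=index,shard,prirep,state,ip,node
# (index name, IPs and node names are invented for this sketch).
SAMPLE_CAT_SHARDS = """\
events 0 p STARTED 10.0.0.1 node-a
events 0 r STARTED 10.0.0.2 node-b
events 1 r STARTED 10.0.0.1 node-a
events 1 p STARTED 10.0.0.3 node-c
events 2 p INITIALIZING 10.0.0.2 node-b
"""


def primary_locations(cat_shards_text: str) -> Dict[Tuple[str, int], str]:
    """Map (index, shard number) -> IP of the node holding the started primary."""
    primaries: Dict[Tuple[str, int], str] = {}
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # blank line or unassigned shard (no ip/node columns)
        index, shard, prirep, state, ip = fields[:5]
        # Keep only primaries that are fully started; skip replicas and
        # shards that are still initializing or relocating.
        if prirep == "p" and state == "STARTED":
            primaries[(index, int(shard))] = ip
    return primaries


if __name__ == "__main__":
    for key, ip in sorted(primary_locations(SAMPLE_CAT_SHARDS).items()):
        print(key, "->", ip)
```

Primaries can move when nodes fail or shards are rebalanced, so a map like this goes stale and would need to be refreshed periodically.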

Note that we also plan to use the routing option so that each bolt does not have to send to many primary shards. (We never want to search by ID and are sure that we can balance the documents appropriately among shards.)
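For illustration, here is roughly what per-document routing looks like at the REST level: a sketch that builds an NDJSON body for the `_bulk` endpoint with a routing value on each action line. The helper name `bulk_body`, the index name, and the `customer` field are hypothetical; the `routing` key is the one used by recent Elasticsearch versions (older releases used `_routing`).

```python
import json
from typing import Iterable, Mapping


def bulk_body(index: str, docs: Iterable[Mapping], routing_field: str) -> str:
    """Build an NDJSON _bulk body with one routing value per document."""
    lines = []
    for doc in docs:
        # Action line: tells ES where to index and which routing value to hash.
        action = {"index": {"_index": index, "routing": str(doc[routing_field])}}
        lines.append(json.dumps(action, sort_keys=True))
        # Source line: the document itself.
        lines.append(json.dumps(doc, sort_keys=True))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    docs = [{"customer": "acme", "n": 1}, {"customer": "acme", "n": 2}]
    print(bulk_body("events", docs, "customer"), end="")
```

Documents that share a routing value land on the same shard, which is what keeps a batch from fanning out to many primaries. ES-Hadoop exposes per-document routing through its `es.mapping.routing` setting, so with the connector this body does not have to be built by hand.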


(Costin Leau) #6

A node client might or might not improve performance. As I've mentioned, a node client is actually part of the cluster and thus needs to be aware of its metadata, etc.
By simply talking to one of the cluster nodes, one delegates everything to the existing cluster; with the node client, one gains the ability to do the document splitting (working out which shard each document belongs to) on the client, at the expense of cluster state synchronization.
At a higher level, the work hasn't really disappeared; it has just moved from one machine to another. If your cluster is not changing a lot and you do a lot of document inserts, it might pay off; if not, you are not gaining much.
Going forward, ES-Hadoop might expand its routing support and apply it across all its documents (a global setting as opposed to per document).
But we're not there yet.
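The "document splitting" mentioned above can be sketched as follows. This is a simplified illustration of the idea (shard = hash(routing) modulo the number of primary shards), not Elasticsearch's actual implementation: CRC32 is used here purely as a stand-in hash, whereas Elasticsearch uses its own hash function (Murmur3-based in recent versions), so real shard numbers will differ. The function names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard as hash(routing) % number_of_primary_shards.

    ASSUMPTION: CRC32 is only a stand-in hash for illustration; it does not
    reproduce the shard numbers Elasticsearch itself would compute.
    """
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def group_by_shard(routings: Iterable[str], num_primary_shards: int) -> Dict[int, List[str]]:
    """Split routing values into per-shard batches on the client side."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for routing in routings:
        batches[shard_for(routing, num_primary_shards)].append(routing)
    return dict(batches)


if __name__ == "__main__":
    for shard, keys in sorted(group_by_shard(["acme", "globex", "acme"], 5).items()):
        print(shard, keys)
```

Whichever side runs this step, it also needs an up-to-date view of where each shard lives, which is the cluster state synchronization cost described above.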

