Controlling the ES Java client's internal connection pool

We use the ES Java client within Twitter Storm (a distributed processing
system for streaming data) to perform searches based on data coming through
the stream. Each JVM in this distributed cluster has multiple threads using
multiple ES client instances (it's a highly parallel system). But each ES
client instance uses Netty underneath to open its own set of connections,
and the number of connections it opens is not controllable, at least as far
as I can see. Since I am providing my own parallelism by running this in
Storm, I don't need Netty to manage that for me.

a) Is there a way I can control the number of connections each ES client
instance opens?
b) I can route specific routing keys to specific client instances. Is it
possible to have a client instance talk only to the node that's
responsible for a given routing key?

What I need is a routing-aware dumb client. It's smart in that it knows where
the data is and can connect to that node, and only that node, at startup.
But it should also be dumb in the sense that it should not provide any
parallelism or async processing, or open multiple connections.

Any ideas? Let me know if you need further clarifications.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

IMHO you should have only one client per JVM not per Thread.
So you should avoid this: "using multiple es client instances"

My 0.05 cents.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
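
The "one client per JVM" advice above can be sketched as a holder that hands every worker thread the same instance. This is only an illustration of the sharing pattern: SearchClient is a hypothetical stand-in, not the ES client API, and it assumes (as the advice implies) that the real client is safe to share between threads.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the ES client; not part of any real library.
final class SearchClient {
    String search(String query) {
        return "results for " + query;
    }
}

// Initialization-on-demand holder: the JVM creates INSTANCE once, lazily and
// thread-safely, the first time SharedClient.get() is called.
final class SharedClient {
    private SharedClient() {}

    private static final class Holder {
        static final SearchClient INSTANCE = new SearchClient();
    }

    static SearchClient get() {
        return Holder.INSTANCE;
    }
}

final class SharedClientDemo {
    // Runs `threads` workers and returns the distinct client instances they saw.
    static Set<SearchClient> clientsSeenBy(int threads) {
        Set<SearchClient> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> seen.add(SharedClient.get()));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("distinct clients: " + clientsSeenBy(8).size());
    }
}
```

In a Storm topology, each task would call SharedClient.get() when it is prepared instead of constructing its own client, so the number of client instances per JVM stays at one regardless of how many threads or tasks are scheduled there.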

a) yes
b) no, since you cannot delegate the index shard distribution to Twitter
Storm.

Did you use TransportClient? Did you disable sniff mode in TransportClient?

Jörg

Sorry, my message did not go to the list, just to you.

a) If yes, how?
b) I'm not trying to delegate shard distribution to Storm. Rather, I'm trying
to glean knowledge of where shards are located and then assign responsibility
for talking to a given shard to one particular Storm worker thread.

I tried both the Node and Transport clients. If sniff is turned off, then that
particular client will not be aware of the other nodes in the cluster and will
not know to reconnect to another node if a node is down. But turning sniff
off also does not mean it will create only one connection to that node; it
just means it will connect to only one node (maybe with multiple
connections). Am I wrong?

Sorry, my reply did not go to the list, just to you.

That's not how Storm works. Storm provides a level of parallelization for
processing beyond a single JVM, across multiple hosts, and within a JVM it
allows for tuning the number of threads and the number of task instances that
are available. So I can have 64 threads working on 128 searcher instances
within a cluster. If a cluster has 16 JVMs, then I can distribute these
64 threads / 128 instances throughout the 16 JVMs. Having more threads
than search instances allows me to tune how to control blocking, when to
time out and retry, and all these details at a much more granular level.

When there's a heavy weight storm client underneath all of this trying to
circumvent all the efforts made by Storm to parallelize these tasks (by
managing its own connections), it kind of defeats the purpose of using Storm.
It sounds like the standard ES client is made for use in a standard
web application scenario and does not really provide any ability to tune how
its internals work beyond that.

Meant to say "heavy weight ES client"

Sorry, my reply did not go to the list, just to you.

That's not how Storm works. Storm provides a level of parallelization for
processing beyond a single JVM, across multiple hosts, and within a JVM it
allows for tuning the number of threads and the number of task instances that
are available. So I can have 64 threads working on 128 searcher instances
within a cluster. If a cluster has 16 JVMs, then I can distribute these
64 threads / 128 instances throughout the 16 JVMs. Having more threads
than search instances allows me to tune how to control blocking, and
additionally Storm provides a way to control and tune a lot of things
(timeouts, retries, etc.) at a much more granular level across the whole
distributed system, not just within a single JVM.

When there's a heavy weight ES client underneath all of this trying to
circumvent all the efforts made by Storm to parallelize these tasks (by
managing its own connections), it kind of defeats the purpose of using Storm.
It sounds like the standard ES client is made for use in a standard
web application scenario and does not really provide any ability to tune how
its internals work beyond that.

You cannot "talk to a shard"; this is not how ES works. You can talk to
nodes. With a client, you talk to a data node, and this data node holds all
the information about the whole cluster, including indices and their
shards, routing, mappings, aliases, and so on, so it has full control over
the actions it executes on behalf of a client. Each node can hold many
shards of many indexes.

If a client wants to address one node for indexing, it can augment a single
document with custom routing information. ES will then hash the doc routing
info and pick the node where the shard of the doc is situated. By knowing
the ES hashing function you could precompute the shard number, but the ES
hashing is not exposed in the public API (for good reason).

Furthermore, even if the shard number is known, you can't always limit the
request to exactly one node. If you have replica shards, more than one node
is involved in an execution. You could consider executing actions only on a
primary shard, for example, and dropping the extra safety of replica shards,
which I do not recommend. Shards are also not tied to nodes; they can move
from node to node, controlled by a shard allocation algorithm. So a client
can't access a shard on a node. Only data nodes can: they have access to
the cluster state, and they execute the action on behalf of the client.
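
The routing step described above (hash the routing value, then pick a shard) can be sketched roughly as follows. This is a minimal illustration under loud assumptions: String.hashCode() stands in for the internal ES hash, which is not part of the public API, and the shard-to-node table is a made-up snapshot, so none of the numbers will match a real cluster.

```java
import java.util.HashMap;
import java.util.Map;

final class RoutingSketch {
    // Maps a routing key to a shard number in [0, numShards).
    // String.hashCode() is only a stand-in: ES uses its own internal hash,
    // so these numbers will NOT match what a real cluster computes.
    static int shardFor(String routingKey, int numShards) {
        return Math.floorMod(routingKey.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 5;
        // Hypothetical snapshot of which node holds each primary shard.
        // In a real cluster this table changes whenever shards relocate.
        Map<Integer, String> primaryNode = new HashMap<>();
        for (int shard = 0; shard < numShards; shard++) {
            primaryNode.put(shard, "node-" + (shard % 3));
        }
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            int shard = shardFor(key, numShards);
            System.out.println(key + " -> shard " + shard + " on " + primaryNode.get(shard));
        }
    }
}
```

The sketch also shows why a client pinned to "the node for a key" is fragile: the key-to-shard step is stable, but the shard-to-node table is not, and with replicas there is more than one valid node per shard.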

Disabling sniff means also disabling failover for this client. Failover
only works when connections to more than one node are established.

Check https://groups.google.com/forum/#!topic/elasticsearch/7-ob2IeYnMI

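For minimum connections of a TransportClient, you could try these settings:

import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;

// Setting names are as posted in this thread; check them against the
// transport source of your ES version before relying on them.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "elasticsearch")
        .put("network.server", false)
        .put("node.client", true)
        .put("client.transport.sniff", false)
        .put("transport.connections_per_node.low", 0)
        .put("transport.connections_per_node.medium", 0)
        .put("transport.connections_per_node.high", 1)
        .build();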

Jörg

Thanks, Jörg. I was looking for Transport Client settings similar to this.
I couldn't find any in the docs.

I'm a little confused by the settings, though. I thought you only specify the
cluster name for the Node client, and then the client looks up the host names
to connect to from elasticsearch.yml. Does the Transport client also look up
hosts from the YAML file, or do they need to be provided by calling
addTransportAddress(..) on TransportClient?

How does the Transport client actually work if it is given multiple hosts via
addTransportAddress but connections are limited to 0/0/1 as in the above
settings? Does it connect to one of them randomly and, if that node goes
down, connect to a different one from the list? I'd like to have failover
via reconnecting to a new node, but only maintain one connection.

I understand that you cannot talk directly to a shard, but you can query
with a routing key, and if you're using the node client it will intelligently
connect to the right node with the right shard. That's the benefit of the
node client. I just want to make this sticky: one dedicated client per shard
(or a randomly selected replica of that shard).

TransportClient looks only for nodes on the addresses provided by
addTransportAddress(); there are no configured nodes.

If a TransportClient opens connections to many nodes, the setting will work
per connected node.
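
Since the setting applies per connected node, the connection count is simple arithmetic. The sketch below assumes each category opens exactly the configured number of connections to every connected node; the cluster and JVM sizes are hypothetical.

```java
final class ConnectionBudget {
    // Connections one client opens: the per-node total times the number of
    // nodes the client is connected to.
    static int perClient(int nodes, int low, int medium, int high) {
        return nodes * (low + medium + high);
    }

    // Connections opened by one JVM that holds several client instances.
    static int perJvm(int clients, int nodes, int low, int medium, int high) {
        return clients * perClient(nodes, low, medium, high);
    }

    public static void main(String[] args) {
        // Hypothetical sizing: 8 client instances in a JVM, 3 connected nodes,
        // and the 0/0/1 settings from earlier in the thread.
        System.out.println(perJvm(8, 3, 0, 0, 1)); // prints 24
        // Same settings with a single shared client.
        System.out.println(perJvm(1, 3, 0, 0, 1)); // prints 3
    }
}
```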

NodeClient and TransportClient are quite the same, they use the same
codebase. A NodeClient uses opaque lookup over the network for discovery
whereas the TransportClient can discover clusters from a remote network
address by filtering the network interface. There is a small convenience
benefit of NodeClient. It is easier to use a node client if the client
resides on the same JVM as a data node because no network transport is
required in that case.

Sorry to disappoint you but I'm afraid there is no way to "stick" a client
to a single shard as long as you have many nodes, many indexes, and many
shards. The only method that comes to my mind is to reduce a cluster to a
trivial single node / single shard setup.

Jörg

Wow... if that is true, then I guess the documentation and what ES folks
teach at their training is wrong, or at least misleading. Their insistence
on preferring node client over transport client was solely based on the
fact that it's a smarter client.

From the documentation -

"The benefit of using the Client is the fact that operations are
automatically routed to the node(s) the operations need to be executed on,
without performing a “double hop”. For example, the index operation will
automatically be executed on the shard that it will end up existing at."

Is that wrong?

It is indeed a sad day if we don't allow for smarter clients that are
capable of connecting to the right node to perform an operation in the
most efficient way. I understand that this is tricky, and that the number of
nodes, the indices, and the shards can change (as nodes come online or go
offline, as indices come and go, as replicas are added and removed, etc.).
But there's no reason why it cannot be done (even if it's not available or
exposed now). As information changes, it can be communicated to clients, and
clients can adjust or reconnect as necessary.

A lot of distributed systems vendors have come to the realization that many
problems inherent in distributed systems can be solved by building
smarter clients. This is evident from talks given at conferences on
distributed systems. Hopefully, the ES team looks at it the same way. Surely
something like this can be built in a clean way, without hindering other
features. The main reason why I like Elasticsearch over Solr is that the ES
folks (Shay in particular) seem to be doing everything right in terms of
putting the right level of intelligence in the right place. A lot of thought
has been put into building ES in a way that is very nuanced but also
complete in terms of the big picture.

And I hope a smarter client is one of the things on their roadmap.

Thank you for shedding light on this, Jörg.

The "double hop" is just another word for the extra network transport step
between cluster and TransportClient. The request and the response need to
travel over the wire to the requesting client. NodeClient does not require
this network transport - this may be a benefit. For NodeClient and
TransportClient, operations are automatically routed, there is no different
behavior.

To me, because I am only using ES from remote JVMs, TransportClient is
preferable, while others prefer NodeClient and work locally. It is a matter
of taste in the end.

What I don't understand is the motivation for shard-based access. It is
quite unusual because the combination of distributed shards into an index
and performing search on index level is one of the main features and one of
the strengths of ES.

Many features of ES rely on strict index-level access, like searching. If
shard-exclusive access was allowed, you would have to reconsider the search
operation for single shards as well. These shard operations do exist in
the ES codebase, but they are low-level and hidden from the public API
because they are sensitive to touch (and the Lucene API fluctuates from
version to version in that area).

On the shard level, clients are able to access the Lucene index engine
directly, but are not able to access the ES index API level with all the
mapping information of fields in documents, aliases, etc.

A possible method to expose the inner shard-based functions is passing
parameters to custom implementations of index actions: for example,
bypassing the internal shard number calculation by controlling this number
with a parameter passed from outside via the public API. This may conflict
with the current concept of doc routing, so implementing a new, simplified
index action could be required. I think it is possible to implement an ES
plugin for this kind of experiment.

Jörg

peter@vagaband.co wrote:

Wow... if that is true, then I guess the documentation and what
ES folks teach at their training is wrong, or at least
misleading. Their insistence on preferring node client over
transport client was solely based on the fact that it's a
smarter client.

[...]

It is indeed a sad day if we don't allow for smarter client
which are capable of connecting to the right node to perform
that operation in the most efficient way.

I didn't exactly follow Jörg, but let me clarify.

The node client is smarter than the transport client in the way
you first thought. It's called the "node" client because it's a
mostly functioning node in the cluster. The only exceptions are
that it can't be elected master and it can't store data. It can
do everything else, namely, it can target the exact data nodes
required for an operation. No need to involve a random other data
node.

The transport client is less intelligent. It can't do anything
until it establishes a connection with a node in the cluster,
whether that's a client or data node. It can, however, sniff the
nodes in the cluster to maximize its chances at finding someone to
talk to, but it doesn't have any benefits over connecting to a
node over HTTP in regard to shard-level operations.

A design I like to recommend for those having network
parallelization issues, which might be the case in your Storm
environment, is to run a node client on the same machine as your
application nodes and then talk to that via localhost with either
the transport client or through HTTP. The transport client will
give you the best perf but HTTP will give you more flexibility by
not tying your development cycles to ES versioning. Regardless of
which you choose to talk to the local client node, you're
guaranteed that ES traffic leaving your application box will
target the correct data node(s), thus minimizing unneeded traffic.

Drew
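
The HTTP variant of the design above can be sketched as building a routed search URL against a client node on localhost. This is a minimal sketch, not a full client: the index name "tweets" and the routing key are hypothetical, 9200 is assumed to be the local node's HTTP port, and actually sending the request is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class LocalSearchUrl {
    // 9200 is the usual ES HTTP port; adjust if the local node uses another.
    private static final String LOCAL_NODE = "http://localhost:9200";

    // Builds the search URL for one index, restricted by a routing key.
    static String forRoutedSearch(String index, String routingKey) {
        try {
            return LOCAL_NODE + "/" + URLEncoder.encode(index, "UTF-8")
                    + "/_search?routing=" + URLEncoder.encode(routingKey, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forRoutedSearch("tweets", "user 1"));
    }
}
```

Because the application only ever talks to localhost, the routing decision is made by the local client node, which holds the cluster state and forwards the request to the data node(s) that own the shard.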

On 14 Sept. 2013 at 23:24, "joergprante@gmail.com" joergprante@gmail.com wrote:

TransportClient looks only for nodes on the addresses provided by addTransportAddress(), there are no configured nodes.

This is true if you don't enable sniff (default).

Here's what the doc says:

The client allows to sniff the rest of the cluster, and add those into its list of machines to use. In this case, note that the ip addresses used will be the ones that the other nodes were started with (the “publish” address). In order to enable it, set the client.transport.sniff to true:

Settings settings = ImmutableSettings.settingsBuilder()
        .put("client.transport.sniff", true).build();
TransportClient client = new TransportClient(settings);

Best

David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 14 Sept. 2013 at 21:26, "peter@vagaband.co" peter@vagaband.co wrote:

Thanks, Jörg. I was looking for Transport Client settings similar to this. I couldn't find any in the docs.

I guess it's something we should add to the docs. I will look at the missing settings.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Sorry for the delayed reply. Just getting back from vacation.

I'm very happy to hear that, Drew. So node client is indeed what I want.

I don't think running a separate node client on the same machine as the Storm
worker nodes is a good idea architecturally or operationally. Doing that
will involve opening more sockets to communicate back and forth and
therefore introduce an additional level of blocking I/O. On top of that,
there is zero control over the level of parallelization that is possible, so
it doesn't really solve the problem; it just moves it. Plus, it increases
operational complexity by an order of magnitude by adding another system to
maintain in parallel. Nevertheless, I appreciate the suggestion.

While the node client looks good for use within a typical application, for
use within distributed processing systems I hope it can be improved in the
future with a few more tuning parameters, for example by making the level of
parallelization it uses internally with Netty (number of connections, etc.)
configurable. Any thoughts on that?
