Proper understanding of different clients

Enrique_Medina_Monte · January 28, 2011, 3:34pm

Hi,

In my first post to this list, I'd like to start a brief discussion
where I can make sure I'm properly understanding the way the clients
for Elasticsearch (ES) work (consider just one single cluster). When I
say clients, I refer to the three types I've seen so far:

Node:
When creating this client from our application, our application
becomes a node of the ES cluster, but it doesn't store any data; i.e.
my application will execute indexing/searching/etc processes, but the
data will actually be in some other node (which in turn can be in the
same machine/same JVM, same machine/different JVM, different machine/
different JVM).
Local:
When creating this client from our application, our application
behaves like a node and also stores data locally.
Transport:
When we create this type of client, we're just getting sort of a
pointer to ES cluster, but we are not serving as a node whatsoever. We
can index/search/etc, but our application is not part of the process,
it just delegates the operations to some other node.

If all my assumptions are correct, the typical scenarios for each type
of client would be:

Node or Local: Imagine an application that basically scrapes text
from web pages and indexes it. I could have this application distributed
in several JVMs (no matter whether they are in the same or different
machines), and then make each JVM have a node client, so all my
application as a whole serves as the indexer. Then I could create
another node client (or nodes), to perform the searches.
Transport: This is actually what I'm looking for. My real scenario
is the following: I have a web site that has a directory of shops in
general. I want to provide a search functionality of all the products
of all the shops classified by category, price, shops they belong to,
etc. I get the products information through an HTTP request in a CSV
format, so basically I want my application to read the CSV file, parse
it, and then index each and every product it finds. However, for
performance reasons, I don't want my application to be a node or the
ES cluster itself, but have another JVM running with the ES cluster.
So I'd have one JVM for my web application, then I would delegate the
indexing to the other JVM running the ES cluster using the transport
client. That way my application would neither be impacted by indexing/
searching process nor would have to store the product's information.

What do you think? Is my preferred scenario feasible with the
transport client? If not, which client should I use then? Why?

Thanks a lot for your support and thanks to ES team for such an
awesome tool.

Lukas_Vlcek1 · January 28, 2011, 7:49pm

Hi,

On Fri, Jan 28, 2011 at 4:34 PM, Enrique Medina e.medina.m@gmail.com wrote:

Hi,

In my first post to this list, I'd like to start a brief discussion
where I can make sure I'm properly understanding the way the clients
for Elasticsearch (ES) work (consider just one single cluster). When I

Actually, to make this crystal clear, every client currently implemented in ES
can connect to only one cluster at a time.

say clients, I refer to the three types I've seen so far:

Node:
When creating this client from our application, our application
becomes a node of the ES cluster, but it doesn't store any data; i.e.
my application will execute indexing/searching/etc processes, but the
data will actually be in some other node (which in turn can be in the
same machine/same JVM, same machine/different JVM, different machine/
different JVM).

Only the Node becomes part of the cluster (not your application). However,
keep in mind that this can impact your application in terms of memory and
file system requirements, especially if shards are allocated on this node.
You can configure whether the node stores data (node.data
is set to true by default, so you have to set it to false if you do not want
this node to be allocated any data shards and replicas). See
http://www.elasticsearch.com/docs/elasticsearch/modules/node/data_node/ for
details about what "storing data locally" means in this context.

Local:
When creating this client from our application, our application
behaves like being a node plus it stores data as well locally

If you mean a local Node in the sense described here:
http://www.elasticsearch.com/docs/elasticsearch/java_api/client/ (i.e.
created like nodeBuilder().local(true).node() for example), then it has
nothing to do with whether data is stored on it. It just means that it
assumes other nodes can be discovered (and communicated with) within the
same JVM. I do not think this is recommended for production usage
(it is mainly used for unit tests). In production it is better to run one
node per JVM (especially if data can be stored on this node) and, even
better, one node per physical machine (for better safety if a machine
crashes).

Transport:
When we create this type of client, we're just getting sort of a
pointer to ES cluster, but we are not serving as a node whatsoever. We
can index/search/etc, but our application is not part of the process,
it just delegates the operations to some other node.

Yes.

You can also use the REST API in the same way.

If all my assumptions are correct, the typical scenarios for each type
of client would be:

Node or Local: Imagine an application that basically scrapes text
from web pages and indexes it. I could have this application distributed
in several JVMs (no matter whether they are in the same or different
machines), and then make each JVM have a node client, so all my
application as a whole serves as the indexer. Then I could create
another node client (or nodes), to perform the searches.

Yes (just note that a local node does not make sense here, as explained above).
Also, I assume you mean a node that is preferably configured using
node.client(true).

Transport: This is actually what I'm looking for. My real scenario
is the following: I have a web site that has a directory of shops in
general. I want to provide a search functionality of all the products
of all the shops classified by category, price, shops they belong to,
etc. I get the products information through an HTTP request in a CSV
format, so basically I want my application to read the CSV file, parse
it, and then index each and every product it finds. However, for
performance reasons, I don't want my application to be a node or the
ES cluster itself, but have another JVM running with the ES cluster.
So I'd have one JVM for my web application, then I would delegate the
indexing to the other JVM running the ES cluster using the transport
client. That way my application would neither be impacted by indexing/
searching process nor would have to store the product's information.

What do you think? Is my preferred scenario feasible with the
transport client? If not, which client should I use then? Why?

Definitely, it is feasible. You might also be interested in looking at the bulk
API http://www.elasticsearch.com/docs/elasticsearch/rest_api/bulk/ (it is
available in the Java API as well).

Thanks a lot for your support and thanks to ES team for such an
awesome tool.

Regards,
Lukas

Enrique_Medina_Monte · January 31, 2011, 9:48am

Thanks a lot for such a detailed answer Lukáš.

I'm definitely going to take a look at the Bulk API as well, together with
the Transport mode. What would the limits of the Bulk API be? If I
understood the Bulk API properly, I could create a JSON doc for each
product from my CSV file that needs to be indexed and send them all together
to ES, instead of sending them one by one. But then, what if my CSV file has,
let's say, more than 1GB of products?

kimchy · January 31, 2011, 8:04pm

You will need to paginate the bulk requests. Say, 100 at a time.
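
As a rough illustration of that pagination on the application side, here is a minimal Java sketch. It only shows the batching: it streams the CSV line by line and hands fixed-size batches to a callback, so a very large file never has to be held in memory at once. The class name `BulkBatcher`, the method `forEachBatch` and the `sink` callback are hypothetical names made up for this example, not part of the ES Java API. Building and sending the actual bulk request for each batch is left to the callback and is not shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not an ES class): groups CSV lines into fixed-size batches.
public class BulkBatcher {

    /**
     * Reads lines one at a time and passes them to the sink in batches of at
     * most batchSize lines. Only one batch is held in memory at any moment.
     * Returns the number of batches handed to the sink.
     */
    public static int forEachBatch(BufferedReader reader, int batchSize,
                                   Consumer<List<String>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batch = new ArrayList<>(batchSize);
        int batches = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                batch.add(line);
                if (batch.size() == batchSize) {
                    sink.accept(batch); // one bulk request would be sent here
                    batches++;
                    batch = new ArrayList<>(batchSize);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // final, possibly smaller, batch
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in for the real CSV download: 250 made-up product rows.
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < 250; i++) {
            csv.append("product-").append(i).append(",9.99\n");
        }
        BufferedReader reader = new BufferedReader(new StringReader(csv.toString()));
        // Placeholder sink: a real one would index the batch through the bulk API.
        int sent = forEachBatch(reader, 100,
                batch -> System.out.println("would send " + batch.size() + " documents"));
        System.out.println(sent + " bulk requests");
    }
}
```

With 250 rows and a batch size of 100, the callback is invoked three times (100, 100 and 50 lines). The batch size of 100 is just the figure suggested above, and the right value for a given cluster is something to find by experiment.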
