Proper understanding of different clients

Hi,

In my first post to this list, I'd like to start a brief discussion
where I can make sure I'm properly understanding the way the clients
for Elasticsearch (ES) work (consider just one single cluster). When I
say clients, I refer to the three types I've seen so far:

  • Node:
    When creating this client from our application, our application
    becomes a node of the ES cluster, but it doesn't store any data; i.e.
    my application will execute indexing/searching/etc processes, but the
    data will actually be in some other node (which in turn can be in the
    same machine/same JVM, same machine/different JVM, different machine/
    different JVM).
  • Local:
    When creating this client from our application, our application
    behaves like a node and also stores data locally.
  • Transport:
    When we create this type of client, we're just getting sort of a
    pointer to ES cluster, but we are not serving as a node whatsoever. We
    can index/search/etc, but our application is not part of the process,
    it just delegates the operations to some other node.

If all my assumptions are correct, the typical scenarios for each type
of client would be:

  • Node or Local: Imagine an application that basically scrapes text
    from web pages and indexes it. I could have this application distributed
    in several JVMs (no matter whether they are on the same or different
    machines), and then make each JVM have a node client, so my
    application as a whole serves as the indexer. Then I could create
    another node client (or nodes) to perform the searches.
  • Transport: This is actually what I'm looking for. My real scenario
    is the following: I have a web site with a general directory of shops.
    I want to provide search across all the products
    of all the shops, classified by category, price, the shop they belong to,
    etc. I get the product information through an HTTP request in CSV
    format, so basically I want my application to read the CSV file, parse
    it, and then index each and every product it finds. However, for
    performance reasons, I don't want my application to be a node or the
    ES cluster itself; I'd rather have another JVM running the ES cluster.
    So I'd have one JVM for my web application, and I would delegate the
    indexing to the other JVM running the ES cluster using the transport
    client. That way my application would neither be impacted by the
    indexing/searching process nor have to store the product information.

What do you think? Is my preferred scenario feasible with the
transport client? If not, which client should I use then? Why?

Thanks a lot for your support and thanks to ES team for such an
awesome tool.

Hi,

On Fri, Jan 28, 2011 at 4:34 PM, Enrique Medina e.medina.m@gmail.com wrote:

Hi,

In my first post to this list, I'd like to start a brief discussion
where I can make sure I'm properly understanding the way the clients
for Elasticsearch (ES) work (consider just one single cluster). When I

Actually, to make this crystal clear: every client currently implemented in
ES can connect to only one cluster at a time.

say clients, I refer to the three types I've seen so far:

  • Node:
    When creating this client from our application, our application
    becomes a node of the ES cluster, but it doesn't store any data; i.e.
    my application will execute indexing/searching/etc processes, but the
    data will actually be in some other node (which in turn can be in the
    same machine/same JVM, same machine/different JVM, different machine/
    different JVM).

Only the node becomes part of the cluster (not your application). However,
keep in mind that it can still affect your application's memory and file
system requirements, especially if shards are allocated on this node. You
can configure whether the node stores data: node.data is true by default,
so you have to set it to false if you do not want any shards or replicas
allocated to this node. See
http://www.elasticsearch.com/docs/elasticsearch/modules/node/data_node/ for
details about what "storing data locally" means in this context.

  • Local:
    When creating this client from our application, our application
    behaves like a node and also stores data locally

If you mean a local node in the sense described here:
http://www.elasticsearch.com/docs/elasticsearch/java_api/client/ (i.e.
created with nodeBuilder().local(true).node(), for example), then it has
nothing to do with whether data is stored on it. It just means that the node
assumes other nodes can be discovered (and communicated with) within the
same JVM. I do not think this is recommended for production use (it is
mainly used for unit tests). In production it is better to run one node per
JVM (especially if data can be stored on the node), and even better one node
per physical machine (for better safety if a machine crashes).

  • Transport:
    When we create this type of client, we're just getting sort of a
    pointer to ES cluster, but we are not serving as a node whatsoever. We
    can index/search/etc, but our application is not part of the process,
    it just delegates the operations to some other node.

Yes.

You can also use the REST API in the same way.

If all my assumptions are correct, the typical scenarios for each type
of client would be:

  • Node or Local: Imagine an application that basically scrapes text
    from web pages and indexes it. I could have this application distributed
    in several JVMs (no matter whether they are on the same or different
    machines), and then make each JVM have a node client, so my
    application as a whole serves as the indexer. Then I could create
    another node client (or nodes) to perform the searches.

Yes. (Just note that a local node does not make sense here, as explained
above.) Also, I assume you mean a node that is preferably configured using
node.client(true).

  • Transport: This is actually what I'm looking for. My real scenario
    is the following: I have a web site with a general directory of shops.
    I want to provide search across all the products
    of all the shops, classified by category, price, the shop they belong to,
    etc. I get the product information through an HTTP request in CSV
    format, so basically I want my application to read the CSV file, parse
    it, and then index each and every product it finds. However, for
    performance reasons, I don't want my application to be a node or the
    ES cluster itself; I'd rather have another JVM running the ES cluster.
    So I'd have one JVM for my web application, and I would delegate the
    indexing to the other JVM running the ES cluster using the transport
    client. That way my application would neither be impacted by the
    indexing/searching process nor have to store the product information.

What do you think? Is my preferred scenario feasible with the
transport client? If not, which client should I use then? Why?

Definitely, it is feasible. You might also be interested in the bulk API:
http://www.elasticsearch.com/docs/elasticsearch/rest_api/bulk/ (it is
available in the Java API as well).
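
For illustration, here is a minimal sketch of how CSV rows might be turned
into a bulk request body, assuming the newline-delimited format of the REST
bulk API (one action line followed by one source line per document). The
class name BulkBodySketch, the index and type names (shops, product), and the
id,name,price column layout are all made up for this example, and the CSV
handling is deliberately naive (no quoted fields).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds a newline-delimited bulk body from CSV rows.
final class BulkBodySketch {

    // Minimal JSON string escaping: backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Turns one CSV row "id,name,price" (assumed layout) into an action line
    // plus a source line, each terminated by a newline.
    static String toBulkLines(String index, String type, String csvLine) {
        String[] f = csvLine.split(",", -1);
        if (f.length != 3) {
            throw new IllegalArgumentException("expected id,name,price but got: " + csvLine);
        }
        String action = "{\"index\":{\"_index\":\"" + escape(index)
                + "\",\"_type\":\"" + escape(type)
                + "\",\"_id\":\"" + escape(f[0]) + "\"}}";
        String source = "{\"name\":\"" + escape(f[1])
                + "\",\"price\":" + Double.parseDouble(f[2]) + "}";
        return action + "\n" + source + "\n";
    }

    // Concatenates the lines for every row into a single request body.
    static String buildBody(String index, String type, List<String> csvLines) {
        StringBuilder body = new StringBuilder();
        for (String line : csvLines) {
            body.append(toBulkLines(index, type, line));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1,Red mug,4.5", "2,Blue mug,5.0");
        System.out.print(buildBody("shops", "product", rows));
    }
}
```

The resulting string would be sent as the body of a single request to the
_bulk endpoint; the Java API offers an equivalent bulk request, which is not
shown here.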

Thanks a lot for your support and thanks to ES team for such an
awesome tool.

Regards,
Lukas

Thanks a lot for such a detailed answer Lukáš.

I'm definitely going to take a look at the Bulk API as well, together with
the Transport mode. What would the limits of the Bulk API be? If I
understood the Bulk API properly, I could create a JSON doc for each
product from my CSV file that needs to be indexed and send them all together
to ES, instead of sending them one by one. But then, what if my CSV file has,
let's say, more than 1GB of products?

You will need to paginate the bulk requests. Say, 100 at a time.
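
A rough sketch of that pagination, assuming the CSV is read line by line so
that even a very large file never has to sit in memory at once. The class
name BulkBatchingSketch and the sender callback are hypothetical; in a real
application the callback would build and send one bulk request per batch
through whichever client you use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: streams CSV lines and hands them over in fixed-size batches.
final class BulkBatchingSketch {

    // Reads lines lazily and passes them to 'sender' in batches of at most
    // batchSize lines. Returns the number of batches handed over.
    static int forEachBatch(Reader csv, int batchSize, Consumer<List<String>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batch = new ArrayList<>(batchSize);
        int batches = 0;
        try (BufferedReader reader = new BufferedReader(csv)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                batch.add(line);
                if (batch.size() == batchSize) {
                    sender.accept(new ArrayList<>(batch)); // copy, then reuse the buffer
                    batch.clear();
                    batches++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!batch.isEmpty()) { // final, possibly smaller batch
            sender.accept(new ArrayList<>(batch));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        StringBuilder csv = new StringBuilder();
        for (int i = 1; i <= 250; i++) {
            csv.append(i).append(",product ").append(i).append(",9.99\n");
        }
        int sent = forEachBatch(new StringReader(csv.toString()), 100,
                batch -> System.out.println("one bulk request with " + batch.size() + " products"));
        System.out.println(sent + " bulk requests in total");
    }
}
```

The batch size of 100 is just the figure suggested above; the right value
depends on document size and cluster capacity and is best found by testing.
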
On Monday, January 31, 2011 at 11:48 AM, Enrique Medina Montenegro wrote:

Thanks a lot for such a detailed answer Lukáš.

I'm definitely going to take a look at the Bulk API as well, together with the Transport mode. What would the limits of the Bulk API be? If I understood the Bulk API properly, I could create a JSON doc for each product from my CSV file that needs to be indexed and send them all together to ES, instead of sending them one by one. But then, what if my CSV file has, let's say, more than 1GB of products?