A few queries related to ElasticSearch

Hi
I would appreciate it if any of you could share your experience or thoughts on
the questions below:

  1. REST API vs Java API: I have read that the Java API is much faster because it
    works over a lower-level protocol. Do you have any comparison?
  2. What approach do you suggest for the following use case?
    I plan to index a stream of short messages (like tweets) into ES,
    and I don't want to keep data that is more than a month old. How do I flush it?
  3. If I create three indices, say a, b and c, how do I tell ES to search across all
    of them?
  4. ES seems to have good shard support. Is there a way to control these
    shards based on capacity?

Thanks in advance for your help.

Regards
Gautam

Hi,

Let me try to answer inline.

On Mon, Nov 29, 2010 at 12:08 PM, Gautam Mr mrgautamsam@gmail.com wrote:

Hi,
I would appreciate it if any of you could share your experience or thoughts on
the questions below:

  1. REST API vs Java API: I have read that the Java API is much faster because it
    works over a lower-level protocol. Do you have any comparison?

It depends on what exactly you measure and also on your use case. First, as
of this writing the REST API can be used via the HTTP
(http://www.elasticsearch.com/docs/elasticsearch/modules/http/) or Memcached
(http://www.elasticsearch.com/docs/elasticsearch/modules/memcached/) protocols.
The Memcached protocol should be faster than HTTP (and it also has some minor
downsides), but this depends on your client implementation (e.g. the client
may be using a slow HTTP client module under the hood). With the Java API
there are different options
(http://www.elasticsearch.com/docs/elasticsearch/java_api/client/):
TransportClient or NodeClient. TransportClient is slower than NodeClient;
however, NodeClient joins the cluster directly while TransportClient does
not. Both Java clients use an optimized binary protocol, so they are faster
than the HTTP and Memcached protocols.

  2. What approach do you suggest for the following use case?
    I plan to index a stream of short messages (like tweets) into ES,
    and I don't want to keep data that is more than a month old. How do I flush it?

You can partition your data by week (or day, hour, etc., you name it) and
index each bucket into its own index. You will end up with several indices
such as twitter-ww31, twitter-ww32, twitter-ww33 (...). Then you can search
across multiple indices
(http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/indices_types/)
and drop old indices (see index delete:
http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/delete_index/).
Also note that each index can have aliases
(http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/aliases/),
so you can search against a single "index alias" that spans multiple indices
automatically.
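
To make the weekly-index idea concrete, here is a minimal sketch in Python that only computes index names and the delete requests you would send; it does not talk to a cluster. The function names, the five-week retention window and the inclusion of the ISO year in the index name (to avoid week numbers colliding across years) are my own assumptions, not something ES prescribes. The only ES behaviour it relies on is that an index is dropped with DELETE /<index>.

```python
from datetime import date, timedelta


def weekly_index_name(day, prefix="twitter"):
    """Name of the index holding documents for the ISO week of `day`."""
    iso_year, iso_week, _ = day.isocalendar()
    return "%s-%d-ww%02d" % (prefix, iso_year, iso_week)


def indices_to_keep(today, weeks=5, prefix="twitter"):
    """Index names for the current week and the (weeks - 1) weeks before it."""
    return [weekly_index_name(today - timedelta(weeks=i), prefix)
            for i in range(weeks)]


def expired_indices(existing, today, weeks=5, prefix="twitter"):
    """Weekly indices with this prefix that fall outside the retention window."""
    keep = set(indices_to_keep(today, weeks, prefix))
    return sorted(name for name in existing
                  if name.startswith(prefix + "-") and name not in keep)


def delete_requests(existing, today, weeks=5, prefix="twitter"):
    """(method, path) pairs to send; an index is dropped with DELETE /<index>."""
    return [("DELETE", "/" + name)
            for name in expired_indices(existing, today, weeks, prefix)]


if __name__ == "__main__":
    today = date(2010, 11, 29)
    # Hypothetical list of indices currently present in the cluster.
    existing = ["twitter-2010-ww40", "twitter-2010-ww44",
                "twitter-2010-ww48", "other-index"]
    print(weekly_index_name(today))
    for method, path in delete_requests(existing, today):
        print(method, path)
```

Dropping a whole index this way is much cheaper than deleting individual old documents, which is the main reason to bucket the data by time in the first place.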

  3. If I create three indices, say a, b and c, how do I tell ES to search across all
    of them?

See index aliases
(http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/aliases/).

  4. ES seems to have good shard support. Is there a way to control these
    shards based on capacity?

Do you mean whether the shards are the same size? As far as I understand, the
data is split among shards evenly by default. So if you have 10MB of data and
5 shards, each shard would hold around 2MB. However, a new routing feature
(http://www.elasticsearch.com/docs/elasticsearch/rest_api/index/#RoutingAPI)
was implemented in 0.13.0, which gives you a chance to control shard routing
(see this ticket for details: "API: Allow to control document shard routing,
and search shard routing", issue #470 in elastic/elasticsearch on GitHub).
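
As an illustration of what routing means, here is a rough Python sketch of hash-based shard selection. It is not the actual ES implementation: zlib.crc32 is only a stand-in for whatever hash ES uses internally, and the function names are hypothetical. The point is just that a shard is picked from a hash of the routing value modulo the number of shards, and that the routing value defaults to the document ID unless you pass your own with the routing request parameter.

```python
import zlib
from collections import Counter
from urllib.parse import quote, urlencode


def shard_for(routing_value, num_shards):
    """Pick a shard from a hash of the routing value (crc32 is a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def index_path(index, doc_type, doc_id, routing=None):
    """REST path for indexing a document, optionally with a custom routing value."""
    path = "/%s/%s/%s" % (quote(index), quote(doc_type), quote(doc_id))
    if routing is not None:
        path += "?" + urlencode({"routing": routing})
    return path


if __name__ == "__main__":
    num_shards = 5
    # Default behaviour: route by document ID, which spreads documents out.
    spread = Counter(shard_for("doc-%d" % i, num_shards) for i in range(10000))
    print(sorted(spread.items()))
    # Custom routing: every document routed by the same user lands on one shard.
    print(shard_for("kimchy", num_shards))
    print(index_path("twitter", "tweet", "1", routing="kimchy"))
```

Note that custom routing trades even distribution for locality: if one routing value owns most of the documents, its shard will grow larger than the others.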

Regards,
Lukas

On Mon, Nov 29, 2010 at 1:27 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

  3. If I create three indices, say a, b and c, how do I tell ES to search across
    all of them?

See index aliases
(http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/aliases/).

Oops, I meant: see searching multiple indices
(http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/indices_types/),
but as I said above, you can also consider index aliases in this case.
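
To show the two options side by side, here is a small Python sketch that only builds the request path and body; it sends nothing. It assumes, as I understand the REST docs, that several indices can be searched by putting a comma-separated list in the path, and that aliases are managed by POSTing a list of add/remove actions to /_aliases. The helper names and the alias name "abc" are hypothetical, so please check the details against the docs for your version.

```python
import json


def search_path(indices):
    """Path for searching several indices in one request, e.g. /a,b,c/_search."""
    return "/" + ",".join(indices) + "/_search"


def alias_actions(alias, add=(), remove=()):
    """JSON body for POST /_aliases that re-points `alias` in a single call."""
    actions = [{"remove": {"index": name, "alias": alias}} for name in remove]
    actions += [{"add": {"index": name, "alias": alias}} for name in add]
    return json.dumps({"actions": actions})


if __name__ == "__main__":
    # Option 1: name the indices explicitly on every search.
    print("GET", search_path(["a", "b", "c"]))
    # Option 2: define an alias once, then search the alias like a single index.
    print("POST /_aliases", alias_actions("abc", add=["a", "b", "c"]))
    print("GET", search_path(["abc"]))
```

With the alias approach the client keeps searching the same name while you add new indices to it and remove old ones behind the scenes.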

Regards,
Lukas

Thanks a lot, Lukas, for the detailed reply. This helps a lot!

Best Regards
Gautam

Just a note regarding HTTP vs. native transport performance: I can get really
good performance with HTTP as well, compared to the native transport from
Java. It really depends on the HTTP library used by your programming language
of choice.
