Elasticsearch thoughts; also, a bug


(Antonio Lobato) #1

Hey all. I just wanted to share my experiences with elasticsearch on
a 10 node cluster and my thoughts. I've had a lot of fun using it,
and will make heavy use of it in the future, :slight_smile:

http://blog.jinked.com/posts/30

I wanted to take the time to address two things really quickly
though. To answer Shay's question, the areas where Solr has
functionality that elasticsearch doesn't are trivial: things such as
the contrib data import handler. We actually do use it in one case,
but only because of a limitation in Solr that doesn't exist in
elasticsearch (NRT, to be exact). Moving over to elasticsearch from
Solr would eliminate our need for the data import handler, as we can
distribute/index this particular dataset in real time.

The second thing I wanted to bring up was a bug. I've seen this
happen before with certain software (including software I've written
myself), so it's not too uncommon. elasticsearch will listen on
0.0.0.0 by default, but if the machine has two IP addresses, it will
actually listen on one IP address and send requests/multicast out on
the other. The result is that elasticsearch will not become a member
of the cluster and will isolate itself on an island on that one
machine. Of course, one workaround is to define the IP to bind to in
the config, but it'd certainly go a long way for ease of use if we
could just set up elasticsearch and go, in particular since we share
a single config file on a cluster file system.
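For reference, the workaround looks something like this in elasticsearch.yml (the addresses and interface name here are just examples):

```yaml
# Pin the bind and publish addresses so a multi-homed machine doesn't
# listen on one interface while advertising another to the cluster.
network.bind_host: 192.168.1.10
network.publish_host: 192.168.1.10

# Or set both at once; special tokens like _en0_ or _non_loopback_
# resolve to an interface's address at startup.
network.host: _en0_
```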

Flippin' amazing software though. elasticsearch's cluster model is
how all multi-node software should function.

Cheers!
-Antonio


(K.B.) #2

As I'm coming from Solr, too:

At first I also missed the DIH. I thought of it as a nice and easy
way to index/update my data from the DB, and thought the XML file
needed for it was really cool. However, after my first steps with ES
and its Java API (which still feels a bit uncomfortable to me), I
found out that I don't miss the DIH anymore. In fact, I now enjoy the
greater ease of use and doing transformations directly in Java,
compared to Solr's DIH.

The only thing I miss from Solr to date is core swapping. In case you
want a completely new re-index of your data and empty your core
beforehand, all searches between the delete and the re-index come
back empty. This means that you need at least two cores, so that one
is searched while the other is re-indexed; Solr allows you to swap
two cores and their indices. For ES I came to the solution (thanks to
the IRC group) of using two indices and just swapping the aliases,
but this still feels less comfortable and secure, especially as I
couldn't find out to date how to get the index an alias points to via
the Java API.

A slight problem is also that, in case you index a lot of data into
one ES index, searches on other indices sometimes get too slow. A
configurable maximum impact rate would be nice here, so one could
make sure that at least, e.g., 50% of the CPU and RAM of ES is used
for searches and is not affected by index operations on other
indices.

my 2 cents

Best,

-Korbinian



(Shay Banon) #3

inlined
On Monday, March 21, 2011 at 4:51 AM, Antonio Lobato wrote:

Hey all. I just wanted to share my experiences with elasticsearch on
a 10 node cluster and my thoughts. I've had a lot of fun using it,
and will make heavy use of it in the future, :slight_smile:

http://blog.jinked.com/posts/30

I wanted to take the time to address two things really quickly
though. To answer Shay's question, the areas where Solr has
functionality that elasticsearch doesn't are trivial: things such as
the contrib data import handler. We actually do use it in one case,
but only because of a limitation in Solr that doesn't exist in
elasticsearch (NRT, to be exact). Moving over to elasticsearch from
Solr would eliminate our need for the data import handler, as we can
distribute/index this particular dataset in real time.
Not too familiar with the data import handler, but if you are missing the option to import data from a database and index it, then yes, elasticsearch does not do it automatically.

This is for a simple reason: I am not familiar with a nice way of configuring a simple import of data from a database that also makes sure to index changes made to the database. Also, because of the relational model, the data you want to index can come in different flavors (joins and what not), and then how does that relate to real-time change streaming from the database?

Sure, the simple case of indexing a single table and monitoring changes based on a timestamp (but then, how do you handle deletes?) can be implemented. But that's the exception, not the rule, especially with the relational model.

I never got the fascination of some developers with writing complex configuration trying to tell a piece of software what to do, compared to writing code that does the same in half the time and in a much more readable format (code, and it's your code). Writing a script or code that does exactly what you want with your (relational) data, and controlling how to index it, is much simpler than trying to understand how a complex mapping tool works and how to configure it.

That's my thinking. Of course, this does not mean that in the future you won't be able to write code that expresses how to index a database and have it run as a river.
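To make that concrete: a rough sketch, in Python, of the kind of glue code meant here. The row shape, document mapping, and bulk-action format are invented for illustration, not any real schema or client API:

```python
# Sketch: turn relational rows into index/delete actions, i.e. the
# small piece of custom code suggested above instead of a generic
# import tool. Row fields and the action shape are hypothetical.

def rows_to_actions(rows, index="products"):
    """Transform DB rows (dicts) into (action, document) pairs for a
    bulk indexing call; deleted rows become explicit delete actions."""
    actions = []
    for row in rows:
        if row.get("deleted"):
            actions.append(({"delete": {"_index": index, "_id": row["id"]}}, None))
        else:
            # Your transformation lives here, in plain code: pick
            # fields, join in related data, denormalize as needed.
            doc = {"name": row["name"], "price": row["price"]}
            actions.append(({"index": {"_index": index, "_id": row["id"]}}, doc))
    return actions

rows = [
    {"id": 1, "name": "widget", "price": 9.99, "deleted": False},
    {"id": 2, "name": "gadget", "price": 4.50, "deleted": True},
]
for action, doc in rows_to_actions(rows):
    print(action, doc)
```

Note how deletes fall out naturally here, which is exactly the case a timestamp-based import has trouble with.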

The second thing I wanted to bring up was a bug. I've seen this
happen before with certain software (including software I've written
myself), so it's not too uncommon. elasticsearch will listen on
0.0.0.0 by default, but if the machine has two IP addresses, it will
actually listen on one IP address and send requests/multicast out on
the other. The result is that elasticsearch will not become a member
of the cluster and will isolate itself on an island on that one
machine. Of course, one workaround is to define the IP to bind to in
the config, but it'd certainly go a long way for ease of use if we
could just set up elasticsearch and go, in particular since we share
a single config file on a cluster file system.
Heya, yeah, that's how multicast works, but you can also use unicast discovery if you want. Note, I tried to address this using the special tokens in the host configuration, like en0: http://www.elasticsearch.org/guide/reference/modules/network.html.
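For completeness, a minimal sketch of the unicast alternative in elasticsearch.yml (the host list is an example):

```yaml
# Disable multicast and explicitly ping a known list of nodes instead.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2"]
```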

Flippin' amazing software though. elasticsearch's cluster model is
how all multi-node software should function.
Thanks!

Cheers!
-Antonio


(Shay Banon) #4

inlined
On Monday, March 21, 2011 at 11:45 AM, K.B. wrote:

As I'm coming from Solr, too:

At first I also missed the DIH. I thought of it as a nice and easy
way to index/update my data from the DB, and thought the XML file
needed for it was really cool. However, after my first steps with ES
and its Java API (which still feels a bit uncomfortable to me), I
found out that I don't miss the DIH anymore. In fact, I now enjoy the
greater ease of use and doing transformations directly in Java,
compared to Solr's DIH.

The only thing I miss from Solr to date is core swapping. In case you
want a completely new re-index of your data and empty your core
beforehand, all searches between the delete and the re-index come
back empty. This means that you need at least two cores, so that one
is searched while the other is re-indexed; Solr allows you to swap
two cores and their indices. For ES I came to the solution (thanks to
the IRC group) of using two indices and just swapping the aliases,
but this still feels less comfortable and secure, especially as I
couldn't find out to date how to get the index an alias points to via
the Java API.
Less secure? Really? Aliasing gives you exactly that (I don't know what core swapping is in Solr, but I can guess). You can name an index my_index_xxx, alias it to my_index, and work against it. You can then index data into my_index_yyy and, once it's ready, make a simple call to swap the aliases in an atomic fashion across the cluster, pointing the my_index alias at my_index_yyy.
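For the record, the swap itself is one request against the _aliases endpoint; a sketch using the index names from above:

```json
{
    "actions": [
        { "remove": { "index": "my_index_xxx", "alias": "my_index" } },
        { "add": { "index": "my_index_yyy", "alias": "my_index" } }
    ]
}
```

Both actions are applied together, so the alias never points at nothing in between.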

I must stress this point: there is no way to do something in the REST API that you can't do in the Java API. The REST API is built on top of the Java API.

A slight problem is also that, in case you index a lot of data into
one ES index, searches on other indices sometimes get too slow. A
configurable maximum impact rate would be nice here, so one could
make sure that at least, e.g., 50% of the CPU and RAM of ES is used
for searches and is not affected by index operations on other
indices.
My guess is that what you're mostly using is IO, not CPU or RAM. You can do the throttling yourself, or you can add more nodes to the cluster. Some other systems would want exactly the opposite. There are ways to try to help with that, for example being able to separate specific indices and make sure they are not allocated on the same nodes. That's certainly on the roadmap.

my 2 cent

Best,

-Korbinian



(Clinton Gormley) #5

I never got the fascination of some developers with writing complex
configuration trying to tell a piece of software what to do, compared
to writing code that does the same in half the time and in a much
more readable format (code, and it's your code). Writing a script or
code that does exactly what you want with your (relational) data, and
controlling how to index it, is much simpler than trying to
understand how a complex mapping tool works and how to configure it.

Yeah - I agree.

That's my thinking. Of course, this does not mean that in the future
you won't be able to write code that expresses how to index a
database and have it run as a river.

Yes, for some databases or ORMs, there will be a single standard
(relatively simple) way of configuring an import/data-sync. For these,
it'd be worth making specific rivers available.

However, you obviously aren't going to have the detailed knowledge
required for defining a river for each of these - this should come from
people who actually use these projects.

Perhaps, we need an easy-to-follow doc for: "how to create a custom
river plugin"

clint


(Clinton Gormley) #6

I couldn't find out till to date how to get the index one alias
points
to in java API.

I must stress this point: there is no way to do something in the REST
API that you can't do in the Java API. The REST API is built on top of
the Java API.

I think what KB means here is that finding out what an alias points to
is currently not obvious.

You need to retrieve the index status, and examine each index to check
for what aliases it has. It'd be nice to have a simple API to return
that information instead.

In the Perl API, I have $es->get_aliases which returns:

{
    aliases: {
        alias_1: ['index_1', 'index_2'],
        alias_2: ['index_1']
    },
    indices: {
        index_1: ['alias_1', 'alias_2'],
        index_2: ['alias_1']
    }
}

clint


(Shay Banon) #7

Getting the aliases should be done using the cluster state API.
On Monday, March 21, 2011 at 12:49 PM, Clinton Gormley wrote:



(Shay Banon) #8

Perhaps, we need an easy-to-follow doc for: "how to create a custom
river plugin"

Note, in this case, the fact that it's a river is really not that important. The only thing you gain from it being a river is that elasticsearch will manage it and run it on one of the nodes in the cluster.
On Monday, March 21, 2011 at 12:41 PM, Clinton Gormley wrote:



(K.B.) #9

On 21 Mar., 11:32, Shay Banon shay.ba...@elasticsearch.com wrote:

The only thing I miss from Solr to date is core swapping. In case you
want a completely new re-index of your data and empty your core
beforehand, all searches between the delete and the re-index come
back empty. This means that you need at least two cores, so that one
is searched while the other is re-indexed; Solr allows you to swap
two cores and their indices. For ES I came to the solution (thanks to
the IRC group) of using two indices and just swapping the aliases,
but this still feels less comfortable and secure, especially as I
couldn't find out to date how to get the index an alias points to via
the Java API.

Less secure? Really? Aliasing gives you exactly that (I don't know what core swapping is in Solr, but I can guess). You can name an index my_index_xxx, alias it to my_index, and work against it. You can then index data into my_index_yyy and, once it's ready, make a simple call to swap the aliases in an atomic fashion across the cluster, pointing the my_index alias at my_index_yyy.

you got me wrong here: by "less secure" I mean that two parts of the
system are needed to make sure the transition works, while the SWAP
in Solr was introduced to make the operation atomic (two atomic
operations: 1st index, 2nd swap). The idea is that the controlling
app can also crash, lock up, or go down unexpectedly. However, swap
in Solr is quite similar to index aliasing, just that it stores the
relevant bookkeeping internally (simply put).

I must stress this point: there is no way to do something in the REST API that you can't do in the Java API. The REST API is built on top of the Java API.

yeah, I know, but "finding the right way" to do things is quite hard.

Later down you write: "Getting the aliases should be done using the
cluster state API."

This is something I would never have thought to look at. I asked
Clinton on ICQ about this, and all he could say was that it surely
has to work, but he only knows how in the Perl API, not the Java API
(see his answer).

And even now that I know to look at cluster() in the Java API, all I
can see there are a dozen functions relating to nodes, e.g.:

client.admin().cluster().?

while client.admin().indices() has many index-related functions, but
I wasn't able to find any way to get the aliases and their mappings.

A slight problem is also that, in case you index a lot of data into
one ES index, searches on other indices sometimes get too slow. A
configurable maximum impact rate would be nice here, so one could
make sure that at least, e.g., 50% of the CPU and RAM of ES is used
for searches and is not affected by index operations on other
indices.

My guess is that what you're mostly using is IO, not CPU or RAM. You can do the throttling yourself, or you can add more nodes to the cluster. Some other systems would want exactly the opposite. There are ways to try to help with that, for example being able to separate specific indices and make sure they are not allocated on the same nodes. That's certainly on the roadmap.

great to hear :slight_smile:

Best,

-K



(Karussell) #10

client.admin().cluster().?

while client.admin().indices(). has many index-related functions, but
I weren't able to find any way to get the aliases and their mappings;

a bit hidden, but CTRL + SPACE and http://elasticsearch.karmi.cz are
always your friends :wink:

client.admin().cluster().state(new ClusterStateRequest())
    .actionGet().getState().getMetaData().aliases();

client.admin().indices().aliases(new IndicesAliasesRequest()
    .addAlias(indexName, alias)).actionGet();

Regards,
Peter.

--

http://jetwick.com - Open Twitter Search


(K.B.) #11

Hello Peter,

that gives me all aliases as a Set, but not in a way that lets me say
aliasN is bound to indexN. All I can see is index("index").aliases()
to get all aliases of a given index. In case I have many indices and
want to see where myAlias is mapped to, I need to iterate over all
indices and their aliases just to find out how it is mapped.
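The inversion I mean is simple enough once you have the index -> aliases view; a Python sketch of the idea (the state dict is a made-up stand-in for the cluster-state metadata, not the real client return type):

```python
# Sketch: build the alias -> [indices] view by inverting the
# index -> [aliases] mapping found in the cluster-state metadata.
from collections import defaultdict

def aliases_to_indices(index_aliases):
    """Invert index -> [aliases] into alias -> [indices]."""
    inverted = defaultdict(list)
    for index, aliases in sorted(index_aliases.items()):
        for alias in aliases:
            inverted[alias].append(index)
    return dict(inverted)

state = {
    "index_1": ["alias_1", "alias_2"],
    "index_2": ["alias_1"],
}
print(aliases_to_indices(state))
# {'alias_1': ['index_1', 'index_2'], 'alias_2': ['index_1']}
```

It would be nicer to have this as a dedicated API call, of course, instead of each client re-deriving it.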

Best


