How persistence works in ElasticSearch


(Berkay Mollamustafaoglu-2) #1

I could use some help verifying/understanding how persistence works. Here is
my understanding of it:

Regardless of whether the index is stored in memory or file system, it is
considered temporary and removed when the node is stopped, hence if all the
nodes in the cluster stop, indices would be lost.

As such, for persistence a write-behind gateway needs to be used. The gateway
keeps a transaction log and (periodically?) creates indices. If all the
nodes in the cluster are stopped and restarted, the indices and the
transaction logs created by the gateway are used to recreate the node indices.

Is this right?

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Mar 26, 2010 at 1:10 PM, Tim Robertson timrobertson100@gmail.com wrote:

Thanks Shay,

So... reading between the lines, does it then use protobufs (or other?) for
RPC instead of JSON and serializing and deserializing?

Cheers
Tim

On Fri, Mar 26, 2010 at 3:58 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Hi,

You won't enjoy locality between elasticsearch and hadoop in any case,
since the two use different distribution models. Locality would only make
sense for the indexing part, and I think you probably won't really need
it (it should be fast enough).

What language are you going to write your jobs in? If Java, then make
use of the native Java client (obtained from a started "non data" Server)
and not HTTP. More here:
http://www.elasticsearch.com/docs/elasticsearch/java_api/client/#Server_Client

-shay.banon

On Fri, Mar 26, 2010 at 5:35 PM, Tim Robertson <timrobertson100@gmail.com> wrote:

Hey,

Is anyone building their indexes using Hadoop? If so, are they deploying
ES across the same cluster as Hadoop and trying to reduce network noise by
making use of data locality, or keeping the clusters separate and just
calling over HTTP from MapReduce when building the indexes? I am about to
set up on EC2, and planned to keep the search and processing machines
separate.

Cheers,
Tim


(Shay Banon) #2

Yes, that's basically how it works. Regarding the transaction log: the
gateway is responsible for mirroring the current shard Lucene index and
the delta transaction log. When a commit occurs (either through an API call
or automatically by elasticsearch), the transaction log is flushed.
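
To make the flow concrete, here is a rough toy model in Python. It is only a
sketch of the idea, not elasticsearch code: the names Shard, Gateway,
index_doc, commit, snapshot and recover are all made up for illustration, and
a plain dict stands in for the Lucene index.

```python
class Shard:
    """Toy shard: a committed index plus a log of operations since the last commit."""

    def __init__(self):
        self.committed = {}   # stands in for the committed Lucene index
        self.translog = []    # operations received since the last commit

    def index_doc(self, doc_id, doc):
        # New operations are recorded in the transaction log first.
        self.translog.append((doc_id, doc))

    def commit(self):
        # On commit the logged operations become part of the index
        # and the transaction log starts over (it is "flushed").
        for doc_id, doc in self.translog:
            self.committed[doc_id] = doc
        self.translog = []

    def view(self):
        # Everything the shard knows about: committed docs plus logged ones.
        docs = dict(self.committed)
        docs.update(self.translog)
        return docs


class Gateway:
    """Toy gateway: mirrors the shard's index and its delta transaction log."""

    def __init__(self):
        self.committed = {}
        self.translog = []

    def snapshot(self, shard):
        # Hypothetical write-behind step, run some time after the writes happened.
        self.committed = dict(shard.committed)
        self.translog = list(shard.translog)

    def recover(self):
        # After a full cluster restart, rebuild a shard from the mirrored state.
        shard = Shard()
        shard.committed = dict(self.committed)
        shard.translog = list(self.translog)
        return shard


def demo():
    gateway = Gateway()
    shard = Shard()
    shard.index_doc("1", {"user": "a"})
    shard.commit()                        # doc 1 is now in the index
    shard.index_doc("2", {"user": "b"})   # doc 2 lives only in the translog
    gateway.snapshot(shard)               # gateway mirrors index + delta log
    shard.index_doc("3", {"user": "c"})   # written after the last snapshot
    del shard                             # all nodes stop; local state is gone
    return gateway.recover().view()
```

In this toy, demo() brings back docs 1 and 2 but not doc 3, since doc 3 was
written after the last snapshot. That is the trade-off of writing to the
gateway asynchronously.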

The benefit of this architecture is that the indexable state of the cluster
can be written asynchronously, and the actual storage of the index is
irrelevant for long-term persistence. This means you can still store the
index in memory (or just parts of it, with the upcoming cacheable FS
storage) and not lose it on failure.

-shay.banon


