Embedding elasticsearch in an auto-scaling application

I'm looking to switch to ElasticSearch for the search engine inside
one of our Java server applications (a real-time news aggregator), but
I haven't been able to determine whether it is suitable for embedding
as part of a cluster when any member (or number of members) may be
added or removed at any time (via AWS elastic load balancing). In this
scenario the requirements are:

  1. An embedded node can be created on server startup, and joins the
    cluster
  2. The node will take part in indexing and query execution while it is
    alive (i.e. not just a remote client)
  3. The node can be shut down at any time, and cannot be depended on to
    ever revive once it shuts down
  4. There is (at least) one master node that does all the actual
    indexing, and always has a complete index

Is this setup possible (or efficient) with ES, and if so are there any
good references for configuring it in this way?

(Ideally we would just have a separate ES cluster that the servers
hit, but the deployment requirements for this particular application
dictate that everything needs to be self-contained in one Java
application.)

Jeremy

This is precisely the architecture we use at www.penpalkidsclub.com and our
soon to launch entrepreneurship portal. There are no dedicated instances.

Good to know it's doable. Can you tell me where I can start looking
to configure it this way? I'm having some trouble finding
documentation on the node configuration options.

-j

On Dec 16, 11:08 pm, James Cook jc...@pykl.com wrote:

This is precisely the architecture we use at www.penpalkidsclub.com and our
soon to launch entrepreneurship portal. There are no dedicated instances.

I'm still getting the hang of everything that AWS can do, but what
happens if all of your Elasticsearch nodes shut down due to a bug or
other issue?

I ask, because it sounds from your message like all of your instances
are using ephemeral storage...

Thanks,

-- Doug

Hi Doug,

All nodes are using ephemeral storage. We use the AWS S3 gateway for the
long-term persistence of our data. If all nodes go down, the data is safe in
S3. When we restart our cluster, the nodes recover their state from S3.

-- jim

Hi Jeremy,

You can start
here: http://www.elasticsearch.org/tutorials/2011/08/22/elasticsearch-on-ec2.html

That pretty much documents my configuration.

-- jim

On Mon, Dec 19, 2011 at 2:22 PM, James Cook jcook@pykl.com wrote:

Hi Doug,

All nodes are using ephemeral storage. We use the AWS S3 gateway for the
long-term persistence of our data. If all nodes go down, the data is safe in
S3. When we restart our cluster, the nodes recover their state from S3.

Ah, so that then brings me to my next question. When S3 is used, how
up to date are the indexes? My experience with S3 is that it's great
for data storage, but just not in real time. (EBS is better for
that...) I've considered using S3 with my setup, but I'm nervous that
I might have an outage and lose even a few minutes of indexed data.

Thanks (again!),

-- Doug

The S3 gateway is asynchronous, so it writes to S3 periodically rather than
on every operation. Anything that has reached the transaction log since the
last sync can therefore be lost. The downside is a potential loss of data
after a cluster failure if ES is your only persistence mechanism, but the
gateway data should never be in an inconsistent state.

A quick glance at the code seems to indicate that the gateways are written
to when a cluster state change is detected. Perhaps Shay can give us more
information about the internals.

I use Hazelcast as a memcache layer in front of ES, so I have the option of
writing all updates to both ES and another store (like MySQL). It's not
optimal, but it lets me pick and choose which data is most important to me
and gives me a "playback" option if a catastrophic failure occurs.
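
The dual-write and "playback" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual setup described in this thread: DocStore, MemoryStore, DualWriter and playback are hypothetical names, no Elasticsearch, Hazelcast or MySQL API is used, and in-memory maps stand in for both the search index and the second store.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DualWriteSketch {

    /** Hypothetical stand-in for any document store; NOT an Elasticsearch API. */
    interface DocStore {
        void put(String id, String json);
    }

    /** In-memory store used here in place of both ES and the durable store. */
    static class MemoryStore implements DocStore {
        final Map<String, String> docs = new LinkedHashMap<>();

        @Override
        public void put(String id, String json) {
            docs.put(id, json);
        }
    }

    /** Sends every update to the index, and critical updates to the durable store too. */
    static class DualWriter {
        private final DocStore durable;
        private final DocStore index;

        DualWriter(DocStore durable, DocStore index) {
            this.durable = durable;
            this.index = index;
        }

        void write(String id, String json, boolean critical) {
            if (critical) {
                // Write the durable copy first so a crash cannot leave it behind the index.
                durable.put(id, json);
            }
            index.put(id, json);
        }
    }

    /** Replays everything in the durable store into a freshly built index. */
    static int playback(MemoryStore durable, DocStore freshIndex) {
        int replayed = 0;
        for (Map.Entry<String, String> e : durable.docs.entrySet()) {
            freshIndex.put(e.getKey(), e.getValue());
            replayed++;
        }
        return replayed;
    }

    public static void main(String[] args) {
        MemoryStore durable = new MemoryStore(); // assumption: plays the role of MySQL
        MemoryStore index = new MemoryStore();   // assumption: plays the role of ES
        DualWriter writer = new DualWriter(durable, index);

        writer.write("1", "{\"title\":\"important story\"}", true);
        writer.write("2", "{\"title\":\"disposable story\"}", false);

        // Simulate losing the whole cluster, then rebuild from the durable copy.
        MemoryStore rebuilt = new MemoryStore();
        int replayed = playback(durable, rebuilt);
        System.out.println("replayed " + replayed + " document(s)");
    }
}
```

Only documents flagged as critical survive the simulated failure, which is the trade-off of picking and choosing what goes to the second store.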

The shared gateway is snapshotted at an interval; the setting is
index.gateway.snapshot_interval, and it defaults to 10s.
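
As a rough illustration of what that interval could mean for data at risk, here is a back-of-the-envelope sketch. It assumes, as a simplification, that whatever was indexed since the last completed snapshot is what would be lost if every node went down at once; the upload time and indexing rate are made-up inputs, and maxDocsAtRisk is a hypothetical helper, not part of Elasticsearch.

```java
import java.time.Duration;

public class SnapshotLossWindow {

    /**
     * Upper bound on documents indexed since the last completed snapshot,
     * under the simplifying assumption described above.
     */
    static long maxDocsAtRisk(Duration snapshotInterval, Duration uploadTime, double docsPerSecond) {
        // Worst case: a failure just before the next snapshot finishes uploading.
        Duration window = snapshotInterval.plus(uploadTime);
        return (long) Math.ceil(window.toMillis() / 1000.0 * docsPerSecond);
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofSeconds(10); // default snapshot interval quoted in this thread
        Duration upload = Duration.ofSeconds(5);    // assumption: time to push a snapshot to S3
        double rate = 20.0;                         // assumption: documents indexed per second

        System.out.println("docs at risk (upper bound): " + maxDocsAtRisk(interval, upload, rate));
    }
}
```

With inputs like these the exposure is on the order of seconds of indexing rather than minutes, but the real figure depends on how long each snapshot takes to complete.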

On Tue, Dec 20, 2011 at 10:51 PM, James Cook jcook@pykl.com wrote:

The S3 gateway is asynchronous, so it writes to S3 periodically rather than
on every operation. Anything that has reached the transaction log since the
last sync can therefore be lost. The downside is a potential loss of data
after a cluster failure if ES is your only persistence mechanism, but the
gateway data should never be in an inconsistent state.

A quick glance at the code seems to indicate that the gateways are written
to when a cluster state change is detected. Perhaps Shay can give us more
information about the internals.

I use Hazelcast as a memcache layer in front of ES, so I have the option of
writing all updates to both ES and another store (like MySQL). It's not
optimal, but it lets me pick and choose which data is most important to me
and gives me a "playback" option if a catastrophic failure occurs.

Thanks for the info James. One more configuration issue I am running
up against - the primary indexing server is in a separate (non-AWS)
datacenter, which will feed the AWS elasticsearch cluster. Is it
possible to keep a full index on the primary, which distributes index
changes to AWS, but will not handle any queries for the cluster (due
to latency issues)?

On Dec 19 2011, 1:23 pm, James Cook jc...@pykl.com wrote:

Hi Jeremy,

You can start
here: http://www.elasticsearch.org/tutorials/2011/08/22/elasticsearch-on-ec2.html

That pretty much documents my configuration.

-- jim

In elasticsearch, both the primary shard and the replica perform indexing.

On Fri, Jan 6, 2012 at 7:18 PM, Jeremy Jongsma jjongsma@gmail.com wrote:

Thanks for the info James. One more configuration issue I am running
up against - the primary indexing server is in a separate (non-AWS)
datacenter, which will feed the AWS elasticsearch cluster. Is it
possible to keep a full index on the primary, which distributes index
changes to AWS, but will not handle any queries for the cluster (due
to latency issues)?
