I'm looking to switch to ElasticSearch for the search engine inside
one of our Java server applications (a real-time news aggregator), but
I haven't been able to determine whether it is suitable for embedding
as part of a cluster where any number of members may be added or
removed at any time (via AWS elastic load balancing). In this
scenario the requirements are:

- An embedded node can be created on server startup and joins the
  cluster.
- The node takes part in indexing and query execution while it is
  alive (i.e. it is not just a remote client).
- The node can be shut down at any time and cannot be depended on to
  ever come back once it shuts down.
- There is (at least) one master node that does all the actual
  indexing and always has a complete index.

Is this setup possible (or efficient) with ES, and if so, are there any
good references for configuring it this way?
(Ideally we would just have a separate ES cluster that the servers
hit, but the deployment requirements for this particular application
dictate that everything needs to be self-contained in one Java
application.)
Good to know it's doable. Can you tell me where I can start looking
to configure it this way? I'm having some trouble finding
documentation on the node configuration options.
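As a starting point, here is a rough sketch of the kind of settings an embedded, ephemeral node in this setup might carry. It only builds a plain key/value map in Java and does not call Elasticsearch itself. The setting names are the ones I recall from the 0.x-era documentation and the AWS cloud plugin, so please verify them against the version you run. The cluster and bucket names are hypothetical placeholders, and making the throwaway nodes non-master-eligible is my assumption about the layout described in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EmbeddedNodeSettings {

    /**
     * Settings for a short-lived embedded node that holds data and serves
     * queries but is not eligible to become master (an assumption, since
     * these nodes can disappear at any time).
     */
    static Map<String, String> forEphemeralDataNode(String clusterName, String s3Bucket) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("cluster.name", clusterName);      // must match on every node
        settings.put("node.data", "true");              // holds shards, indexes and searches
        settings.put("node.master", "false");           // assumption: masters live elsewhere
        settings.put("discovery.type", "ec2");          // assumes the AWS cloud plugin is installed
        settings.put("gateway.type", "s3");             // long-term state kept in S3
        settings.put("gateway.s3.bucket", s3Bucket);    // hypothetical bucket name
        return settings;
    }

    public static void main(String[] args) {
        // Placeholder names, for illustration only.
        forEphemeralDataNode("news-search", "my-es-gateway-bucket")
                .forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```

The same key/value pairs could equally go in the node's configuration file; the map form is just convenient when the node is started from inside the application.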
All nodes are using ephemeral storage. We use the AWS S3 gateway for the
long-term persistence of our data. If all nodes go down, the data is safe in
S3. When we restart our cluster, the nodes recover their state from S3.
On Mon, Dec 19, 2011 at 2:22 PM, James Cook jcook@pykl.com wrote:
Hi Doug,
Ah, so that then brings me to my next question. When S3 is used, how
up to date are the indexes? My experience with S3 is that it's great
for data storage, but just not in real time. (EBS is better for
that...) I've considered using S3 with my setup, but I'm nervous that
I might have an outage and lose even a few minutes of indexed data.
The S3 gateway is asynchronous: it writes to S3 periodically, so anything
that has reached the transaction log since the last sync can be lost. The
downside is a potential loss of data on a full cluster failure if ES is your
only persistence mechanism, but the gateway data should never be in an
inconsistent state.
A quick glance at the code seems to indicate that the gateways are written
to when a cluster state change is detected. Perhaps Shay can give us more
information about the internals.
I use Hazelcast as a caching layer in front of ES, so I have the option of
writing all updates to ES and to another store (like MySQL). Not optimal, but
it allows me to pick and choose which data is most important to me and
gives me a "playback" option if a catastrophic failure occurs.
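To make the "playback" idea concrete, here is a minimal sketch of a dual write, not James's actual code. Every name in it is hypothetical: `SearchIndex` stands in for ES, `DurableLog` stands in for the second store (MySQL in his case), and both are simple in-memory classes so the example runs on its own. Important updates are appended to the log before they are indexed, and after a failure the log is replayed into a fresh index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DualWriteSketch {

    /** Hypothetical stand-in for the search engine. */
    interface SearchIndex {
        void index(String id, String doc);
    }

    /** In-memory index used only for this illustration. */
    static final class InMemoryIndex implements SearchIndex {
        final Map<String, String> docs = new HashMap<>();

        @Override
        public void index(String id, String doc) {
            docs.put(id, doc);
        }
    }

    /** Hypothetical stand-in for the durable second store. */
    static final class DurableLog {
        private final List<String[]> entries = new ArrayList<>();

        void append(String id, String doc) {
            entries.add(new String[] {id, doc});
        }

        /** Re-applies every logged update, in order, to the given index. */
        void replayInto(SearchIndex target) {
            for (String[] entry : entries) {
                target.index(entry[0], entry[1]);
            }
        }
    }

    private final SearchIndex index;
    private final DurableLog log;

    DualWriteSketch(SearchIndex index, DurableLog log) {
        this.index = index;
        this.log = log;
    }

    /** Logs the update first if it is marked important, then indexes it. */
    void write(String id, String doc, boolean important) {
        if (important) {
            log.append(id, doc);
        }
        index.index(id, doc);
    }

    public static void main(String[] args) {
        DurableLog log = new DurableLog();
        DualWriteSketch writer = new DualWriteSketch(new InMemoryIndex(), log);
        writer.write("1", "breaking story", true);
        writer.write("2", "low-value item", false);

        // Simulate a catastrophic failure: start from an empty index and replay.
        InMemoryIndex rebuilt = new InMemoryIndex();
        log.replayInto(rebuilt);
        System.out.println("recovered ids: " + rebuilt.docs.keySet());
    }
}
```

Only the updates flagged as important survive the replay, which is the "pick and choose" trade-off: less write overhead in exchange for accepting the loss of low-value data.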
The shared gateway is snapshotted at a fixed interval. The setting
is index.gateway.snapshot_interval, and it defaults to 10s.
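For a feel of what that interval means in practice, here is a back-of-the-envelope calculation. It assumes a steady indexing rate and a total cluster loss just before the next snapshot, and it ignores the time the upload to S3 itself takes, so treat the result as a rough floor on exposure rather than a guarantee. The class and method names are made up for the example.

```java
final class SnapshotLossWindow {

    /**
     * Approximate number of documents indexed during one snapshot interval,
     * i.e. what could be lost if every node died just before the next snapshot.
     */
    static long docsAtRiskPerInterval(double docsPerSecond, long snapshotIntervalMillis) {
        return (long) Math.ceil(docsPerSecond * snapshotIntervalMillis / 1000.0);
    }

    public static void main(String[] args) {
        // 50 docs/s is an illustrative rate; 10_000 ms is the 10s default interval.
        System.out.println(docsAtRiskPerInterval(50.0, 10_000)); // prints 500
    }
}
```

At the default of 10s the window is seconds rather than minutes, provided the snapshots keep up with the indexing load.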
On Tue, Dec 20, 2011 at 10:51 PM, James Cook jcook@pykl.com wrote:
Thanks for the info James. One more configuration issue I am running
up against - the primary indexing server is in a separate (non-AWS)
datacenter, which will feed the AWS elasticsearch cluster. Is it
possible to keep a full index on the primary, which distributes index
changes to AWS, but will not handle any queries for the cluster (due
to latency issues)?
On Dec 19 2011, 1:23 pm, James Cook jc...@pykl.com wrote:
In elasticsearch, both the primary shard and the replica perform indexing.
On Fri, Jan 6, 2012 at 7:18 PM, Jeremy Jongsma jjongsma@gmail.com wrote: