We're looking at deploying Elasticsearch on EC2 to power the search for
our new product. After crawling elasticsearch.org for tutorials, I found
the EC2 tutorial, which was quite helpful, but I didn't find any guides on
how I should implement high availability for Elasticsearch.
All our current production systems are configured and deployed so that any
machine can fail: all traffic, requests etc. are directed to the working
nodes, and all failover procedures have been fully automated. This results
in high availability without any human intervention (the techops team does
monitoring, but they don't actually need to do much, because of the high
amount of automation).
Now I'm thinking about how I can do all this with Elasticsearch, but I don't
have any good pointers. I know that Elasticsearch is powered by Lucene and
that indices are distributed into shards, each of which has one primary and
one or more replicas. What if a master fails? I found some info about master
election and also about the EC2 discovery module
at http://www.elasticsearch.org/guide/reference/modules/discovery/ but that
doesn't really tell me what happens when the master dies. Is there
additional documentation?
What about the RESTful endpoint? What if the machine running it dies? We
are running HAProxy in our production environment, so I could very well use
that in front, but I couldn't find any good guides on that topic either. Do
I have to configure scripts which change DNS CNAMEs to point to the new
master, or can/should I use EC2 Elastic IP addresses?
All responses and links to relevant tutorials are greatly appreciated.
Elasticsearch is "highly available" by design and by default. As you
write, shards are replicated across the cluster for better performance and
availability. If a primary shard is not available (a node goes down, etc.),
one replica will take over the role of the primary.
Elasticsearch clusters, again by default, "take care of themselves".
Either with multicast or a properly configured unicast topology, a new node
with the correct cluster.name setting will join the cluster and will
start serving queries, storing shards, etc.
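For example, since multicast typically doesn't work on EC2, a minimal elasticsearch.yml for the era might pin the cluster name and use unicast seeds or the cloud-aws plugin's EC2 discovery. This is a sketch only; the cluster name and IPs are placeholders:

```yaml
# elasticsearch.yml sketch -- names and addresses are placeholders
cluster.name: my-search-cluster

# Multicast does not work on EC2, so disable it and list seed nodes...
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.10", "10.0.0.11", "10.0.0.12"]

# ...or, with the cloud-aws plugin installed, let nodes discover
# each other through the EC2 API instead:
# discovery.type: ec2
```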
Elasticsearch clusters don't have a SPOF master node: any node which is
configured to be master-eligible (which is the default) can become a master.
Thus, if the master goes down, another node will take over the role.
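To make that failover safe, a common precaution is to require a quorum of master-eligible nodes, so a network partition can't elect two masters at once. A sketch, assuming three master-eligible nodes (adjust the value to n/2 + 1 for your node count):

```yaml
# elasticsearch.yml: with 3 master-eligible nodes, require a
# quorum of 2 before a master can be elected (avoids split brain)
discovery.zen.minimum_master_nodes: 2
```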
Provided you have enough nodes to place replicas on, your cluster is highly
available by default. When a node goes down, unless you're running at
capacity, you just launch another one, and it will take on the duties
without human intervention (apart from monitoring).
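As an illustration, a single replica per shard is enough to survive any one node failing, since the replica lives on a different node than its primary. The setting below is a sketch of a node-level default; it can also be changed per index at runtime:

```yaml
# elasticsearch.yml: one extra copy of each shard, placed on a
# different node, so any single node can fail without data loss
index.number_of_replicas: 1
```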
Your HTTP requests are automatically routed across the nodes in the
cluster. If you use a single IP in your application code, though, that is a
point of failure if that specific machine goes down. There are
multiple approaches to solving this. Some client libraries automatically
perform requests in a round-robin fashion and detect healthy nodes. Another
approach is to launch an Elasticsearch "client node" which serves as a
proxy, automatically discovering new nodes in the cluster, etc. You could
also use Nginx or HAProxy in a similar fashion, making sure you
periodically update the list of nodes in its configuration.
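The round-robin-with-failover idea above can be sketched in a few lines. This is a simplified illustration, not the API of any particular client library, and the endpoint URLs are placeholders:

```python
import itertools

class RoundRobinPool:
    """Rotate over Elasticsearch HTTP endpoints, skipping dead ones."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.dead = set()
        self._cycle = itertools.cycle(self.hosts)

    def mark_dead(self, host):
        # Called by the caller after a failed request to this host.
        self.dead.add(host)

    def mark_alive(self, host):
        # Called after a successful health check resurrects a node.
        self.dead.discard(host)

    def next_host(self):
        # Try each host at most once per call; fail if all are dead.
        for _ in range(len(self.hosts)):
            host = next(self._cycle)
            if host not in self.dead:
                return host
        raise RuntimeError("no live Elasticsearch nodes")

pool = RoundRobinPool(["http://10.0.0.10:9200", "http://10.0.0.11:9200"])
```

The application would call `next_host()` before each request and `mark_dead()` on connection errors, retrying against the next node instead of failing outright.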
EC2 Elastic IPs are not well suited for this purpose, since you are billed
for the traffic and it goes over the external network path (unless I'm
mistaken or the rules have changed). The EC2 Elastic Load Balancer (ELB) has
the same characteristics, unfortunately, in addition to being overloaded/not
resilient enough at times. A dedicated Nginx/HAProxy/Pound/etc. proxy
machine is a much better solution from an architectural point of view.
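A minimal HAProxy configuration for such a dedicated proxy machine might look like the sketch below. The node addresses are placeholders, and as noted above the server list still has to be kept in sync as nodes come and go:

```
# haproxy.cfg excerpt -- balance HTTP across Elasticsearch nodes
frontend elasticsearch
    bind *:9200
    default_backend es_nodes

backend es_nodes
    balance roundrobin
    option httpchk GET /_cluster/health
    server es1 10.0.0.10:9200 check
    server es2 10.0.0.11:9200 check
    server es3 10.0.0.12:9200 check
```

The health check means a node that stops answering /_cluster/health is taken out of rotation automatically, which covers the "machine running the RESTful endpoint dies" case.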
Karel
On Tuesday, December 11, 2012 12:39:39 PM UTC+1, Juho Mäkinen wrote: