ES Failures and Recovery


(Anand Nalya) #1

Hi,

We are using ES version 0.90.1 and are planning for high availability
testing. While the overall scheme for making the cluster highly available
is clear, I wanted to get some idea about ES service lifetime in terms of
mean time to failure and time to recovery when failures occur. Any
historical evidence would also help, as it will be vital for us to
calculate the actual availability of the system over a year.

While I understand that ES provides seamless high availability through
replication, any failure will still impact performance to some extent, and
this calculation will help us derive the actual number of nodes we should
provision without compromising performance while the system is available.
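For what it's worth, once you have MTTF and MTTR estimates, the yearly availability arithmetic described above is simple; here is a minimal sketch, where the MTTF/MTTR figures are made-up placeholders, not measured ES numbers:

```python
# Classic steady-state availability: MTTF / (MTTF + MTTR).
# The figures below are illustrative placeholders, not ES measurements.

HOURS_PER_YEAR = 365 * 24

mttf_hours = 2000.0   # assumed mean time to failure of a node
mttr_hours = 0.5      # assumed mean time to recover (replica promotion etc.)

availability = mttf_hours / (mttf_hours + mttr_hours)
expected_downtime_hours = (1 - availability) * HOURS_PER_YEAR

print(f"availability: {availability:.5f}")
print(f"expected downtime per year: {expected_downtime_hours:.2f} h")
```

With real failure logs you would substitute observed means for the placeholders and repeat the calculation per node count.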

Any ideas/facts would be very helpful.

Thanks,
Anand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David G Ortega) #2

Hi Anand,

Difficult to answer; at least I can't. It depends so much on the number of
shards, replicas and mappings that you have in your cluster.
If you shard and one or several nodes go down, the cluster is going to be
at least yellow, and sometimes you will have to restart the whole cluster.

In my experience sharding is bad. As in any other solution, sharding is the
last thing you would do.

Highly dynamic mappings make ES start indexes painfully slow. What do I
mean by this? I mean objects with dynamic properties like:

//one item
{"sort":{"user123":"onevalue"}}

//other item
{"sort":{"user122":"onevalue"}}

Defining that dynamic field userxxx makes the mapping huge and the start
really slow. If your data has structures like that, change them.
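To make the problem concrete, here is a small Python sketch (my own illustration, not ES code) of why per-user keys bloat the mapping: every distinct key becomes a new mapping field, whereas a key/value array keeps the mapping at two fields no matter how many users exist:

```python
# Each distinct JSON key under "sort" becomes its own field in the ES
# mapping, so N users -> N mapping entries.  Restructuring the document
# as a list of {"user": ..., "value": ...} objects keeps the mapping
# fixed at two fields regardless of the number of users.

def mapping_fields_dynamic(docs):
    """Count distinct field names ES would add for {"sort": {userX: v}} docs."""
    fields = set()
    for doc in docs:
        fields.update(doc["sort"].keys())
    return len(fields)

def mapping_fields_restructured(docs):
    """The key/value-array form always maps the same two fields."""
    return len({"user", "value"})

dynamic_docs = [{"sort": {f"user{i}": "onevalue"}} for i in range(10_000)]
print(mapping_fields_dynamic(dynamic_docs))       # grows with the user count
print(mapping_fields_restructured(dynamic_docs))  # stays at 2
```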

For me, three recommendations are:

  • Try not to shard if the index does not need it.
  • Try to always have the same number of replicas and shards across the
    indexes.
  • Avoid dynamic mappings.
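If it helps, the first two recommendations translate into explicit settings at index-creation time instead of taking the 5-shard default; a sketch of the settings body (the counts are example values, not recommendations for any particular workload):

```python
import json

# Explicit shard/replica counts pinned at index-creation time.
# The numbers here are example values only.
settings = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
        }
    }
}

body = json.dumps(settings)
# This body would be PUT to http://localhost:9200/<index-name>
print(body)
```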



(Anand Nalya) #3

Hi David,

Thanks for the reply.

We're planning to deploy ES without replication and with around 10 shards
per index. Also, I'm disabling automatic index creation and will create the
mapping manually for each new index.

Also, you said that sharding should be the last resort. The throughput that
I need cannot be served with a single shard per index. Is there some other
way to achieve parallelism in ES other than sharding?
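For context on why shards give write parallelism: each document id hashes to exactly one shard, so indexing spreads across up to as many nodes as there are primary shards. A rough illustration (ES 0.90 uses its own internal hash function; the md5-mod-N scheme below is only a stand-in):

```python
import hashlib
from collections import Counter

# Each document id is routed to one shard, so bulk writes to a 10-shard
# index can proceed on up to 10 nodes in parallel.  md5 % num_shards is
# a simplified stand-in for ES's internal routing hash.

def route(doc_id: str, num_shards: int) -> int:
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

counts = Counter(route(f"doc-{i}", 10) for i in range(100_000))
print(sorted(counts))  # all 10 shards receive a share of the writes
```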

Thanks,
Anand

On 12 June 2013 13:53, David G Ortega g.ortega.david@gmail.com wrote:


(David G Ortega) #4

Yes, ES is multitenant, so you can split the data across several indexes.

Many users have used ES as a central log store, and the best approach has
been splitting the data (also because of the archiving needs).
That said, I have only pointed out that sharding should be done only if
your data needs it.
ES creates 5 shards by default, and that does not look right to me. With
this configuration, small indexes are going to result in clusters that are
much more prone to errors.

The multitenant capability of ES makes this possible:

http://localhost:9200/node1,node2/_search

The difference in your case would basically be that a cluster with ten
indexes is going to be more robust, as far as I have experienced, since I
sometimes had to restart the whole cluster. Another difference is that you
can direct your search by asking only for the indexes you want instead of
having to route explicitly.
In super fast apps you could query just one index to show results while
querying the whole set and then update...
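Composing that multi-index search is just a comma-separated list of index names in the URL path; a sketch (the index names are hypothetical, and actually running the query would need a cluster on localhost:9200):

```python
# Multi-index search: comma-separated index names in the URL path.
# Index names below are made up for illustration.

indexes = ["logs-2013-06-10", "logs-2013-06-11"]
url = "http://localhost:9200/{}/_search".format(",".join(indexes))
print(url)
```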

Try both and share your experience :)



(Anand Nalya) #5

Hi David,

We are already creating separate indices for different time windows. What
we were looking for is a way to achieve parallel writes within a given
window, hence using ES shards.

Also, regarding the whole-cluster ES restarts, are there any specific
scenarios that trigger this? I've also seen some cases where a single ES
node fails with OOM and then refuses to come up, bringing down the whole
cluster.

Also, is there some guideline for calculating the total time to restart the
cluster on the basis of data/index size?

Thanks,
Anand

On Wednesday, 12 June 2013 15:24:50 UTC+5:30, David G Ortega wrote:



(David G Ortega) #6

"Also, is there some guideline to calculate the total time to restart the
cluster on the basis of data/index size."

I don't think the recovery time scales that much with the index size, at
the Lucene level at least; at the ES level it is probably doing some kind
of data integrity check between nodes. Kimchy would be the best person to
answer that.
In my experience the recovery time was affected by the size of the mapping
and the cluster configuration. I think that the workflow of ES is something
like:

  • node start (watch the plugins; some take a while to start)
  • cluster discovery
  • recover indexes and mappings
  • data exchange between nodes (here size is important)
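The four steps above suggest a back-of-the-envelope model for restart time; this is my own rough sketch, and every constant in it is an invented placeholder, not a measured ES figure:

```python
# Back-of-the-envelope restart model from the steps above:
#   node start + discovery + mapping recovery + shard data exchange.
# All constants are invented placeholders for illustration.

def restart_time_s(mapping_mb: float, transfer_gb: float,
                   node_start_s: float = 30.0,
                   discovery_s: float = 10.0,
                   mapping_s_per_mb: float = 2.0,
                   net_gb_per_s: float = 0.1) -> float:
    return (node_start_s + discovery_s
            + mapping_mb * mapping_s_per_mb
            + transfer_gb / net_gb_per_s)

# e.g. a 5 MB mapping and 20 GB of shard data to re-sync:
print(restart_time_s(mapping_mb=5, transfer_gb=20))  # 250.0 seconds
```

Measuring the per-step times on a test cluster would let you replace the placeholders with real coefficients.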

Can I give you the Big O of this? I'm afraid to say no, at least at the
moment, but you have given me a great task ;)

If you are so concerned about it, maybe you could set up another cluster
and push into it...

For me, an OOM has always ended with restarting all but the master nodes in
the cluster... How much data are you handling?



(Venu Nayar) #7

Did you do a failure analysis and what were your findings?

Thanks,
Venu

On Tuesday, June 11, 2013 7:41:57 AM UTC-7, Anand Nalya wrote:


