Upgrades causing Elasticsearch downtime

Hello,

We've upgraded Elasticsearch twice over the last month and have
experienced downtime (roughly 8 minutes) during the roll-out. I'm not
sure whether it's something we are doing wrong or not.

We use EC2 instances for our Elasticsearch cluster and CloudFormation to
manage our stack. When we deploy a new version of or a change to
Elasticsearch, we upload the new artefact, double the number of EC2
instances and wait for the new instances to join the cluster.

For example, 6 nodes form a cluster on v0.90.7. We upload the 0.90.9
version via our deployment process and double the number of nodes in the
cluster (to 12). The 6 new nodes join the cluster running the 0.90.9
version.

We then want to remove each of the 0.90.7 nodes. We do this by shutting
down a node (using the head plugin), waiting for the cluster to
rebalance the shards and then terminating the EC2 instance, then
repeating with the next node. We leave the master node until last so
that re-election happens just once.
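
For reference, as far as I can tell the shutdown we trigger from head is
just the node shutdown API, i.e. roughly this (with <node-name> being
whatever the instance registered as):

curl -XPOST 'http://localhost:9200/_cluster/nodes/<node-name>/_shutdown'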

The issue we have found in the last two upgrades is that while the
penultimate node is shutting down, the master starts throwing errors and
the cluster goes red. To fix this we've stopped the Elasticsearch
process on the master and have had to restart each of the other nodes
(though perhaps they would have rebalanced themselves given a longer
time period?). We find that we send an increased number of error
responses to our clients during this time.

We've set our queue size for search to 300 and we start to see the
queue fill up:
2014-01-07 15:58:55,508 DEBUG action.search.type [Matt Murdock] [92036651] Failed to execute fetch phase
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 300) on org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2@23f1bc3
    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:61)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
    at java.lang.Thread.run(Thread.java:724)
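
The queue setting referred to above is, assuming I have the name right,
this line in our elasticsearch.yml:

threadpool.search.queue_size: 300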

We also see the following error, for which we've been unable to find a
diagnosis:
2014-01-07 15:58:55,530 DEBUG index.shard.service [Matt Murdock] [index-name][4] Can not build 'doc stats' from engine shard state [RECOVERING]
org.elasticsearch.index.shard.IllegalIndexShardStateException: [index-name][4] CurrentState[RECOVERING] operations only allowed when started/relocated
    at org.elasticsearch.index.shard.service.InternalIndexShard.readAllowed(InternalIndexShard.java:765)

Are we doing anything wrong or has anyone experienced this?

Thanks,
Jenny

Although Elasticsearch should support clusters of nodes with different
minor versions, I have seen issues between minor versions. Version
0.90.8 did contain an upgrade of Lucene (to 4.6), but that does not look
like it would cause your issue. You could look at the GitHub issues
tagged 0.90.8/0.90.9 and see if anything applies in your case.

A couple of points about upgrading:

If you want to use the double-the-nodes technique (which should not be
necessary for minor version upgrades), you could "decommission" a node
using the shard allocation settings. Here is a good writeup:
http://blog.sematext.com/2012/05/29/elasticsearch-shard-placement-control/
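
Roughly, the approach from that post (sketching from memory, so do check
the writeup; the IP below is just an example) is to exclude the node you
want to drain via the cluster settings API, wait for its shards to move
off and only then kill the instance:

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "10.0.0.1"
  }
}'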

Since you doubled the number of nodes in the cluster,
the minimum_master_nodes setting would be temporarily incorrect and a
split-brain cluster might occur. In fact, it might have occurred in your
case, since the cluster state seems incorrect. Merely hypothesizing.
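
If you do keep doubling the nodes, you could bump minimum_master_nodes
for the duration of the window and drop it back once the old nodes are
gone. A sketch, assuming all 12 nodes are master-eligible (quorum of 7),
falling back to 4 once you are down to 6 again - adjust for your own
topology:

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "discovery.zen.minimum_master_nodes" : 7
  }
}'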

Cheers,

Ivan

You can also use cluster.routing.allocation.disable_allocation to
reduce the need to wait for things to rebalance.
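
Something along these lines via the cluster settings API (a transient
setting, so it won't survive a full cluster restart; remember to set it
back to false when you're done):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.disable_allocation" : true
  }
}'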

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

Thanks both for the replies. Our rebalance process doesn't take too
long (~5 mins per node). I had some of the plugins (head, paramedic,
bigdesk) open as I was closing down the old nodes and didn't see any
split-brain issue, although I agree we could lead ourselves down this
route by doubling the instances. We want our cluster to rebalance as we
bring nodes in and out, so disabling allocation is not going to work for
us, unless I'm misunderstanding?

For myself, I decided to classify the update from 0.90.7 to 0.90.9 as a
major upgrade. Lucene changed, Java method signatures changed and new
features arrived. This may lead to trouble when mixing 0.90.7 and 0.90.9
nodes in a busy cluster.

Jörg

Disabling allocation is definitely a temporary-only change; you can set
it back once your upgrades are done.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

Perhaps I am missing some functionality since I am still on version 0.90.2,
but wouldn't you have to disable/enable allocation after each server
restart during a rolling upgrade? A restarted node will not host any shards
with allocation disabled.

Cheers,

Ivan

That setting tells the nodes to hold the shards they currently have
and, in the event of a node going down for a restart/upgrade, not to
redistribute them across the cluster.
When you bring the rebooted/upgraded node back it'll locally
reinitialise the shards it still has.

You can set that setting back to false when you have completed the
upgrades/restarts and the cluster can rebalance if it feels the need to.
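
In other words, per node, something like the following. This is only a
sketch of the sequence; restart Elasticsearch however you normally do:

# hold shard allocation while the node is down
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.disable_allocation" : true }
}'

# restart/upgrade the node and wait for it to rejoin the cluster

# allow allocation again so the cluster can rebalance if it needs to
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.disable_allocation" : false }
}'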

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

That is definitely not the behavior I have seen with Elasticsearch. If
you restart a node with allocation disabled, the restarted node will
have no shards, and the shards that it should contain are marked as
unassigned. I have never seen a node reinitialize the shards it has.

Cheers,

Ivan
