Hello,
We've upgraded Elasticsearch twice over the last month and have
experienced downtime (roughly 8 minutes) during the rollout. I'm not sure
whether it is something we are doing wrong.
We use EC2 instances for our Elasticsearch cluster and CloudFormation to
manage our stack. When we deploy a new version of, or a change to,
Elasticsearch, we upload the new artefact, double the number of EC2
instances and wait for the new instances to join the cluster.
For example, 6 nodes form a cluster on v0.90.7. We upload the 0.90.9
version via our deployment process and double the number of nodes in the
cluster (to 12). The 6 new nodes join the cluster running 0.90.9.
We then want to remove each of the 0.90.7 nodes. We do this by shutting
down a node (using the head plugin), waiting for the cluster to rebalance
the shards and then terminating the EC2 instance. We then repeat with the
next node. We leave the master node until last so that re-election happens
just once.
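To make the "wait for the cluster to rebalance" step concrete, here is a
minimal Python sketch of the kind of check involved. It is an illustration
only, not our actual tooling: the function names, the timeout values and
the base URL are assumptions. It relies on the cluster health API
(/_cluster/health) and the status, number_of_nodes, relocating_shards,
initializing_shards and unassigned_shards fields in its response.

```python
import json
import time
from urllib.request import urlopen


def fetch_health(base_url):
    """GET /_cluster/health from one node and return the parsed JSON body."""
    with urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_settled(health, expected_nodes):
    """True when the cluster is green, has the expected node count,
    and no shards are relocating, initializing or unassigned."""
    return (
        health.get("status") == "green"
        and health.get("number_of_nodes") == expected_nodes
        and health.get("relocating_shards") == 0
        and health.get("initializing_shards") == 0
        and health.get("unassigned_shards") == 0
    )


def wait_until_settled(fetch, expected_nodes, timeout_s=1800, poll_s=10,
                       sleep=time.sleep):
    """Poll fetch() until is_settled() holds; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if is_settled(fetch(), expected_nodes):
                return True
        except (OSError, ValueError):
            # Node unreachable or response not valid JSON: retry.
            pass
        sleep(poll_s)
    return False
```

After shutting down one of 12 nodes, this would be called as
wait_until_settled(lambda: fetch_health("http://localhost:9200"), 11)
(the address is a placeholder), and the EC2 instance terminated only if it
returns True.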
The issue we have found in the last two upgrades is that while the
penultimate node is shutting down, the master starts throwing errors and
the cluster goes red. To fix this we've stopped the Elasticsearch process
on the master and have had to restart each of the other nodes (though
perhaps they would have rebalanced by themselves given more time?). We
find that we send an increased number of error responses to our clients
during this time.
We've set our queue size for search to 300 and we start to see the queue
fill up:
    at java.lang.Thread.run(Thread.java:724)
2014-01-07 15:58:55,508 DEBUG action.search.type [Matt Murdock] [92036651] Failed to execute fetch phase
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 300) on org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2@23f1bc3
    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:61)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
We also see the following error, which we've been unable to diagnose:
2014-01-07 15:58:55,530 DEBUG index.shard.service [Matt Murdock] [index-name][4] Can not build 'doc stats' from engine shard state [RECOVERING]
org.elasticsearch.index.shard.IllegalIndexShardStateException: [index-name][4] CurrentState[RECOVERING] operations only allowed when started/relocated
    at org.elasticsearch.index.shard.service.InternalIndexShard.readAllowed(InternalIndexShard.java:765)
Are we doing anything wrong, or has anyone else experienced this?
Thanks,
Jenny