Full cluster restart consistently fails to assign all shards


(Brent Reed) #1

I am running ES 1.0.1 (and have also verified the same problem with 1.1.0).

I have a cluster of 9 nodes: 8 are http/data nodes and 1 is http/master
(this is a dev/test cluster, so I am running with only one master). I create
a new index with 8 shards and no replicas, and populate the index.
Everything is running great. Then I do a full cluster restart. When
everything comes back up, all looks perfect EXCEPT that every time (this is
very consistent) a single shard doesn't get assigned. Yes, gateway is set to
local. I am using a completely stock config file apart from the paths
(data, logs, plugins) and the cluster name.

I can't find any information that tells me whether this is expected
behavior (I hope not) or how to resolve it. I have been systematically
changing settings in elasticsearch.yml to see if anything helps, but
nothing resolves the issue. I should note that when simulating a
production environment with rolling restarts, there is no issue. Still,
this just feels like incorrect behavior...

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1398e083-f978-4c33-9054-fa8ded0d754d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #2

Is it a primary or a replica? Is it in an initialising or relocating state?
Do the logs show anything?

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com
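
One way to answer the primary/replica and state questions is to look at the output of `GET /_cat/shards`. Below is a minimal Python sketch that picks out any shard that is not in the STARTED state. The column order is the default one as far as I know, and the sample text is made up for illustration (only the index name `bjr02` comes from this thread; the IPs and node names are hypothetical).

```python
from typing import Dict, List

# Assumed default column order of `GET /_cat/shards` (no custom headers).
COLUMNS = ["index", "shard", "prirep", "state", "docs", "store", "ip", "node"]


def parse_cat_shards(text: str) -> List[Dict[str, str]]:
    """Turn whitespace-separated _cat/shards rows into dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # Unassigned shards have no docs/store/ip/node columns,
        # so zip() simply stops after the fields that are present.
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def problem_shards(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return every shard whose state is anything other than STARTED."""
    return [row for row in rows if row["state"] != "STARTED"]


# Made-up sample output, not copied from a real cluster.
SAMPLE = """\
bjr02 0 p STARTED    0 99b 10.0.0.11 data1
bjr02 1 p UNASSIGNED
bjr02 2 p STARTED    0 99b 10.0.0.13 data3
"""

for row in problem_shards(parse_cat_shards(SAMPLE)):
    kind = "primary" if row["prirep"] == "p" else "replica"
    print(f"{row['index']}[{row['shard']}] {kind} {row['state']}")
```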


(Brent Reed) #3

It is a primary shard (I don't have any replicas on this particular test
cluster). I am seeing the following log entry that correlates with the
failure, but it doesn't tell me much...

[2014-03-31 13:47:20,691][DEBUG][gateway.local ] [http1] [bjr02][1]: not allocating, number_of_allocated_shards_found [0], required_number [1]
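
As far as I can tell, this message says the local gateway looked for existing on-disk copies of shard 1 of index bjr02, found 0, and needed at least 1 before it would allocate the primary. If there are many of these in the logs, a small Python sketch like the following can pull out the numbers. The regular expression is my own and only targets this exact message format; the function name is hypothetical.

```python
import re
from typing import Dict, Optional, Union

# Matches the "not allocating" DEBUG message quoted above.
# \s+ is used after commas so a wrapped and re-joined line still matches.
PATTERN = re.compile(
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]: not allocating,\s+"
    r"number_of_allocated_shards_found \[(?P<found>\d+)\],\s+"
    r"required_number \[(?P<required>\d+)\]"
)


def parse_not_allocating(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Extract index, shard, found and required counts, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "index": match.group("index"),
        "shard": int(match.group("shard")),
        "found": int(match.group("found")),
        "required": int(match.group("required")),
    }


# The log entry from this post, joined onto one line.
LINE = (
    "[2014-03-31 13:47:20,691][DEBUG][gateway.local ] [http1] "
    "[bjr02][1]: not allocating, number_of_allocated_shards_found [0], "
    "required_number [1]"
)

info = parse_not_allocating(LINE)
print(info)
```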

Here is a screenshot (head plugin) after index creation:

https://lh3.googleusercontent.com/-kVkJAa9rdY8/UznTcd0sF_I/AAAAAAABRxg/qmk6eUE7LGI/s1600/before_cluster_restart.gif

and after restart:

https://lh4.googleusercontent.com/-HnY3qHswsJM/UznVFpmIsAI/AAAAAAABRxs/XJY2WyvFWgs/s1600/after_cluster_restart.gif

A manual curl call to allocate shard 1 successfully adds it back into the
cluster, fully intact and working. There is no data in this particular
index, but I have verified the same thing with actual data, so this isn't
an index-corruption scenario.
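
For anyone wanting to try the same workaround: the exact call isn't shown here, so the following is only a sketch of what such a request body could look like for the 1.x cluster reroute API (`POST /_cluster/reroute` with an `allocate` command). The target node name `data3` is hypothetical, and `allow_primary` is an assumption on my part; the docs warn that forcing a primary this way can lose data if the chosen node does not actually hold the shard.

```python
import json
from typing import Any, Dict


def build_allocate_command(
    index: str, shard: int, node: str, allow_primary: bool = False
) -> Dict[str, Any]:
    """Build the JSON body for a 1.x-style reroute `allocate` command."""
    return {
        "commands": [
            {
                "allocate": {
                    "index": index,
                    "shard": shard,
                    "node": node,  # name of the data node to receive the shard
                    "allow_primary": allow_primary,
                }
            }
        ]
    }


# Index and shard number come from the log entry above; "data3" is made up.
body = build_allocate_command("bjr02", 1, "data3", allow_primary=True)

# This is the payload that would be POSTed to /_cluster/reroute.
print(json.dumps(body, indent=2))
```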


(Brent Reed) #4

After a little more digging into the error, it sure looks like
Elasticsearch is getting confused/broken when trying to recover. My 'http'
(master-only) nodes appear to be included in the recovery attempts,
resulting in the error posted above...

Perhaps I am jumping to conclusions here, but it sure smells like a bug to
me: a master-only node should not even be considered for recovery efforts.
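
To make the expectation concrete, here is a small Python sketch of which nodes I would expect to be candidates for holding a shard, based on the `node.master` and `node.data` settings (both default to true in 1.x when unset). The settings dicts are a hypothetical stand-in for each node's config, not a real API response, and `data1`/`data2` are made-up names. It is also possible that the `[http1]` prefix on the log line only identifies the node doing the logging, since allocation decisions are made on the elected master.

```python
from typing import Any, Dict


def as_bool(value: Any) -> bool:
    """Accept real booleans or the strings 'true'/'false' from a config file."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def is_data_node(settings: Dict[str, Any]) -> bool:
    """A node can hold shards unless node.data is explicitly false."""
    return as_bool(settings.get("node.data", True))


# Hypothetical per-node settings; only the name "http1" appears in the thread.
NODES = {
    "http1": {"node.master": True, "node.data": False},  # master-only
    "data1": {"node.master": False, "node.data": True},
    "data2": {"node.master": False},  # node.data left at its default
}

candidates = sorted(name for name, cfg in NODES.items() if is_data_node(cfg))
print(candidates)
```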


(system) #5