Shards stuck in recovery for long periods


(robert-3) #1

Hi all,

I'm seeing some recent problems on master: shortly after a cluster reboot
and full recovery (all shards green, all nodes connected), a shard or two
will occasionally be marked as recovering (rebalancing is disabled, so it's
not because the shard is being moved around). After a few minutes they
sometimes report that they're initializing, which is an order of magnitude
longer than the transfer should take given our throttling settings and
network. In the end they never come back up and just remain in that state
until I manually reroute the shard or restart the server.

We were seeing similar problems on a larger scale on 0.90.3 (multiple
shards never recovering, all of them replicas of the same primary). I
thought I saw something about that being fixed on GitHub, but can't recall
the exact issue now. Any chance these are related? Is there something I can
do to debug what is going on while the shards say initializing/recovering?
Nothing is currently coming up in my logs.

Thanks,

Robert Deaton

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Alexander Reelsen) #2

Hey,

Is there anything in the logs, cluster state, or hot_threads output that
might add some information to this?

I am worried about the 'on occasion' part. If everything is good and you do
not change the number of nodes in your cluster, there should be no need to
mark a shard for recovery, so something must have happened at that point.
Is there anything in the logs from that time?

--Alex
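
For anyone wanting to collect the information asked about above, here is a minimal sketch (not part of the original reply). It assumes a node reachable at http://localhost:9200, that GET /_cluster/health?level=shards returns per-shard counters under "indices" -> "shards", and that GET /_nodes/hot_threads returns plain text. The helper names (fetch_json, fetch_text, non_active_shards, report) are hypothetical and chosen only for illustration.

```python
import json
from urllib.request import urlopen

# Assumption: a node of the cluster is reachable at this address.
BASE_URL = "http://localhost:9200"

# Per-shard counters expected in the health response at level=shards.
COUNTERS = ("initializing_shards", "relocating_shards", "unassigned_shards")


def fetch_json(path, base_url=BASE_URL):
    """GET a path on the node and decode the JSON body."""
    with urlopen(base_url + path, timeout=10) as resp:
        return json.load(resp)


def fetch_text(path, base_url=BASE_URL):
    """GET a path on the node and return the body as text."""
    with urlopen(base_url + path, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def non_active_shards(health):
    """Return (index, shard_id, counters) for every shard that has
    initializing, relocating, or unassigned copies."""
    found = []
    for index_name, index in sorted(health.get("indices", {}).items()):
        for shard_id, shard in sorted(index.get("shards", {}).items()):
            counts = {key: shard.get(key, 0) for key in COUNTERS}
            if any(counts.values()):
                found.append((index_name, shard_id, counts))
    return found


def report(base_url=BASE_URL):
    """Print shards that are not fully active, then the hot threads dump."""
    health = fetch_json("/_cluster/health?level=shards", base_url)
    for index_name, shard_id, counts in non_active_shards(health):
        print(index_name, shard_id, counts)
    print(fetch_text("/_nodes/hot_threads", base_url))
```

Calling report() while a shard is stuck in initializing/recovering, and again a few minutes later, gives two snapshots to compare and to attach to a reply alongside the relevant log lines.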

On Wed, Sep 4, 2013 at 1:45 AM, Robert Deaton robert@quizlet.com wrote:



(system) #3