Getting occasional unassigned replica shards


(Mike Turner) #1

Hi there,

I'm running a fairly busy cluster, and after doing a full cluster restart
and some tuning of the configs, I'm occasionally seeing an unassigned
replica shard that never gets assigned. We generate a new index daily, and
last night, for the first time, a new index was created with a missing
replica shard that showed up as unassigned. I've since deleted that index
and let it be recreated from inbound data, and it came up the same way a
second time, with one replica unassigned.

So really, two issues here:

  1. On a cluster restart, four indices came up with an unassigned replica
    shard, and the cluster did not self-correct that condition.
  2. When a new index is generated, one replica shard is missing
    (reproduced twice now), and the cluster did not self-correct that
    condition.

With the help of a user in the #elasticsearch channel, I found a way to
reindex the older indices and get the missing replica shards back (using
the Tire library's index.reindex method in combination with an alias), and
I am in the process of reindexing the four older indices that had missing
replicas. Given the size of these indices and our current level of IO
activity, that process will take more than a week (~2.5 days per index).
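
For reference, the workaround looks roughly like this. It's only a sketch:
the index and alias names are invented for the example, and I'm going from
memory on the exact Tire Index#reindex and Tire::Alias calls, so treat the
signatures as approximate rather than authoritative.

    require 'tire'

    # Copy everything from the broken index into a fresh one. Tire does this
    # client-side with scan/scroll and bulk requests, which is why it takes
    # roughly 2.5 days per index at our size and IO load.
    # (Index names and settings here are examples, not our real ones.)
    Tire.index('logs-2013-10-29').reindex 'logs-2013-10-29-v2',
      settings: { number_of_shards: 6, number_of_replicas: 2 }

    # The application reads through an alias, so once the copy finishes we
    # repoint the alias at the new index and drop the old one.
    # (Exact Tire::Alias method names are from memory -- check the Tire docs.)
    read_alias = Tire::Alias.find('logs-current')
    read_alias.indices.delete 'logs-2013-10-29'
    read_alias.indices.push   'logs-2013-10-29-v2'
    read_alias.save

    Tire.index('logs-2013-10-29').delete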

I really want to get to the bottom of why this happened, and even more
importantly, why a new index would get created without all of the required
shards.

We haven't seen this happen before, so I suspect it's a product of the
tuning I did recently. Here's what changed (written out as a config sketch
after the list):

  1. Doubled the size of the cluster from 3 nodes to 6.
  2. Increased the shard and replica count from 3 shards to 6, and from 1
    replica to 2 at the same time.
  3. "index.routing.allocation.total_shards_per_node" : 3
  4. discovery.zen.minimum_master_nodes: 4
  5. gateway.recover_after_nodes: 4
  6. gateway.recover_after_time: 10m
  7. gateway.expected_data_nodes: 2
  8. gateway.expected_master_nodes: 6
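
Written out as config, the changed settings look like this on each node.
(This assumes the shard and replica defaults live in elasticsearch.yml
rather than being set per index at creation time; it's just the list above
consolidated so it's easier to eyeball.)

    # elasticsearch.yml -- settings changed during the recent tuning
    index.number_of_shards: 6
    index.number_of_replicas: 2
    index.routing.allocation.total_shards_per_node: 3

    discovery.zen.minimum_master_nodes: 4

    gateway.recover_after_nodes: 4
    gateway.recover_after_time: 10m
    gateway.expected_data_nodes: 2
    gateway.expected_master_nodes: 6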

Does anything in those settings stand out as a misconfiguration or a likely
culprit for the behavior I'm seeing? I haven't seen anything in the logs so
far that indicates a problem. Are there other data points that would be
useful for troubleshooting? I don't know how to reproduce the issue, so I'm
holding off on creating the gist the website asks for until I get some
feedback on what would actually be useful.

Thanks in advance for your help with this.

Michael Turner



(Mike Turner) #2

I should also note -- this is Elasticsearch 0.90.3 running on OEL6U2.



(system) #3