Getting occasional unassigned replica shards

Hi there,

I'm running a fairly busy cluster, and after doing a full cluster restart
and a bit of tuning to the configs, I'm seeing occasional replica shards
that stay unassigned and never get allocated. We generate a new index
daily, and last night, for the first time, I saw a new index get created
with a missing replica shard that showed up as unassigned. I've since
deleted that index and let it recreate from inbound data, and it came up
the same way a second time, with one replica unassigned.

So really, two issues here:

  1. After the cluster restart, four indices came up with an unassigned
    replica shard, and the cluster did not self-correct that condition.
  2. When a new index is created, one replica shard comes up missing
    (reproduced twice), and the cluster did not self-correct that condition.
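
For reference, the unassigned shards do show up in cluster health at the
shard level and in the cluster state routing table -- e.g. something like
this (localhost:9200 stands in for one of our nodes):

  # cluster health drilled down to shard level; unassigned replicas show up here
  curl -s 'http://localhost:9200/_cluster/health?level=shards&pretty=true'

  # the routing table in the cluster state lists every shard copy and its state/node
  curl -s 'http://localhost:9200/_cluster/state?pretty=true'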

I found a solution with the help of a user in the #elasticsearch channel,
who showed me how to reindex the older indices to get the missing replica
shards back (using the Tire library's index.reindex method in combination
with an alias), and I am in the process of reindexing the four older indices
that had missing replicas. That process will take more than a week (~2.5
days per index) given the size of the indices and the level of I/O activity
we currently have.
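
For anyone following along, the alias part is roughly the standard swap at
the REST level (the index names below are placeholders, not our real ones):

  # atomically repoint the alias from the old index to the freshly reindexed one
  curl -XPOST 'http://localhost:9200/_aliases' -d '{
    "actions": [
      { "remove": { "index": "events-old", "alias": "events" } },
      { "add":    { "index": "events-v2",  "alias": "events" } }
    ]
  }'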

I really want to get to the bottom of why this happened, and even more
importantly, why a new index would get created without all of the required
shards.

We haven't seen this happen before, so I suspect that it's a product of the
tuning that I recently did. Here's what changed:

  1. Doubled the size of the cluster from 3 nodes to 6.
  2. Increased the shard and replica count from 3 shards to 6, and from 1
    replica to 2 at the same time.
  3. index.routing.allocation.total_shards_per_node: 3
  4. discovery.zen.minimum_master_nodes: 4
  5. gateway.recover_after_nodes: 4
  6. gateway.recover_after_time: 10m
  7. gateway.expected_data_nodes: 2
  8. gateway.expected_master_nodes: 6
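
For context on how those numbers interact per daily index:

  6 shards x (1 primary + 2 replicas) = 18 shard copies
  18 shard copies / 6 nodes = 3 per node (right at the total_shards_per_node limit above)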

Does anything in those settings stand out as a misconfiguration or a
potential culprit for the behavior I'm seeing? I haven't seen anything in
the logs so far that indicates an issue. Are there other data points that
would be useful in troubleshooting this? I don't know how to reproduce it,
so I'm holding off on creating the gist that the website requests until I
get a bit of feedback on what's actually useful.
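
If concrete output would help, I can pull something like the following from
one of the nodes (localhost:9200 is a stand-in again, and INDEXNAME is a
placeholder for one of the affected daily indices):

  # overall cluster health
  curl -s 'http://localhost:9200/_cluster/health?pretty=true'

  # node-level info: versions, settings, and allocation-relevant attributes
  curl -s 'http://localhost:9200/_nodes?pretty=true'

  # effective settings on one of the affected daily indices
  curl -s 'http://localhost:9200/INDEXNAME/_settings?pretty=true'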

Thanks in advance for your help with this.

Michael Turner


I should also note: this is Elasticsearch 0.90.3 running on OEL6U2.
