I'm running a cluster that's fairly busy, and after doing a full cluster
restart and a bit of tuning to the configs, I'm getting an occasional
unassigned replica shard that never seems to get assigned. We generate a
new index daily, and last night I saw for the first time a new index get
created with a missing replica shard and show up as unassigned. I've since
deleted that index and let it recreate from inbound data, and it created a
second time the same way, with one replica unassigned.
So really, two issues here:
- On a cluster restart, four indices came up with an unassigned replica
shard, and the cluster did not self-correct that condition.
- On a new index being generated, one replica shard is missing,
repeatable 2x, and the cluster did not self-correct that condition.
I found a solution with the help of a user in the #elasticsearch channel
who showed me how to reindex the older indices to get the missing replica
shard back (using the Tire library's index.reindex method in combination
with an alias), and I am in the process of reindexing the four older indices
that had missing replicas. That process will take more than a week (~2.5
days per index) given the size of the indices and our level of IO activity.
I really want to get to the bottom of why this happened, and even more
importantly, why a new index would get created without all of the required
replica shards.
We haven't seen this happen before, so I suspect that it's a product of the
tuning that I recently did. Here's what changed:
- Doubled the size of the cluster from 3 nodes to 6.
- Increased the shard and replica count from 3 shards to 6, and from 1
replica to 2 at the same time.
- "index.routing.allocation.total_shards_per_node" : 3
- discovery.zen.minimum_master_nodes: 4
- gateway.recover_after_nodes: 4
- gateway.recover_after_time: 10m
- gateway.expected_data_nodes: 2
- gateway.expected_master_nodes: 6
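For reference, here is a quick sanity check of the arithmetic these settings imply. It is only a rough sketch in Python: the function names are my own, and it assumes that total_shards_per_node is a per-index, per-node cap and that all six nodes hold data.

```python
# Values taken from the settings listed above.
NODES = 6
PRIMARY_SHARDS = 6
REPLICAS = 2
TOTAL_SHARDS_PER_NODE = 3  # index.routing.allocation.total_shards_per_node


def shard_copies(primaries, replicas):
    """Total shard copies one index needs: each primary plus its replicas."""
    return primaries * (1 + replicas)


def allocation_slack(nodes, primaries, replicas, per_node_limit):
    """Spare slots for one index; negative means some copies cannot be placed."""
    return nodes * per_node_limit - shard_copies(primaries, replicas)


if __name__ == "__main__":
    print("copies needed per index:", shard_copies(PRIMARY_SHARDS, REPLICAS))
    print("slack with all nodes up:",
          allocation_slack(NODES, PRIMARY_SHARDS, REPLICAS, TOTAL_SHARDS_PER_NODE))
    print("slack with one node down:",
          allocation_slack(NODES - 1, PRIMARY_SHARDS, REPLICAS, TOTAL_SHARDS_PER_NODE))
```

If that reading of the setting is right, each daily index needs 18 shard copies and the cap allows exactly 18 slots, so every node has to hold exactly three copies with no room to spare. I don't know whether that is related to what I'm seeing, but it is the tightest constraint in the list.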
Is there anything in those settings that stands out as a misconfiguration or
a potential culprit for the behavior I'm seeing? I haven't seen anything
in logging so far to indicate an issue. Are there other data points that
would be useful in troubleshooting this? I don't know how to reproduce it
so I'm skipping over creating the gist that the website requests for the
moment until I get a bit of feedback on what's actually useful.
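One data point I can collect is the shard-level output of the cluster health API (GET /_cluster/health?level=shards). Below is a rough sketch in Python of pulling out the shards that report unassigned copies. It assumes the response has already been fetched and parsed into a dict; the helper name, the sample response, and the index name are made up for illustration.

```python
def unassigned_shards(health):
    """Return (index, shard_id, unassigned_count) for shards with unassigned copies.

    `health` is the parsed JSON body of GET /_cluster/health?level=shards.
    """
    found = []
    for index_name, index_info in health.get("indices", {}).items():
        for shard_id, shard_info in index_info.get("shards", {}).items():
            count = shard_info.get("unassigned_shards", 0)
            if count > 0:
                found.append((index_name, int(shard_id), count))
    found.sort()
    return found


# Hypothetical, trimmed-down response used only to exercise the helper.
SAMPLE_HEALTH = {
    "status": "yellow",
    "indices": {
        "logs-2013.10.01": {
            "status": "yellow",
            "shards": {
                "0": {"status": "green", "active_shards": 3, "unassigned_shards": 0},
                "4": {"status": "yellow", "active_shards": 2, "unassigned_shards": 1},
            },
        }
    },
}

if __name__ == "__main__":
    for index_name, shard_id, count in unassigned_shards(SAMPLE_HEALTH):
        print(index_name, "shard", shard_id, "unassigned copies:", count)
```

I can post the real output for the affected indices if that would help.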
Thanks in advance for your help with this.