A few days ago, our cluster started reporting that the vast majority of our
shards were unassigned. I found this odd, and even over the weekend there
were 0 relocating or initializing shards. This morning, there are a few
initializing shards, but the unassigned count has been sitting around 260
for the last 5 days. I tried restarting the cluster (which has usually been
the quickest way to shock it back into initializing shards), but no matter
what I try, I can't get the shards to load. So far, I've found nothing in
the logs.
To add to the strangeness, some nodes seem to disagree about who's in the
cluster. We've got a 12-node cluster running 0.19.3 on m1.xlarge instances.
The master (and most nodes) report that there are 9 machines in the
cluster, though the 3 missing nodes report that all 12 are present. I've
tried stopping and starting elasticsearch on each of the missing machines,
but that hasn't helped. As before, I haven't found anything useful in the
logs yet.
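
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the per-node views could be collected. It assumes each node exposes the REST API on port 9200 and that `GET /_cluster/health` on a node reflects the cluster that node believes it has joined. The helper names (`fetch_health`, `collect`, `group_by_view`) and the host list you pass in are hypothetical, not something from our setup.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_health(host, port=9200, timeout=5):
    """GET /_cluster/health from a single node and return the parsed JSON."""
    url = "http://%s:%d/_cluster/health" % (host, port)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def collect(hosts, port=9200):
    """Ask every host for its own view; unreachable hosts are recorded as {}."""
    views = {}
    for host in hosts:
        try:
            views[host] = fetch_health(host, port)
        except (OSError, ValueError):
            # Connection failure or unparsable response: keep an empty view
            # so the host still shows up in the grouping below.
            views[host] = {}
    return views


def group_by_view(health_by_host):
    """Group hosts by the (number_of_nodes, unassigned_shards) they report."""
    groups = defaultdict(list)
    for host, health in sorted(health_by_host.items()):
        key = (health.get("number_of_nodes"), health.get("unassigned_shards"))
        groups[key].append(host)
    return dict(groups)
```

Calling `group_by_view(collect(hosts))` with the 12 hostnames should return more than one group if the nodes really disagree about membership, which makes it easy to see which machines share which view.
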
This cluster has been fairly reliable in general, so I'm hoping someone
recognizes these symptoms as a known issue.
Here's the gist of what the instances are