On Wed, Feb 5, 2014 at 1:17 PM, Zachary Tong <zacharyjtong@gmail.com> wrote:
> I'm unsure what happened to your cluster, but I don't believe it is
> related to the version upgrade. There should be no difference between
> 0.90.10 and 0.90.11 when it comes to recovery. A few thoughts:
We assumed the same, but the observed behavior suggests otherwise, no?
> - You should have three dedicated masters, so that a quorum is
> required for a node to become master. With only two dedicated masters, it
> is still possible to get a split-brain where each master thinks it is
> the "ruler" of the cluster. It's possible the inconsistency you were
> seeing in your indices is due to a split-brain - the master you were
> talking to was unaware of the index because you were in a split-brain
> arrangement.
I have 'discovery.zen.minimum_master_nodes: 2' set now. I assume, as
well, that this will help prevent such an occurrence. Unfortunately it
wasn't set before.
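For anyone else reading along, my understanding of the recommended setup is roughly the following (a sketch from the docs as I read them, not yet what our puppet config renders):

    # elasticsearch.yml on each of the three dedicated masters
    node.master: true
    node.data: false

    # quorum of dedicated masters: floor(3 / 2) + 1 = 2
    discovery.zen.minimum_master_nodes: 2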
> - Masters don't store data, so you won't see indexed data in the data
> path.
I know they don't store the data, but I should still see the index
directories there with their state files (and, post-downgrade, they are,
indeed, there).
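Concretely, the masters' data paths still show per-index metadata, something like this (layout from memory, so treat as approximate):

    $ ls data/<cluster_name>/nodes/0/indices/<index_name>/
    _state/    # index metadata; present even on non-data nodes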
> - gateway.recover_after_data_nodes is a useful setting to have
> configured when doing full-cluster restarts. It prevents allocation while
> your cluster is coming back online, which helps speed up the restart. Not
> really related to your problem, but I wanted to mention it.
We have had that set to half our cluster node size, rounded up (using
templates in puppet, so it auto-adjusts as we add/remove nodes), since
doing the master/data node split (several months, at this point).
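Roughly what puppet renders for us, using a hypothetical 10-data-node cluster as an example (the numbers track cluster size, so they're illustrative):

    # elasticsearch.yml (rendered by puppet; numbers illustrative)
    # half the data nodes, rounded up: ceil(10 / 2) = 5
    gateway.recover_after_data_nodes: 5
    # plus a grace period before recovering without any stragglers
    gateway.recover_after_time: 5m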
> Did you see anything unusual in the logs? Perhaps references to dangling
> indices? Normally if you have indices in the data path that aren't
> represented by indices in the cluster state (e.g. the data is there, but
> the master does not think the index exists in the cluster state), you'll
> see warnings about "dangling indices". Those warnings will quickly go away
> as the "dangling" indices are re-added to the cluster state...you should
> see notices in the log stating something like "no longer dangling".
Nothing like that, and it wasn't a matter of impatience -- it went to green
and sat there for a half hour without showing even close to the appropriate
number of active_shards in the _cluster/health.
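For reference, the check I was watching was just the standard health endpoint (host is a placeholder):

    $ curl -s 'http://localhost:9200/_cluster/health?pretty'
    # watched "status" plus "active_shards" / "initializing_shards" /
    # "unassigned_shards" -- status went green while active_shards stayed low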
When I tried the "master data purge" to force it to regen, just the normal
"creating index" and "update_mapping" lines showed up in the logs, but
again, stopped far short of the active_shards that were there before.
> My gut says you were in a split-brain and your cluster state was weird.
> Make sure you have three dedicated masters, and set minimum_master_nodes
> to 2.
That's kind of my impression, too, but why the inability to recover even
after full stop/starts (some with master data dir purges, some not), and
then immediate recovery after the downgrade?
Also...why have 3/use 2? Is the "unused" 3rd still tracking things, or is
it just sitting there dormant? (I rather assumed it was closer to the
latter, and thus planned to just spin up another EC2 node if I had an
unrecoverable death of one of the masters)
-jv