My explanation probably wasn't detailed enough, sorry; I'll try to
clarify.
What I am trying to simulate is an unrecoverably lost index (in my
case each index uses only one shard).
The test I did consists of adding, for example, 8 documents: 4 to
index1 and 4 to index2, with two instances up and running, ES1 and
ES2.
I don't know the internals, but index1 ends up on one ES node, let's
say ES1, and index2 on ES2. If replicas are 0, then you have one index
(of one shard) on each ES node. That's case 2) in my other post.
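
To make the setup concrete, this is roughly how I created the two
single-shard indices and loaded them. The host, type name, and
document bodies are just placeholders, and the exact create-index
syntax depends on the ES version, so treat it as an approximation of
my test rather than the exact commands (replicas is 0 here for case 2;
it was 1 in case 1):

# create the two indices with 1 shard and 0 replicas each
curl -XPUT 'http://localhost:9200/index1/' -d '{
  "settings" : { "number_of_shards" : 1, "number_of_replicas" : 0 }
}'
curl -XPUT 'http://localhost:9200/index2/' -d '{
  "settings" : { "number_of_shards" : 1, "number_of_replicas" : 0 }
}'

# then 4 documents into each index, e.g.:
curl -XPUT 'http://localhost:9200/index1/doc/1' -d '{ "field" : "value 1" }'
curl -XPUT 'http://localhost:9200/index2/doc/1' -d '{ "field" : "value 2" }'
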
In case 1), as replicas were set to 1, each of the two ES nodes had a
copy of index1 and index2. So when I killed both nodes and then
brought back, say, ES1, it already contained index1 and index2. That's
why I had to manually delete index1 on ES1, in which case it was
recreated empty, index2 came back intact, and the whole (single-node)
cluster was up and running for work.
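
In case it wasn't clear, that "manual delete" of index1 was done
directly on disk while both nodes were still down, something along
these lines (the path assumes the default local gateway data layout
and the default cluster name, so it is just a sketch):

# on ES1, with both nodes stopped; adjust the path to your data location
rm -rf $ES_HOME/data/elasticsearch/nodes/0/indices/index1
# then start only ES1 again
$ES_HOME/bin/elasticsearch
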
Maybe my test wasn't the best one, but I hope you get the picture of
what I am trying to do. I would like to know if there is a command/
parameter/etc. to use as a last resort, like I said in my first post,
"bring back whatever you can find after 5 min", so the cluster won't
stay blocked. Sometimes it's more affordable to lose some data than to
lose the whole cluster just to avoid losing data.
I was thinking something like: "recover_unlock_cluster: 5m"
gateway:
    type: local
    recover_after_time: 2m
    recover_after_nodes: 2
    recover_unlock_cluster: 5m
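
If something like that existed, I could simply poll cluster health to
see when the block is lifted and the cluster is usable again, e.g.:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

which, as far as I can tell, currently just keeps reporting the
blocked/red state until every shard can be found.
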
Do you think it is possible?
Thanks,
Sebastian.
On Sep 30, 3:39 am, Shay Banon shay.ba...@elasticsearch.com wrote:
I don't really understand your tests to be honest. Why do you delete the
actual index from the relevant machine? It does not make sense. What are you
trying to simulate?
There is no specific index allocation to a specific node; indices are
created on all nodes, but shards are allocated between the different
nodes. So something like ES1->index1 is not really meaningful.
In case of no replicas, a specific index will be blocked until, between all
nodes, at least one copy of each shard can be found. Once it is found, it
will be recovered.
-shay.banon
On Thu, Sep 30, 2010 at 8:13 AM, Sebastian sgavar...@gmail.com wrote:
I did a couple of tests, and I found two different cases.
The common parameter used:
recover_after_nodes: 1
- If I set replicas to 1, add the documents to two different indices
on two nodes (ES1->index1-2, ES2->index1-2), kill both, pick ES1 for
example and delete its index1, then bring back only ES1, it
automatically recreates an empty index1, and index2 is fine with its
data. If I bring back ES2, it copies the empty index1 from the master
ES1, overwriting its own index1, which is expected.
- If I set replicas to 0, add the documents to two different indices
on two nodes (ES1->index1, ES2->index2) so each goes to a different
node, kill both, and bring back one, for example ES1, which contains
only index1, I get:
"error" : "ClusterBlockException[blocked by: [3/index not recovered
(not enough nodes with shards allocated found)];]"
So it doesn't try to recreate index2 as in the previous case. Is this
the default behaviour? If replicas are 0, don't recreate it?
Thanks,
Sebastian.
On Sep 30, 2:49 am, Shay Banon shay.ba...@elasticsearch.com wrote:
What's your configuration for each node?
On Thu, Sep 30, 2010 at 7:34 AM, Sebastian sgavar...@gmail.com wrote:
I am evaluating a couple of bad shutdown and recovery scenarios. I
indexed some documents in two indices in two ES instances, with no
replicas, so each instance holds a unique copy/shard/index.
ES1 -> Index1 -> {doc1,doc2,...,doc10}
ES2 -> Index2 -> {doc20,doc21,...,doc30}
If I kill both instances without a graceful shutdown and then bring
back only one, any query through the API throws a cluster state
exception preventing any operation; so far so good.
I would like to know if there are some configuration options that
allow me to bring the cluster up again, possibly losing some indices,
in a best effort to recover, as a trade-off against downtime. Maybe a
parameter like "bring back whatever you can find after 5 min", or a
cluster unblock command. Then I would manually schedule a full index
rebuild from DB storage, but with almost no downtime on the live
site.
To further elaborate: I know the idea is to have everything
replicated, so this case shouldn't happen, but if it does, for
whatever unplanned reason or error, I'd like to be able to accept some
lost data rather than have the whole cluster down and blocked and have
to manually delete all the indices and do a full index rebuild. Can
that be done?
Thanks,
Sebastian.