First, if you upgrade to 0.9, you will need to reindex your data. And do
upgrade to 0.9.
Second, let's get all this so-called "magic" attitude out of the way. Are you
suggesting that this information is being hidden intentionally? If not, then
please stop. If yes, then it takes about 5 minutes to do a search on the
mailing list / docs / talks and find the answers you are looking for.
In any case, let's write it again... When you create an index in
elasticsearch, the index is broken down into shards. A shard can have 0 or
more replicas. Shards and their replicas are allocated to different nodes,
while elasticsearch makes sure to not allocate a shard and its replica to
the same node.
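The allocation constraint above can be sketched roughly like this: a toy round-robin allocator in Python with made-up names, assuming at least as many nodes as copies per shard. This is illustrative only, not elasticsearch's actual allocation code.

```python
# Toy sketch of the constraint: a shard and its replicas never land on the
# same node. Names and logic are illustrative, not the real allocator.

def allocate(num_shards, num_replicas, nodes):
    """Assign each shard copy (copy 0 is the primary) to a node, round-robin,
    skipping nodes that already hold a copy of the same shard.
    Assumes len(nodes) >= num_replicas + 1."""
    assignment = {}  # (shard, copy) -> node
    i = 0
    for shard in range(num_shards):
        used = set()
        for copy in range(num_replicas + 1):
            # pick the next node that does not already hold this shard
            while nodes[i % len(nodes)] in used:
                i += 1
            node = nodes[i % len(nodes)]
            assignment[(shard, copy)] = node
            used.add(node)
            i += 1
    return assignment

placement = allocate(num_shards=2, num_replicas=1,
                     nodes=["node1", "node2", "node3"])
for (shard, copy), node in sorted(placement.items()):
    kind = "primary" if copy == 0 else "replica"
    print(f"shard {shard} {kind} -> {node}")
```

Running it shows each shard's primary and replica always on distinct nodes.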
Within a single shard and its replicas, a primary is chosen. One of the
primary's main purposes is to perform the scheduled snapshot operations from
the shard index to the gateway.
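As a sketch of that division of labor (hypothetical names, a list standing in for the gateway store; not the real scheduler):

```python
# Hypothetical sketch: within one shard group, only the elected primary
# pushes the scheduled snapshot to the gateway.

def run_scheduled_snapshot(shard_copies, gateway):
    """shard_copies: list of {'node', 'primary', 'docs'} dicts.
    gateway: a list standing in for the long-term store.
    Returns the node that performed the snapshot."""
    primary = next(c for c in shard_copies if c["primary"])
    # persist a copy of the primary's current state
    gateway.append({"node": primary["node"], "docs": list(primary["docs"])})
    return primary["node"]

gateway = []
copies = [
    {"node": "node1", "primary": True,  "docs": ["doc1", "doc2"]},
    {"node": "node2", "primary": False, "docs": ["doc1", "doc2"]},
]
print(run_scheduled_snapshot(copies, gateway))  # node1
print(len(gateway))  # 1
```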
When a primary shard is first allocated to a node, it performs a recovery
from the gateway. When a replica of the same shard is allocated, it recovers
its state from the primary. If another node is started, and the primary
needs to relocate to a different node to keep the number of shards balanced
across nodes, then it will do a hot relocation, not another recovery from
the gateway.
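The recovery paths above can be sketched as a small decision function (illustrative logic only, not elasticsearch's actual recovery code):

```python
# Rough sketch of the recovery rules described above: a primary allocated for
# the first time recovers from the gateway, a replica recovers from its
# primary, and a rebalanced shard relocates "hot" rather than recovering
# from the gateway again. Illustrative only.

def recovery_source(copy_is_primary, first_allocation, relocating):
    """Decide where a shard copy recovers its state from."""
    if relocating:
        # a shard moved for balancing relocates from its old node,
        # not through another gateway recovery
        return "existing copy on old node (hot relocation)"
    if copy_is_primary and first_allocation:
        return "gateway"
    # a replica recovers its state from the primary
    return "primary shard"

print(recovery_source(copy_is_primary=True, first_allocation=True, relocating=False))   # gateway
print(recovery_source(copy_is_primary=False, first_allocation=True, relocating=False))  # primary shard
print(recovery_source(copy_is_primary=True, first_allocation=False, relocating=True))   # existing copy on old node (hot relocation)
```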
If you want to know where things are allocated, there is a simple API, the
cluster state API, that gives you information about all the different
indices, shards, replicas, where they are allocated, what their state is and
so on.
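For example, walking the routing information in a cluster state response might look like this. The field names in the sample below are illustrative, not guaranteed to match any particular version; inspect the actual response from your own cluster's cluster state endpoint.

```python
# A sample of the kind of routing information the cluster state API returns.
# The field names here are illustrative only; check the real response on
# your cluster and version.
state = {
    "routing_table": {
        "indices": {
            "myindex": {
                "shards": {
                    "0": [
                        {"primary": True,  "state": "STARTED", "node": "node1"},
                        {"primary": False, "state": "STARTED", "node": "node2"},
                    ]
                }
            }
        }
    }
}

lines = []
for index, data in state["routing_table"]["indices"].items():
    for shard_id, copies in sorted(data["shards"].items()):
        for c in copies:
            kind = "primary" if c["primary"] else "replica"
            lines.append(f"{index}[{shard_id}] {kind}: {c['state']} on {c['node']}")
print("\n".join(lines))
```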
Back to the problems you described. First, the s3 gateway. In 0.8, jclouds
was used. If you follow the mailing list, you will see that I warned several
times that it seemed to be misbehaving and that I was going to replace it
in 0.9. In 0.9, I went with Amazon's official SDK, so hoping for better
things now...
Second, there were several bugs in 0.8 that were fixed in 0.9. One of the
major ones was that when the node where a primary shard was allocated was
shut down, a replica would not become primary (as it should); instead, a new
primary would be allocated, and a full recovery from the gateway would
happen. This is bad for two reasons: first, it means that allocation is much
slower, and second, if the gateway misbehaves, as in the jclouds case, the
recovery might not work.
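In other words, the fixed behavior is roughly the following (an illustrative sketch with made-up structures, not elasticsearch's actual failover code):

```python
# Sketch of the corrected failover: when the node holding the primary goes
# down, an existing replica is promoted instead of allocating a fresh primary
# that must do a full (and possibly flaky) gateway recovery. Illustrative only.

def handle_primary_node_failure(copies):
    """copies: list of {'node', 'primary', 'alive'} dicts for one shard group.
    Returns (action, recovery) describing what happens."""
    live_replicas = [c for c in copies if c["alive"] and not c["primary"]]
    if live_replicas:
        promoted = live_replicas[0]
        promoted["primary"] = True
        return ("promoted replica on " + promoted["node"],
                "no gateway recovery needed")
    # only if no replica survives must we fall back to the gateway
    return ("new primary allocated", "full recovery from gateway")

copies = [
    {"node": "node1", "primary": True,  "alive": False},  # node1 was shut down
    {"node": "node2", "primary": False, "alive": True},
]
print(handle_primary_node_failure(copies))
```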
Last, regarding being tested at scale: what you are testing is not scale (3
nodes). There are bugs, and they are being fixed, but the problems you were
having, for example, are covered by simple 3-node automatic integration
tests. As for scale, I do some testing on ec2, and other kind elasticsearch
users who run the system at scale also provide valuable information (not
magic) back to me and help fix any problems.
-shay.banon
On Thu, Jul 29, 2010 at 10:52 PM, David Jensen djensen47@gmail.com wrote:
Well, I actually had a complete cluster meltdown. I'm not sure when
the magic happened, but all of my documents were restored from the
Gateway, and it was fairly snappy too, but my index is only 21GB.
The magic restore either happened automatically or it happened when I
invoked http://localhost:9200/indexname/_refresh. My guess would be
the former.
On Jul 29, 12:16 pm, Berkay Mollamustafaoglu mber...@gmail.com
wrote:
You can't restore from the gateway directly. Basically, you'll need to read
all docs from the old index and write them to the new one.
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype
On Thu, Jul 29, 2010 at 3:15 PM, David Jensen djense...@gmail.com
wrote:
I guess the upside is that now that I've lost all 12M of my documents
I can upgrade to 0.9.0.
How do I restore the index from the gateway?
On Jul 29, 12:01 pm, David Jensen djense...@gmail.com wrote:
I was a little worried about the "magic" that Elasticsearch provides.
Here is an example why ...
After several days of indexing, I ran into an Exception reporting too
many open files. I fixed the issue on the system and restarted ONE of
my THREE nodes. Before the restart, I had 12.7M documents. After the
restart, I have 5.1M documents. I also have an S3 Gateway. It also
turns out (likely related to the file issue) that S3 snapshots were
failing, but only this morning, and I hadn't indexed anything new for
about 14-20 hours.
Losing indexed documents like this is worrisome and this isn't the
first time this has happened. When I first started playing with
Elasticsearch, I did a test where I loaded documents onto two servers
and dropped one of the servers. I lost half my documents.
I'm curious, how is Elasticsearch being tested at scale?