Just pushed: support for reusing work dir

(Shay Banon) #1


I just pushed initial support for reusing the work directory when recovering
from the gateway. This might sound cryptic, so I will try to explain it in this
email a bit more, but first, an important notice.

The code pushed is not a full implementation of the feature; there are
still some minor features left to implement, which I will explain later.
Also note that, sadly, a full reindex is required in order to work with
master, and there is a slight chance that another full reindex will be
required once the remaining minor features land. I expect the feature to be
code complete in the next few days.

Last, the cloud plugin is not usable at the moment (it wasn't very usable
before either, due to some bugs in jclouds...). This will take a week or so
to implement properly.

So, why do we need this feature? The way elasticsearch works now is that
the index is stored locally on a node (in an index store module, which can
be either memory or file system based). The content of the local node index
(for each primary shard) is snapshotted to a gateway (if one is configured)
for long-term persistence. When a full cluster shutdown takes place and the
cluster is brought back up, the index files stored in the gateway are
fetched and used as a starting point. This takes time, especially for large
indices or slow gateway modules. But if the local node index storage is
file system based (or partially file system based), one can optionally reuse
the local files on each node and not fetch them from the gateway.

This sounds pretty simple, but it requires several changes to how
elasticsearch works. First, each node used to work against a local file
system location that included its unique nodeId. Since that is a random UID,
it doesn't cut it if local files are to be reused when recovering from the
gateway. For this reason, each node started now tries to acquire
[work]/nodes/[running_number]. The running_number starts from 0 and
increments if there is already a running node on the same machine using it.
This means that after a full cluster shutdown, nodes will reuse the same
location they used to work with.
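To make the numbering scheme concrete, here is a minimal Python sketch of the idea (the real implementation is Java and uses OS-level file locks that are released on process exit; the `node.lock` file name and `acquire_node_dir` helper here are hypothetical, just to illustrate the first-free-slot behavior):

```python
import os

def acquire_node_dir(work_dir):
    """Return the first [work]/nodes/[n] directory not held by a running node.

    Sketch only: a `node.lock` file marks a slot as in use, created with
    O_EXCL so two processes cannot grab the same slot. A real implementation
    would use an OS file lock that disappears when the process dies, so that
    after a full shutdown each node reacquires the same slot it had before.
    """
    n = 0
    while True:
        path = os.path.join(work_dir, "nodes", str(n))
        os.makedirs(path, exist_ok=True)
        lock = os.path.join(path, "node.lock")
        try:
            # O_CREAT | O_EXCL fails if another running node holds this slot
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return path
        except FileExistsError:
            n += 1  # slot taken, try the next running_number
```

Two nodes started on the same machine would end up in `nodes/0` and `nodes/1`; once the first shuts down and releases its lock, the next node started lands back in `nodes/0`.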

Another requirement is to smartly allocate shards so that they reuse as
many index files as possible on the node they are assigned to. This required
enhancements to the shard allocation algorithm. But it does not stop
there... In elasticsearch, once a single node is started, the cluster
metadata will be recovered and shard allocation will commence. This is not
what we want, especially with a large enough cluster, since reusable data
might exist on other nodes that have not started yet.

For this reason, a new setting, `gateway.recover_after_nodes`, can be
set. It controls when the cluster metadata will be read and shard
allocation will begin, with respect to the number of currently discovered
nodes. Of course, while not all nodes have started, operations against
the cluster should be blocked. To enforce this, a new feature, the
ability to create "cluster blocks", has been added. Currently, it is used
internally, but in the near future, it will be exposed as an api (the
ability to block "write" operations against the cluster, or against specific
indices, for example).
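As an example, on a five-node cluster you might hold off recovery until at least three nodes have joined. Exact settings file format aside, the node configuration would look something along these lines:

```yaml
gateway:
    recover_after_nodes: 3
```

Until three nodes are discovered, cluster metadata is not read, shard allocation does not start, and the cluster blocks mentioned above keep operations from hitting a half-formed cluster.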

One of the reasons why a full reindex is required is the fact that for
each file stored in the gateway, another md5 file is now stored. This helps
identify whether a certain index file can be reused instead of being fetched
from the gateway. Another reason is that I have taken the opportunity to
build a new mechanism for the translog handling, allowing it to work in a
more "streaming" fashion, better able to handle large transaction logs.
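The reuse decision the md5 files enable can be sketched like this in Python (the actual code is Java; `can_reuse` and the chunk size are illustrative assumptions, but the check itself is just "does the local file's md5 match the checksum stored alongside the file in the gateway"):

```python
import hashlib
import os

def md5_of(path):
    """MD5 hex digest of a file, computed incrementally in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def can_reuse(local_path, gateway_md5):
    """Reuse the local copy of an index file only if it exists and its
    md5 matches the checksum stored next to the file in the gateway;
    otherwise the file has to be fetched from the gateway."""
    return os.path.exists(local_path) and md5_of(local_path) == gateway_md5
```

A missing local file, or one whose content diverged from the gateway snapshot, fails the check and falls back to a normal fetch, so reuse is purely an optimization.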

Last, a big refactoring was done in the different gateway
implementations. Now, there is a nicer abstraction for a gateway
implementation, and different gateways share much more code than before. The
cloud plugin has not been refactored to make use of this yet, and a new
implementation of it is planned for 0.9.

What do we have left? A bit more testing on my end, and a cloud plugin
reimplementation. Once this is done, and maybe some additional minor
features, 0.9 will be released.

Hope you are all with me this far, and that I made some sense (it's
late...). Any suggestions, comments, or questions are more than welcome.
