Reindexing and error recovery


(Ivan Brusic) #1

Excuse me in advance for the long post. Up until now, I have avoided
rebuilding from source but now is the time to finally dive in.

I am trying to determine reindexing best practices. Using aliases appears to be
the suggested solution for building updated indices in parallel with searching
an existing index. Searchers always use the same index, while the indexers
create a new index whenever a reindex is needed, and the alias is then pointed at it.
While the new index is being created, do you stop indexing on the existing
index? Scan acts like a cursor, so new inserts after the scan was created
do not get picked up.

Does anybody use a river and reindex data? A river is associated with an
index at creation. If the above recipe holds true, I need to delete the
river and then create a new one with the updated index name. Not the best
scenario since I will lose updates, but the data is not mission critical.

On the topic of reindexing, I also need to recover data from an incorrect
node. I switched a single-node elasticsearch server from using monit to
using the service wrapper. On the same day that I made the migration,
something happened on the box (low memory?) that caused the wrapper to think
that ES was not running and started another instance. The default settings
were in place (1 replica, 5 shards), so the new instance took over some of
the shards. Upon noticing the immense slowdown on the machine due to two
instances running, I killed both processes and restarted ES. Of course, the
data from the shards that moved to the new, incorrect instance is missing. I
assume that the shards on the original instance were rebalanced, so I cannot
simply do a file copy.

My plan for data recovery is to move the second node data dir to a new
location, start a new non-clustered ES server and query from it. Two node
clients: one for each server (node). Is it possible to restart two nodes and
gracefully tell one of them to accept all merge data? Probably not.

--
Ivan


(Shay Banon) #2

On Tue, Jul 26, 2011 at 3:34 PM, Ivan Brusic ivan@brusic.com wrote:

I am trying to determine reindexing best practices. Using aliases appears to be
the suggested solution for building updated indices in parallel with searching
an existing index. Searchers always use the same index, while the indexers
create a new index whenever a reindex is needed, and the alias is then pointed at it.
While the new index is being created, do you stop indexing on the existing
index? Scan acts like a cursor, so new inserts after the scan was created
do not get picked up.

I think you mean search working against one alias, and then switch the alias
from the old index to the new index, right?
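A minimal sketch of the alias switch described here, assuming the request body is sent to the `_aliases` endpoint as a single POST so that the remove and add happen together. The alias and index names (`articles`, `articles_v1`, `articles_v2`) and the helper name `alias_swap_actions` are hypothetical and only serve the illustration.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build a body for POST /_aliases that moves `alias` in one request.

    Both actions travel in the same request, so searchers using the alias
    never see a moment where it points at nothing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: searchers query "articles" throughout.
    body = alias_swap_actions("articles", "articles_v1", "articles_v2")
    print(json.dumps(body, indent=2))
```

The old index can be deleted once the switch has been verified.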

Does anybody use a river and reindex data? A river is associated with an
index at creation. If the above recipe holds true, I need to delete the
river and then create a new one with the updated index name. Not the best
scenario since I will lose updates, but the data is not mission critical.

River is not really meant for this. You can write simple code that does the
reindexing using scan search. In the future, we might have a reindex API,
not a river.
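The "simple code" mentioned above could look roughly like the sketch below. It is not an official reindex routine: the function name `reindex`, the `pages` iterable, and the `send_bulk` callable are assumptions made for illustration. `pages` stands for the successive batches of hits returned by a scan/scroll search on the old index, and `send_bulk` stands for whatever client call submits a bulk request. Because a scan sees the index as it was when the scan started, writes made during the copy would have to be replayed separately.

```python
def reindex(pages, send_bulk, target_index, batch_size=500):
    """Copy every hit from `pages` into `target_index`; return the count.

    pages:     iterable of lists of hits shaped like search hits, i.e.
               {"_type": ..., "_id": ..., "_source": {...}}.
    send_bulk: callable that receives a list of (action, source) pairs
               and submits them as one bulk request (hypothetical hook).
    """
    batch, total = [], 0
    for page in pages:
        for hit in page:
            # Keep the original type and id so documents are overwritten,
            # not duplicated, if the copy is run twice.
            action = {
                "index": {
                    "_index": target_index,
                    "_type": hit["_type"],
                    "_id": hit["_id"],
                }
            }
            batch.append((action, hit["_source"]))
            if len(batch) >= batch_size:
                send_bulk(batch)
                total += len(batch)
                batch = []
    if batch:  # flush the final partial batch
        send_bulk(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    # In-memory stand-ins for the scroll pages and the bulk call.
    fake_pages = [
        [{"_type": "doc", "_id": "1", "_source": {"title": "a"}}],
        [{"_type": "doc", "_id": "2", "_source": {"title": "b"}}],
    ]
    sent = []
    print(reindex(fake_pages, sent.append, "articles_v2"))
```

Passing the network calls in as arguments keeps the batching logic easy to test without a running cluster.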

On the topic of reindexing, I also need to recover data from an incorrect
node. I switched a single-node elasticsearch server from using monit to
using the service wrapper. On the same day that I made the migration,
something happened on the box (low memory?) that caused the wrapper to think
that ES was not running and started another instance. The default settings
were in place (1 replica, 5 shards), so the new instance took over some of
the shards. Upon noticing the immense slowdown on the machine due to two
instances running, I killed both processes and restarted ES. Of course, the
data from the shards that moved to the new, incorrect instance is missing. I
assume that the shards on the original instance were rebalanced, so I cannot
simply do a file copy.

My plan for data recovery is to move the second node data dir to a new
location, start a new non-clustered ES server and query from it. Two node
clients: one for each server (node). Is it possible to restart two nodes and
gracefully tell one of them to accept all merge data? Probably not.

This won't work. Just start the two nodes on the same machine, wait for
green health, and then just kill the second one.
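One way to script the "wait for green health" step is sketched below, assuming a node is reachable over HTTP at `http://localhost:9200` (an assumption; adjust to your setup). It uses the cluster health endpoint with its `wait_for_status` and `timeout` parameters. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def health_url(base_url, timeout="60s"):
    """URL that asks the cluster to block until it is green or times out."""
    query = urlencode({"wait_for_status": "green", "timeout": timeout})
    return f"{base_url.rstrip('/')}/_cluster/health?{query}"


def is_green(health):
    """True only if the health response reports green without timing out."""
    return health.get("status") == "green" and not health.get("timed_out", False)


def wait_for_green(base_url="http://localhost:9200", timeout="60s"):
    """Call the health endpoint once and report whether green was reached."""
    with urlopen(health_url(base_url, timeout)) as resp:
        return is_green(json.load(resp))


if __name__ == "__main__":
    # Requires a running node; only stop the second instance if this is True.
    print(wait_for_green())
```

If the call returns False, the cluster had not reached green within the timeout, and stopping the second node at that point would risk losing the shards it holds.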



(Ivan Brusic) #3

On Tue, Jul 26, 2011 at 10:11 AM, Shay Banon shay.banon@elasticsearch.com wrote:

I think you mean search working against one alias, and then switch the
alias from the old index to the new index, right?

Correct. A scan will not include new inserts/updates, correct?

River is not really meant for this. You can write simple code that does the
reindexing using scan search. In the future, we might have a reindex API,
not a river.

Misspoke. My application indexes new content via a river and those indexes
need to be updated. I am not suggesting the use of rivers for reindexing.
Since the name of the index is specified at river creation time, the river
must be deleted and re-added.

This won't work. Just start the two nodes on the same machine, wait for
green health, and then just kill the second one.

Is that really all? Cannot wait to try it out. I will leave the current
directory structure as is and start a new instance via the service wrapper.
At this point I am not sure whether I should kill the second instance with
the service wrapper (will it knock out both?) or just kill the processes
(the second wrapper and instance).

Application is still in development mode (pet project, so it probably will
always be!). Small, underpowered EC2 instance. Will set replicas to 0 to
prevent further issues (after executing the above) while keeping the shards
at 5. Might need to start indexing a lot more content, so better machines
and a cluster might come in the near future. Currently using a local
gateway with ephemeral storage, switching to EBS.
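Dropping the replica count as described can be done on a live index through the update settings endpoint (`PUT /<index>/_settings`). The sketch below only builds the request body; the helper name `replica_settings` is hypothetical. The shard count is left out because it is fixed when an index is created and cannot be changed this way.

```python
import json


def replica_settings(replicas):
    """Body for PUT /<index>/_settings that changes only the replica count."""
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {"index": {"number_of_replicas": replicas}}


if __name__ == "__main__":
    # With zero replicas, a stray second node has no replica shards to take.
    print(json.dumps(replica_settings(0)))
```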

--
Ivan

