However, there is a (pretty big) race condition with this approach - while
reindexing, changes may not make it to the new index. I've looked all over
and haven't found a single solution to address this. The best attempt I've
seen is to buffer updates, but this is tedious and still leaves a race
condition (with a smaller window). My initial thoughts were to create a
write alias that points to the old and new indices and use versioning.
However, there is no way to write to multiple indices atomically.
It seems like this issue should affect most Elasticsearch users (whether
they realize it or not). Does anyone have a good solution to this?
1. Have index called foo_1392831890 with alias foo pointing to it
2. Create index called foo_1392841890 with new config
3. Scan/scroll everything from the foo alias into foo_1392841890.
4. Swap alias. Time has now warped backwards.
5. Run script to reindex everything that happened since I created
   foo_1392841890 from the system of record.
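The alias swap in step 4 can be sent as a single request, so searches against foo never hit a moment where the alias is missing. A minimal sketch of the request body for Elasticsearch's `_aliases` endpoint, using the index names from the steps above. The helper name `alias_swap_actions` is made up for illustration, and actually sending the body (for example as `POST /_aliases`) is left out:

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build a body for POST /_aliases that moves `alias` in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_actions("foo", "foo_1392831890", "foo_1392841890")
print(json.dumps(body, indent=2))
```

Note that this only makes the swap itself atomic. It does nothing about writes that reached the old index while the copy in step 3 was running.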
If you happen to use jobs to update your index you can pause them during
this process which would prevent things from going back in time. They'd
just stall instead.
Another option is to index into both indexes once they exist. At this
point you'd have to do it manually. I imagine that'd actually be a nice
feature for Elasticsearch to add though.
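Here is a rough sketch of what indexing into both indexes by hand could look like when combined with versioning. It is an in-memory stand-in, not client code: `VersionedIndex` and `dual_write` are hypothetical names, and the version numbers are assumed to come from the system of record (a revision counter or timestamp). The acceptance rule mimics Elasticsearch's external versioning (`version_type=external`), where a write is rejected unless its version is higher than the stored one:

```python
class VersionedIndex:
    """In-memory stand-in for an index that uses external versioning."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def index(self, doc_id, body, version):
        """Store the document only if `version` is newer than what is stored."""
        current = self.docs.get(doc_id)
        if current is not None and version <= current[0]:
            return False  # stale write, rejected
        self.docs[doc_id] = (version, body)
        return True


def dual_write(indexes, doc_id, body, version):
    """Send the same write to every index; the writes are NOT atomic."""
    return [index.index(doc_id, body, version) for index in indexes]


old, new = VersionedIndex(), VersionedIndex()
old.index("1", {"title": "draft"}, version=1)

# A live update arrives while the reindex is running and goes to both indexes.
dual_write([old, new], "1", {"title": "final"}, version=2)

# The bulk copy later tries to write the older copy it read from the old index.
accepted = new.index("1", {"title": "draft"}, version=1)
print(accepted, new.docs["1"])
```

Because the stale copy is rejected, the order in which the live update and the bulk copy arrive at the new index stops mattering. If one of the two writes fails, though, the client still has to retry it.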
That is actually what Clinton Gormley replied to me on Twitter when I asked
the same question. I believe it is the way to go, but there is still that
small window between your steps 4 and 5. A request for a document through
foo may return no result if that document was indexed while step 3 was
running. Do you agree?
Matthias
On Wednesday, February 19, 2014 21:35:56 UTC+1, Nikolas Everett wrote:
Here is how I do it:
1. Have index called foo_1392831890 with alias foo pointing to it
2. Create index called foo_1392841890 with new config
3. Scan/scroll everything from the foo alias into foo_1392841890.
4. Swap alias. Time has now warped backwards.
5. Run script to reindex everything that happened since I created
   foo_1392841890 from the system of record.
If you happen to use jobs to update your index you can pause them during
this process which would prevent things from going back in time. They'd
just stall instead.
Another option is to index into both indexes once they exist. At this
point you'd have to do it manually. I imagine that'd actually be a nice
feature for Elasticsearch to add though.
I tried to post a reply yesterday but it looks like it never made it.
Thank you all for the quick replies. Here's a slightly better explanation
of where I believe the race condition occurs.
When the scan/scroll starts, the alias is still pointing to the old index,
so updates go to the old index. Let's say you update Document 1. If the
scroll/scan has already passed Document 1, the new index never sees the
update. The three solutions you mentioned, Nik, are to:
1. Keep track of updates manually [tedious]
2. Pause the jobs that perform the updates [out of sync]
3. Send updates to both indexes [also tedious]
However, none of these seem ideal.
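To make the race concrete, here is a toy in-memory simulation. It assumes the scroll reads a fixed view of the index taken when the scan starts, which is how scroll results behave, so any write that lands on the old index after that point is missed. The function name `scroll_copy` and the plain dicts standing in for indexes are illustrative only:

```python
def scroll_copy(old_index, concurrent_write):
    """Copy old_index the way a scroll would, with one write landing mid-copy."""
    snapshot = dict(old_index)  # the scroll's view is fixed when the scan starts
    new_index = {}
    for position, (doc_id, body) in enumerate(sorted(snapshot.items())):
        new_index[doc_id] = body
        if position == 0:
            # The alias still targets the old index, so the write goes there only.
            write_id, write_body = concurrent_write
            old_index[write_id] = write_body
    return new_index


old_index = {1: "original", 2: "original"}
new_index = scroll_copy(old_index, concurrent_write=(1, "updated"))

# Documents whose latest state never reached the new index.
stale = sorted(k for k in old_index if new_index.get(k) != old_index[k])
print(stale)  # prints [1]
```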
Andrew
On Tuesday, February 18, 2014 8:41:18 PM UTC-8, Andrew Kane wrote:
However, there is a (pretty big) race condition with this approach - while
reindexing, changes may not make it to the new index. I've looked all over
and haven't found a single solution to address this. The best attempt I've
seen is to buffer updates, but this is tedious and still leaves a race
condition (with a smaller window). My initial thoughts were to create a
write alias that points to the old and new indices and use versioning.
However, there is no way to write to multiple indices atomically.
It seems like this issue should affect most Elasticsearch users (whether
they realize it or not). Does anyone have a good solution to this?
How about, while the scan is running, letting updates go to the old index
but with an extra field? Once the alias points to the new index, it's just
a query to fetch the documents carrying that extra field from the old index
and reindex them into the new one. If the alias change or new index creation
fails, update the old index to remove that extra field.
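A small in-memory sketch of that idea, with plain dicts standing in for the two indexes. The field name `changed_during_reindex` and both helper names are made up for illustration. In a real cluster the replay step would be a query on the extra field followed by indexing the hits into the new index:

```python
FLAG = "changed_during_reindex"  # hypothetical marker field


def write_flagged(index, doc_id, body):
    """Write to the old index while the scan runs, tagging the document."""
    index[doc_id] = {**body, FLAG: True}


def replay_flagged(old_index, new_index):
    """Copy every tagged document into the new index, dropping the tag."""
    replayed = []
    for doc_id, body in old_index.items():
        if body.get(FLAG):
            new_index[doc_id] = {k: v for k, v in body.items() if k != FLAG}
            replayed.append(doc_id)
    return replayed


old_index = {"1": {"title": "draft"}, "2": {"title": "other"}}
new_index = {doc_id: dict(body) for doc_id, body in old_index.items()}  # bulk copy

write_flagged(old_index, "1", {"title": "final"})  # update arrives during the copy
replayed = replay_flagged(old_index, new_index)     # run after the alias swap
print(replayed, new_index["1"])
```

This sketch only covers updates. A document deleted from the old index during the scan leaves nothing behind to tag, so deletes would need separate handling.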
On Friday, February 21, 2014 3:11:52 AM UTC-5, Andrew Kane wrote:
I tried to post a reply yesterday but it looks like it never made it.
Thank you all for the quick replies. Here's a slightly better explanation
of where I believe the race condition occurs.
When the scan/scroll starts, the alias is still pointing to the old index,
so updates go to the old index. Let's say you update Document 1. If the
scroll/scan has already passed Document 1, the new index never sees the
update. The three solutions you mentioned, Nik, are to:
1. Keep track of updates manually [tedious]
2. Pause the jobs that perform the updates [out of sync]
3. Send updates to both indexes [also tedious]
However, none of these seem ideal.
Andrew
Agree with the original poster that none of the existing solutions are
ideal. Making it simpler and safer to roll out revised mappings would be a
huge win if your use case involves incremental revisions/refinements to
your indexing strategies. A lossless solution would especially benefit the
case where ES is being used as the primary data source (an option we have
been considering), since you really don't want to drop a record in that
case.
On Monday, February 24, 2014 9:20:56 AM UTC-5, JoeZ99 wrote:
How about, while the scan is running, letting updates go to the old index
but with an extra field? Once the alias points to the new index, it's just
a query to fetch the documents carrying that extra field from the old index
and reindex them into the new one. If the alias change or new index creation
fails, update the old index to remove that extra field.
Short of using a river to feed both indexes the same stream of updates,
I doubt that you will find any solution. The good news is that while it's
tedious, as you pointed out, once set up it flows very smoothly. We use
RabbitMQ in our case.
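The general shape of that setup, as a toy sketch: every update is published once and buffered separately for each index, so the new index can catch up after its bulk copy. This is not our actual RabbitMQ code. `UpdateStream` and its methods are hypothetical, and the sketch assumes the new index subscribes before the bulk copy starts:

```python
from collections import deque


class UpdateStream:
    """Fan-out stream: each subscriber gets its own ordered buffer of updates."""

    def __init__(self):
        self.buffers = {}

    def subscribe(self, name):
        self.buffers[name] = deque()

    def publish(self, doc_id, body):
        for buffer in self.buffers.values():
            buffer.append((doc_id, body))

    def drain(self, name, index):
        """Apply all buffered updates for `name` to `index`, in order."""
        buffer = self.buffers[name]
        while buffer:
            doc_id, body = buffer.popleft()
            index[doc_id] = body


stream = UpdateStream()
old_index = {"1": "v1", "2": "v1"}
new_index = {}

stream.subscribe("old")
stream.subscribe("new")          # subscribe BEFORE the bulk copy starts
snapshot = dict(old_index)       # the view the bulk copy will read

stream.publish("1", "v2")        # update arrives while the copy is running
stream.drain("old", old_index)   # old index applies it right away

new_index.update(snapshot)       # bulk copy finishes with the stale value
stream.drain("new", new_index)   # buffered update brings it up to date
print(new_index == old_index)
```

The ordering matters: if the new index subscribed only after the snapshot was taken, the update published in between would be lost again.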
A possible future feature would be an API within index creation that lets
the new index shadow the indexing of one or more other indexes, without
having to duplicate writes client side or use a river. When the shadowed
index goes away, so could the shadowing, or an API call could delete the
shadow.
cheers,
Rob