Reindexing with new mapping


(Curtis Caravone) #1

We are in a situation where we need to reindex a few hundred million docs
(add some indexed fields, add some new fields). We hope to do this online
by changing the mapping then performing updates on all the old docs to
reindex them with the new mapping.

Along these lines, I have a couple of questions:

  1. Can a field mapping be changed from "index":"no" to indexed using the
    put mapping API?
  2. Can the default analyzer be changed with the put mapping API?

thanks,

Curtis


(David Pilato) #2

Hi Curtis,

If it's not possible to update the mapping as you describe (and in my opinion it isn't), note that I have opened a feature request to allow recreating data from one ES instance in another one with a river.

Not sure I have enough ES background to start writing it... :frowning:

David :wink:


(Shay Banon) #3

You can't change a field from not being indexed to being indexed. You can
change the default analyzer (by closing the index, updating the index
settings, and then opening it), but this will only affect documents indexed
afterwards.

If you end up reindexing the data, why not just index it into a new index
with the new mappings?
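
For reference, the close / update settings / open sequence mentioned above could be sketched roughly as follows. This is only an illustration: the index name `my_index`, the `standard` analyzer type and the helper name `default_analyzer_requests` are assumptions, and the function only builds the REST calls rather than sending them.

```python
import json


def default_analyzer_requests(index, analyzer_type):
    """Build the REST calls for changing an index's default analyzer.

    Nothing is sent over the network; the caller decides how to issue
    each (method, path, body) tuple.
    """
    settings = {"analysis": {"analyzer": {"default": {"type": analyzer_type}}}}
    return [
        ("POST", f"/{index}/_close", None),                    # close the index
        ("PUT", f"/{index}/_settings", json.dumps(settings)),  # update settings
        ("POST", f"/{index}/_open", None),                     # reopen the index
    ]


if __name__ == "__main__":
    # "my_index" and "standard" are placeholder values.
    for method, path, body in default_analyzer_requests("my_index", "standard"):
        print(method, path, body or "")
```

Documents that were indexed before the change keep whatever analysis they were indexed with, which is why a full reindex is still needed for them.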


(Curtis Caravone) #4

Ok, thanks. In that case I will go the route of indexing into a new index.

Curtis


(Curtis Caravone) #5

Ok, I'm going with the strategy of creating a new index, but I want to do
the reindexing all online using aliases.

That leads to a couple of alias questions:

  1. How are doc ids treated when you do a get (or search) operation on an
    alias with multiple underlying indices?
  2. What happens if two docs with the same id exist in two of the underlying
    indices? Is there some precedence or order to which doc is returned?
  3. Does the index status API work with aliases? For example, can I wait
    for yellow status on an alias rather than listing all the underlying
    indices?

Thanks again,

Curtis


(Shay Banon) #6

I think you are trying to use the aliases wrongly. When you reindex and want
to do a "hot" replace of indices, you use a single alias pointing to a single
index as the one the "client" uses. For example, have alias1 point to
index1. Then, you reindex the data into index2, and once it's done, you
switch (in the same command) alias1 to point to index2 instead of index1.

In the above use case, there is no point at which an alias is pointing to more
than one index. Of course, you can have an alias point to more than one
index, but it only really makes sense when searching, and in this case, it's
the same as searching across several indices, where the index is part of the
"uniqueness" of the document.

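The single-command alias switch described above could look roughly like this. It is a minimal sketch that only builds the JSON body for one `_aliases` request, using the alias1 / index1 / index2 names from the example; the helper name `swap_alias_body` is made up for illustration.

```python
import json


def swap_alias_body(alias, old_index, new_index):
    """Return the JSON body for one `_aliases` request that moves an alias.

    Both actions travel in the same request, so clients never see the
    alias pointing at zero or two indices.
    """
    actions = [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]
    return json.dumps({"actions": actions})


if __name__ == "__main__":
    # POST the printed body to /_aliases once index2 is fully populated.
    print(swap_alias_body("alias1", "index1", "index2"))
```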

(Curtis Caravone) #7

Ok, I think that answers my questions.

The reason I proposed using multiple indices with aliases is that I want to
be able to take in new data while the migration is taking place. For
example:

index_old: index with old mapping and data
index_new: index with new mapping

At some point, switch writes to index_new, so it starts filling up with new
data.
Then, migrate data from old to new while the system is online and still
taking new data.

I want to be able to search across both old and new during this process,
similar to an online rebuild of a database index.

Is there a better way I could be doing this?

Curtis


(Shay Banon) #8

Taking in new data while the old index is still around can be problematic if
you have updates / deletes. You might still need to apply the changes to the
old index, and also buffer them while you reindex so that you can apply them
again afterwards, or something like that.
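
One way to picture the "buffer and apply again" idea is the in-memory sketch below. It is not Elasticsearch code: indices are modelled as plain dicts, and the `ChangeBuffer` class and its method names are hypothetical. It only shows the ordering of operations under the assumption that the reindex job copies a point-in-time view of the old index.

```python
class ChangeBuffer:
    """Record writes made during a reindex so they can be replayed later."""

    def __init__(self):
        self._ops = []  # ordered list of (op, doc_id, source)

    def index(self, doc_id, source):
        self._ops.append(("index", doc_id, source))

    def delete(self, doc_id):
        self._ops.append(("delete", doc_id, None))

    def replay(self, target):
        """Apply the buffered operations, in order, to a dict-like index."""
        for op, doc_id, source in self._ops:
            if op == "index":
                target[doc_id] = source
            else:
                target.pop(doc_id, None)
        self._ops.clear()


if __name__ == "__main__":
    old = {"1": {"v": 1}, "2": {"v": 1}}
    snapshot = dict(old)      # what the reindex job will copy
    buf = ChangeBuffer()

    # Live traffic during the reindex: applied to the old index and buffered.
    old["1"] = {"v": 2}
    buf.index("1", {"v": 2})
    old.pop("2")
    buf.delete("2")

    new = dict(snapshot)      # the reindex finishes with stale data
    buf.replay(new)           # replaying the buffer brings it up to date
    print(new == old)
```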


(Curtis Caravone) #9

Good point, I'll have to think about this some more.

Curtis


(Stephen Beeson) #10

Hi Guys,
I found this thread a useful starting point for handling my own similar
issues. I did come up with another solution but wanted to run it by you to
see if you can spot any issues with it.

The basic concept is to perform all reads of an index with one alias and
all writes with another, e.g. index_read and index_write.
When a reindex is performed, another index would be created and added to
the index_write alias:

index_read -> old_index

index_write ->  old_index
                      new_index

This way all updates come through on the new index while it is being
populated and the old index continues to return newly updated documents.

Once the reindexing is complete, I'd remove the old index from the read alias
and add the new one to it, and remove the old index from the write alias.
Then I could delete the old index and it should all be quite seamless.
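
The two alias changes in this scheme could be sketched as the request bodies below. This is an illustration only: the helper names are hypothetical, the alias and index names are the ones used above, and the functions just build dicts to be sent to `_aliases`. The thread does not establish whether a single index request can be sent through an alias that covers two indices, so the client code may need to resolve the write alias and write to each index itself.

```python
def start_reindex_actions(new_index, write_alias="index_write"):
    """Alias change at the start: writes now target the new index as well."""
    return {"actions": [{"add": {"index": new_index, "alias": write_alias}}]}


def finish_reindex_actions(old_index, new_index,
                           read_alias="index_read",
                           write_alias="index_write"):
    """Alias change at the end: reads move over and the old index drops out."""
    return {"actions": [
        {"remove": {"index": old_index, "alias": read_alias}},
        {"add": {"index": new_index, "alias": read_alias}},
        {"remove": {"index": old_index, "alias": write_alias}},
    ]}


if __name__ == "__main__":
    print(start_reindex_actions("new_index"))
    print(finish_reindex_actions("old_index", "new_index"))
```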

I have my own Python code to access Elasticsearch, so the code that
references the two aliases is all in one place.

Does this seem like a reasonable approach?

I think this could be made even easier if there was a mechanism similar to
the "search_routing" and "index_routing" features. Perhaps the aliases
could have add_search and add_index actions. Kimchy, does this sound like a
sensible feature?

Thanks
Stephen


(benny.sadeh) #11

I can see two problems with your scheme:

  1. what about a delete of a doc before it was reindexed?
  2. what about a read of an updated doc before the reindexing is done?
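
One possible reading of the first problem can be shown with a small in-memory model. This is not Elasticsearch code: indices are plain dicts, the function name is made up, and it assumes the copy job reads from a point-in-time view of the old index, which the thread does not state.

```python
def reindex_with_dual_writes(delete_during_copy):
    """Model a reindex where live writes go to both old and new indices."""
    old_index = {"a": "v1", "b": "v1"}
    new_index = {}
    write_targets = [old_index, new_index]  # stands in for index_write

    # Assumption: the copy job works from a snapshot of the old index.
    snapshot = dict(old_index)

    if delete_during_copy:
        for index in write_targets:
            # On new_index this is a no-op, since "b" has not been copied yet.
            index.pop("b", None)

    # The copy job now writes the (stale) snapshot into the new index.
    for doc_id, source in snapshot.items():
        new_index[doc_id] = source
    return old_index, new_index


if __name__ == "__main__":
    old, new = reindex_with_dual_writes(delete_during_copy=True)
    print("old:", sorted(old), "new:", sorted(new))
```

In this model the deleted document comes back in the new index, and an update that arrives before the copy would be overwritten by the stale copy in the same way.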


(Chris Male) #12

Hi Stephen,

I was wondering why you have chosen to separate reads and writes into
separate indexes. Is this for performance? Were you seeing search
performance degradation when indexing to the same index?


(Stephen Beeson) #13

No, I would only be using separate indexes while we were reindexing. This
setup is to allow us to reindex safely in a live system, similar to Kimchy's
post:

"When you reindex and want to do "hot" replace of indices, you use a single
alias pointing to a single index as the one the "client" uses. For example,
have alias1 point to index1. Then, you reindex the data into index2, and
once it's done, you switch (in the same command) alias1 to point to index2
from index1."
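
For reference, the single-command switch described in that quote could be sketched roughly as below. This is only an illustration, assuming the cluster exposes the index aliases endpoint (`_aliases`) with `add` and `remove` actions; `build_alias_swap` and `send_alias_swap` are hypothetical helper names, and the base URL is whatever your cluster happens to use. Only the request body is built here; the send helper is defined but never called.

```python
import json
import urllib.request


def build_alias_swap(alias, old_index, new_index):
    """Build an _aliases request body that moves `alias` in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def send_alias_swap(base_url, body):
    """POST the body to <base_url>/_aliases (assumed endpoint; not called here)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_aliases",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Names taken from the quoted example: alias1 moves from index1 to index2.
swap = build_alias_swap("alias1", "index1", "index2")
print(json.dumps(swap, indent=2))
```

Putting both actions in one request is the point: clients using alias1 should never see it pointing at zero or two indices.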

On Thursday, July 26, 2012 2:03:14 PM UTC+10, Chris Male wrote:

Hi Stephen,

I was wondering why you have chosen to split reads and writes across
separate indexes. Is this for performance? Were you seeing search
performance degradation when indexing to the same index?

On Tuesday, July 24, 2012 7:07:38 PM UTC+12, Stephen Beeson wrote:

Hi Guys,
I found this thread a useful starting point for handling my own similar
issues. I did come up with another solution, but wanted to run it by you to
see if you can spot any issues with it.

The basic concept is to perform all reads of an index with one alias and
all writes with another, e.g. index_read and index_write.
When a reindex is performed, another index would be created and added to
the index_write alias:

index_read  -> old_index

index_write -> old_index
               new_index

This way all updates come through on the new index while it is being
populated and the old index continues to return newly updated documents.

Once the reindexing is complete, I'd remove the old index from the read
alias and add the new one, and remove the old index from the write alias.
Then I could delete the old index and it should all be quite seamless.

I have my own Python code to access Elasticsearch, so the code that
references the two aliases is all in one place.

Does this seem like a reasonable approach?

I think this could be made even easier if there was a mechanism similar
to the "search_routing" and "index_routing" features. Perhaps the
aliases could have add_search and add_index actions. Kimchy, does this
sound like a sensible feature?

Thanks
Stephen



(Stephen Beeson) #14

Ok, so the detail that breaks all of this that I missed is:
"It is an error to index to an alias which points to more than one
index."

Not sure why this is, but it ends this experiment.
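
To make that constraint concrete, here is a minimal Python sketch of a client-side guard that refuses to write through an alias mapped to more than one index. It is only a model of the rule quoted above, not Elasticsearch's own implementation; `ALIASES`, `resolve_write_index`, and `AliasWriteError` are hypothetical names, and the alias table simply reuses the index_read / index_write layout from my earlier proposal.

```python
class AliasWriteError(Exception):
    """Raised when a write target does not resolve to exactly one index."""


# Hypothetical alias table mirroring the proposed read/write layout.
ALIASES = {
    "index_read": ["old_index"],
    "index_write": ["old_index", "new_index"],
}


def resolve_write_index(name, aliases=ALIASES):
    """Return the single concrete index behind `name`, or raise."""
    # A name that is not a known alias is treated as a concrete index.
    targets = aliases.get(name, [name])
    if len(targets) != 1:
        raise AliasWriteError(
            "cannot index through %r: it points to %d indices (%s)"
            % (name, len(targets), ", ".join(targets))
        )
    return targets[0]


for target in ("index_read", "index_write", "new_index"):
    try:
        print(target, "->", resolve_write_index(target))
    except AliasWriteError as error:
        print(target, "-> rejected:", error)
```

Searching through a multi-index alias is still fine; it is only the write path that needs a single, unambiguous target.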

On Thursday, July 26, 2012 1:44:32 PM UTC+10, benny.sadeh wrote:

I can see two problems with your scheme:

  1. What about a delete of a doc before it has been reindexed?
  2. What about a read of an updated doc before the reindexing is done?



(system) #15