Copying documents between indices


(Matthew A. Brown) #1

Hi all,

Sorry for the rapid-fire posts, but here's another topic I'd like to
bring up. As I'm working on a rapidly evolving application with
ElasticSearch as a central component, we often need to make mapping
changes that cannot be applied in place via the put mapping API.
Accordingly, we use index aliases (a most excellent feature, I must
say) to set up a migration framework.

Essentially, let's say I have a semantic index 'my_index'. At any
given time, 'my_index' will be aliased to a real index with a
timestamp appended, e.g. 'my_index-123456789'. If we need to push a
mapping change that can't be done in-place, we create a new index, say
'my_index-1319658279'. We perform a scan search over the old index,
writing the documents to the new index as we go. When that's done, we
re-alias 'my_index' to the newly created index, and delete the old
one.
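
As a rough illustration of the naming and re-aliasing steps described above, here is a minimal Python sketch. It only builds the index name and the request body one would send to the `_aliases` endpoint; it does not talk to a cluster. The helper names `timestamped_index` and `alias_swap_actions` are hypothetical and not part of any ElasticSearch client.

```python
import json
import time


def timestamped_index(alias, now=None):
    """Return a concrete index name such as 'my_index-1319658279'.

    `now` is a Unix timestamp; it defaults to the current time.
    """
    ts = int(time.time() if now is None else now)
    return f"{alias}-{ts}"


def alias_swap_actions(alias, old_index, new_index):
    """Build a request body for POST /_aliases that repoints `alias`.

    Both actions are sent in one request, so the alias is removed from
    the old index and added to the new one together.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    new_index = timestamped_index("my_index", now=1319658279)
    body = alias_swap_actions("my_index", "my_index-123456789", new_index)
    print(json.dumps(body, indent=2))
```

Deleting the old index would be a separate request issued only after the alias swap has succeeded.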

This works great, but it's not especially fast, even using bulk
writes. My guess is that a lot of the overhead is just moving data
over the wire via HTTP. Since the migration involves no business
logic, it seems reasonable that ES could expose functionality to do
the same thing locally.
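
To make the over-the-wire cost concrete, here is a hedged sketch of how a batch of scan-search hits might be turned into a bulk request body on the client side. The function name `bulk_body` is hypothetical, and the hit layout (`_id`, optional `_type`, `_source`) is an assumption about what the search response contains. Every document's full source is serialized again and sent back over HTTP, which is the round trip a server-side copy would avoid.

```python
import json


def bulk_body(hits, target_index):
    """Turn search hits into a newline-delimited bulk request body.

    Each hit becomes two lines: an "index" action line that names the
    target index and document id, followed by the document source.
    """
    lines = []
    for hit in hits:
        meta = {"_index": target_index, "_id": hit["_id"]}
        # Carry the document type across when the hit includes one.
        if "_type" in hit:
            meta["_type"] = hit["_type"]
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(hit["_source"]))
    # The bulk format expects every line, including the last, to end
    # with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample_hits = [
        {"_id": "1", "_type": "doc", "_source": {"title": "first"}},
        {"_id": "2", "_type": "doc", "_source": {"title": "second"}},
    ]
    print(bulk_body(sample_hits, "my_index-1319658279"), end="")
```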

Essentially, what I have in mind is API endpoints that expose:

  • Copy a document from one index to another
  • Copy the entire contents of one index to another

I'm happy to explore writing this sort of functionality as a plugin,
but I'm wondering if anyone has any thoughts on the idea.

Thanks!
Mat


(David Pilato) #2

Seems to be related to https://github.com/elasticsearch/elasticsearch/issues/1077

I started writing something a week ago, but while I was writing the river, I began to wonder whether a simple "Java batch" wouldn't be better than a river.

That's what I have in mind for now.

David

On 26 Oct 2011 at 21:47, "Matthew A. Brown" mat.a.brown@gmail.com wrote:


(Matthew A. Brown) #3

Ah, I see I'm far from the first to have this need. Agreed with you
(and with Shay's comments in a previous thread) that just doing this
all inside the ES process seems to make the most sense. I'm happy to
try to take this on as a plugin and I think I should have some
bandwidth at work to do it. Presumably one would not want it to be a
blocking call (at least by default) so that will probably be one
challenge : )

On Wed, Oct 26, 2011 at 16:36, David Pilato david@pilato.fr wrote:

