How to reindex an ES index


(dpilato) #1

Hi,

Here is my use case:
I have an ES index (myindex)
In this index, I have one document type (myindex/mydoc)
I have defined a mapping for this type (myindex/mydoc/_mapping)

I put many documents in this index (with default options, so the JSON source is stored in _source).

I now want to modify my mapping and have ES reindex all my documents, based on the _source.

Is there any way to do that?
Do I have to write an ES river plugin to be able, for example, to reindex data into a new index based on the _source field of the documents in the first index?

Is that the purpose of the _refresh API? I don't think so.

Thanks for your feedback and advice,
David


(dpilato) #2

BTW, is it possible to remove the spam from fashionalwallet? It makes a mess of the web view of the mailing list:
http://elasticsearch-users.115913.n3.nabble.com/

Thanks...


(Clinton Gormley) #3

Hi David

I now want to modify my mapping and have ES reindex all my documents,
based on the _source.

Is there any way to do that?

The easiest way to do this is to script it using whatever API you're
using to access ES, i.e.:

  • create a new index with the new mapping
  • do a scrolled match_all search with search_type=scan of the old index
  • while the scrolled search is still returning results:
    • change the _source if you need to
    • bulk_index e.g. 1,000 docs to the new index
  • done
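
The loop above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the code from any particular client library: `scroll_hits` stands for whatever iterable your client returns for the scrolled search, and `bulk_index` for whatever call sends one batch to the new index. Both names, and the `reindex` helper itself, are hypothetical.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Doc = Dict[str, object]


def batches(docs: Iterable[Doc], size: int) -> Iterator[List[Doc]]:
    """Yield successive lists of at most `size` docs."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def reindex(
    scroll_hits: Iterable[Doc],
    bulk_index: Callable[[List[Doc]], None],
    transform: Optional[Callable[[Doc], Doc]] = None,
    batch_size: int = 1000,
) -> int:
    """Copy every hit to the new index in batches; return the doc count.

    `scroll_hits` and `bulk_index` are hypothetical stand-ins for the
    scrolled search results and the bulk call of your ES client.
    """
    total = 0
    for batch in batches(scroll_hits, batch_size):
        actions = []
        for hit in batch:
            source = hit["_source"]
            if transform is not None:
                # "change the _source if you need to"
                source = transform(source)
            actions.append({"_id": hit["_id"], "_source": source})
        bulk_index(actions)  # one bulk request per batch
        total += len(actions)
    return total
```

Keeping the search and bulk calls as parameters means the same loop works with any client, and it can be tried out with plain lists before pointing it at a real cluster.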

I think the Python, Ruby and Perl APIs all provide convenience methods to
help you with the process, e.g.:

http://blogs.perl.org/users/clinton_gormley/2011/04/elasticsearchpm-v036-now-with-extra-sugar.html

clint


(David Pilato) #4

Many thanks for this, Clint. I will try it (I just have to check whether I have Perl on my server :wink: !)

Would it make sense to create a ReplicateRiver to do that job with simple curl calls?
Something as simple as the replicate function in CouchDB...

Thanks again,
David.



(Shay Banon) #5

Why do you need a river for that? You can write it using the Java API as well.



(dpilato) #6
Why do you need a river for that? You can write it using the Java API as well.

Sure.

I was thinking of a simple admin tool to reindex all data into another ES cluster/index/type without writing a single line of code.

Nonsense ?


(Clinton Gormley) #7

Why are you reindexing?

Either you want to change the mapping, or the index settings (number of
shards, analyzers) or you want to change the data itself.

In order to do this, you need to be able to hook into various phases of the
reindexing process and apply some logic. This, by its very nature,
implies some custom code.

Typically, this is easier to implement in the client code.
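
As a rough illustration of what such client-side hooks might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: `apply_hooks`, `rename_field` and `drop_if_missing` are hypothetical helpers, not part of any ES client. Each hook receives a document's _source and returns either the (possibly modified) source or None to skip the document.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

Source = Dict[str, object]
Hook = Callable[[Source], Optional[Source]]


def apply_hooks(sources: Iterable[Source], hooks: Iterable[Hook]) -> Iterator[Source]:
    """Run each source through the hooks in order; skip it if a hook returns None."""
    hook_list = list(hooks)
    for source in sources:
        current: Optional[Source] = dict(source)  # shallow copy, original stays untouched
        for hook in hook_list:
            current = hook(current)
            if current is None:
                break  # a hook asked to drop this document
        if current is not None:
            yield current


def rename_field(old: str, new: str) -> Hook:
    """Hypothetical hook: move the value of field `old` to field `new`."""
    def hook(source: Source) -> Optional[Source]:
        if old in source:
            source[new] = source.pop(old)
        return source
    return hook


def drop_if_missing(field: str) -> Hook:
    """Hypothetical hook: skip documents that lack `field`."""
    def hook(source: Source) -> Optional[Source]:
        return source if field in source else None
    return hook
```

For example, `apply_hooks(sources, [drop_if_missing("title"), rename_field("body", "content")])` would skip documents without a title and rename the `body` field on the rest before they are sent to the new index.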

clint


(David Pilato) #8

Why are you reindexing?
Because I need to change my mapping (to index existing fields),
or I want to use another analyzer or boost some fields.

Since I am not changing the document itself, ES already has it in the _source field. I just want to change my mapping and apply it to the previously indexed docs.

If I merge the new mapping with the old one (assuming there is no conflict), I don't think ES automatically re-parses the previous docs and reindexes them.

Am I wrong?

Is there a best practice for reindexing existing data?

Cheers
David :wink:



(Shay Banon) #9

For cases where you can simply take the _source and index it into a new index (possibly with different mappings), it makes sense. Having something built into elasticsearch for that definitely makes sense. It's not a high priority, since with the scan API one can implement it on the client side.
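
For the simple "take the _source and index it as-is" case, the client-side work mostly amounts to turning search hits into a bulk request body. The Python sketch below builds the newline-delimited JSON that the bulk API accepts: one action line followed by one source line per document. The `bulk_body` helper and the index name in the usage note are hypothetical, and the `_type` entry reflects the document types used in this thread (later Elasticsearch releases removed types).

```python
import json
from typing import Dict, Iterable


def bulk_body(hits: Iterable[Dict], index: str, doc_type: str) -> str:
    """Build a newline-delimited bulk body that re-indexes each hit's _source.

    `index` and `doc_type` name the target; both are supplied by the caller.
    """
    lines = []
    for hit in hits:
        # Action line: index this document under its original id.
        action = {"index": {"_index": index, "_type": doc_type, "_id": hit["_id"]}}
        lines.append(json.dumps(action))
        # Source line: the stored _source, unchanged.
        lines.append(json.dumps(hit["_source"]))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

A call such as `bulk_body(hits, "myindex_v2", "mydoc")` returns a string that can be sent to the `_bulk` endpoint with any HTTP client.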



(James Cook) #10

On Thu, Jun 23, 2011 at 9:12 AM, Clinton Gormley clinton@iannounce.co.uk wrote:

Why are you reindexing?

Either you want to change the mapping, or the index settings (number of
shards, analyzers) or you want to change the data itself.

Or moving to a newer version of ES requires it.


(David Pilato) #11
For cases where you can simply take the _source and index it into a new index (possibly with different mappings), it makes sense. Having something built into elasticsearch for that definitely makes sense. It's not a high priority, since with the scan API one can implement it on the client side.

Thanks.

I just opened a feature request for that here: https://github.com/elasticsearch/elasticsearch/issues/1077

I will "try" to code it as writing rivers seems to be "easy"...

