Indexing strategies for quick re-indexing of large datasets

(johnrodey) #1

We often need to re-index our data. Our ElasticSearch cluster has relatively few machines, but we do have access to a very large Hadoop cluster that we could use to build the indexes. Does anyone have thoughts on ways to achieve this?

One option I was considering is for each reducer to create its own index with a single shard and no replicas (backups), then copy those shard directories to my search cluster under the same index. That way, if I have 100 reducer tasks, I would end up with an index of 100 shards on my search cluster. At that point I could also tell it to add replicas if desired.
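As a point of comparison for the "no backups first, add them later" idea, here is a minimal sketch of the request bodies for a conventional bulk load: create the index with zero replicas and refresh disabled, push documents through the bulk API, then raise the replica count afterwards. The helper names (`create_index_body`, `post_load_settings_body`, `bulk_payload`) are hypothetical, the index name and values are placeholders, and older ElasticSearch versions also expect a `_type` in the bulk action line, so treat this as an illustration rather than a tested recipe for any particular version.

```python
import json


def create_index_body(num_shards):
    """Body for creating the index (PUT /<index>) before a bulk load.

    number_of_shards can only be set at creation time; replicas and
    refresh are turned off here so the load does as little extra work
    as possible.
    """
    return {
        "settings": {
            "index": {
                "number_of_shards": num_shards,
                "number_of_replicas": 0,
                "refresh_interval": "-1",
            }
        }
    }


def post_load_settings_body(num_replicas, refresh_interval="1s"):
    """Body for PUT /<index>/_settings once the load has finished.

    Both settings are dynamic, so replicas can be added afterwards.
    """
    return {
        "index": {
            "number_of_replicas": num_replicas,
            "refresh_interval": refresh_interval,
        }
    }


def bulk_payload(index_name, docs):
    """Build a newline-delimited _bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Action line first, then the document source on its own line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Each reducer could send its own `bulk_payload` output to the search cluster, which keeps shard placement under ElasticSearch's control, at the cost of the small cluster still doing the indexing work.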

Is this possible/desirable? Are there downsides to doing it this way? I know ElasticSearch applies its own logic when deciding which shard a document is sent to for indexing, and I would lose that. Please feel free to throw rocks at it.
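To make the routing concern concrete: ElasticSearch picks a shard from a hash of the routing value (the document ID by default) taken modulo the number of primary shards, so a reducer-built shard is only usable if every document in it is one ElasticSearch would have routed there. The sketch below shows the shape of a partitioner that follows this rule. It uses `zlib.crc32` purely as a stand-in hash, which is an assumption for illustration; a real partitioner would have to reproduce the exact hash function and formula of the ElasticSearch version in use. The function names are hypothetical.

```python
import zlib


def shard_for(routing_key, num_primary_shards):
    """Pick a shard number for a routing key.

    ASSUMPTION: crc32 is a placeholder hash, not the one ElasticSearch
    uses. Only the overall "hash modulo shard count" shape is the point.
    """
    hashed = zlib.crc32(routing_key.encode("utf-8"))
    return hashed % num_primary_shards


def partition(doc_ids, num_primary_shards):
    """Group document IDs by target shard, one bucket per reducer."""
    buckets = {shard: [] for shard in range(num_primary_shards)}
    for doc_id in doc_ids:
        buckets[shard_for(doc_id, num_primary_shards)].append(doc_id)
    return buckets
```

If the reducers partition with a different function than the cluster uses, searches would likely still find the documents, but lookups, updates, and deletes by ID would go to the wrong shard.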

Are there any other solutions?

Ideally, I'm trying to find the best way to use my large Hadoop cluster to quickly build an index that will be hosted by just a few machines.


(johnrodey) #2

I have to imagine other folks have tried different strategies for quick re-indexing. Any insight is very much appreciated.

