Indexing strategies for quick re-indexing of large datasets

johnrodey · February 7, 2014, 7:16pm

We often need to re-index our data. Our Elasticsearch cluster has relatively few machines, but we do have access to a very large Hadoop cluster that we could use to build the indexes. Does anyone have thoughts on ways to achieve this?

I was thinking that one option would be for each reducer to create its own index with a single shard and no replicas, then copy those shard directories to my search cluster under the same index. That way, if I have 100 reducer tasks, I would end up with a 100-shard index on my search cluster once they complete. At that point I could also tell it to create replicas if desired.
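To make the settings side of this concrete, here is a rough sketch of the two requests involved: creating a build-time index with one shard and zero replica copies, and raising the replica count afterwards. It only builds the request method, path, and JSON body and does not talk to a cluster. The function names and the `reducer_007` index name are made up for illustration; `number_of_shards` and `number_of_replicas` are the standard index settings. The directory copy itself is not shown, since I don't know whether that part works.

```python
import json


def create_index_request(index_name, shards=1, replicas=0):
    """Build the request that creates an index with the given shard and replica counts."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return ("PUT", "/" + index_name, json.dumps(body))


def set_replicas_request(index_name, replicas):
    """Build the request that changes the replica count of an existing index."""
    body = {"index": {"number_of_replicas": replicas}}
    return ("PUT", "/" + index_name + "/_settings", json.dumps(body))


if __name__ == "__main__":
    # Hypothetical index name for one reducer's output.
    print(create_index_request("reducer_007"))
    print(set_replicas_request("reducer_007", 1))
```

The replica count can be changed on a live index, which is why it is left at zero during the build; the shard count is fixed when the index is created.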

Is this possible/desirable? Are there downsides to doing it this way? I know Elasticsearch applies its own logic when deciding which shard to send a document to for indexing, which I would lose. Please feel free to throw rocks at it.
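On the shard-selection concern: as far as I understand it, Elasticsearch picks the shard as a hash of the routing value (the document ID by default) modulo the number of primary shards. If that is right, the reducers would have to partition documents the same way, or lookups by ID would go to the wrong shard. A minimal sketch of the idea follows. The CRC32 hash is only a stand-in and is NOT the hash Elasticsearch uses, and `shard_for` and `partition` are hypothetical names.

```python
import zlib


def shard_for(routing_key, num_primary_shards):
    """Map a routing key to a shard number using hash-modulo routing."""
    # Stand-in hash for illustration only; Elasticsearch uses its own hash function.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def partition(doc_ids, num_primary_shards):
    """Group document IDs by the shard that hash-modulo routing assigns them to."""
    buckets = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        buckets[shard_for(doc_id, num_primary_shards)].append(doc_id)
    return buckets


if __name__ == "__main__":
    # Each bucket would be the input of the reducer building that shard.
    print(partition(["doc-1", "doc-2", "doc-3", "doc-4"], 3))
```

In a Hadoop job this logic would live in the partitioner, with one reducer per shard, and the hash would have to match the one Elasticsearch actually uses for the copied shards to line up.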

Are there any other solutions?

Ideally, I'm trying to find the best way to use my large Hadoop cluster to quickly build an index that will be hosted by just a few machines.

THANKS!!!

johnrodey · February 11, 2014, 5:01pm

I have to imagine other folks have tried different strategies for quick re-indexing. Any insight is very much appreciated.

Thanks!
