Ok, good to hear.
I think it's going to require a bit of metadata hacking, because first we have x indices, each with exactly one shard, and then we need to convert that into one index with x shards, with the shard count stored in the metadata. But it seems doable.
I guess I posted it in the Hadoop category because offline indexing can make a lot of sense in such an environment.
I assume you want to use standalone ES instances in order to reduce the network traffic and ensure that indexing is done locally, is this correct?
If so, what you can do, at least as long as you do not need to update your records at a later stage, is to set up all your indexing nodes in a cluster and configure the number of shards so that each node holds exactly one primary shard, with no replicas. Then index a number of records so that at least one record has been written to each shard. Retrieve the id of one record from each shard and use it as the routing key for all records you want to index into that particular shard.
This will give you a single index with a number of shards, while still allowing all documents to be indexed locally only (as long as you provide the correct routing keys). It is not elegant, but I think it should work.
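To make that concrete, here is a rough sketch of the idea, assuming elasticsearch-py with the 7.x-style API; the index and field names are made up, and instead of indexing probe records and reading one id back per shard it probes candidate routing values with the _search_shards API, which should amount to the same thing:

```python
# Sketch: one primary shard per indexing node, no replicas, and one routing
# value pinned to each shard. Index and field names are illustrative only.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

NUM_SHARDS = 5  # one primary shard per indexing node in the cluster

es.indices.create(index="offline-demo", body={
    "settings": {"number_of_shards": NUM_SHARDS, "number_of_replicas": 0}
})

# Probe candidate routing values until every primary shard has one assigned.
routing_for_shard = {}
candidate = 0
while len(routing_for_shard) < NUM_SHARDS:
    routing = str(candidate)
    # _search_shards reports which shard a given routing value maps to.
    info = es.search_shards(index="offline-demo", routing=routing)
    shard_id = info["shards"][0][0]["shard"]
    routing_for_shard.setdefault(shard_id, routing)
    candidate += 1

# Every document meant for shard 0 is then indexed with that shard's routing
# value, so it is written on the node holding shard 0.
es.index(index="offline-demo", id="doc-1",
         routing=routing_for_shard[0], body={"field": "value"})
```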
You cannot just "create" an index with its shards outside of Elasticsearch; there are a number of things Elasticsearch does during index creation that don't lend themselves to being done externally.
You're also not going to do any less work than actually doing it directly in Elasticsearch.
There have been other threads in this area where people have asked the same thing; it would be worth checking them.
They all give the general advice "Don't do it", but none gives an explicit "You can't do it because of x".
The only thing mentioned is the routing of documents via the document id. But since we are providing custom routing values and a custom routing function, I see no problem with it.
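For illustration, explicit routing looks roughly like this (a sketch assuming elasticsearch-py; the names are made up): once a routing value is given, the document id no longer determines the shard, and the same routing value has to be supplied when reading the document back by id.

```python
# Rough sketch (elasticsearch-py, made-up names): explicit routing overrides
# the default id-based shard selection, on both the write and the read path.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="offline-demo", id="user-42", routing="task-7",
         body={"msg": "routed explicitly"})

# The same routing value is needed to fetch the document by id again.
doc = es.get(index="offline-demo", id="user-42", routing="task-7")
```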
So we are now in the unfortunate situation that our devs are saying "Yeah, we have a working prototype for ES offline indexing" and the ES experts are saying "Nope, it's not possible".
So is there something we are missing?
Any insights appreciated!
Johannes
Here, again, is a more precise description of what we are doing:
let's say we have 50 concurrent tasks
each task spawns its own, unconnected, in-vm ES node
each task creates an index in that node, setting the shard count to 50 and, through custom routing, ensuring that only the one of the 50 shards that belongs to the task gets filled
each task creates a snapshot once its indexing is done
after all tasks have finished:
a single HDFS folder is created with the snapshot/index metadata (taken from one task)
the one non-empty shard from each task is copied into that folder
we invoke snapshot restore on our live ES cluster and we have our single index with 50 independently created shards online
when querying the data we always have to apply our custom routing function (a rough sketch of one task's side of this flow follows below)
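Here is that sketch of what one of the 50 tasks does, assuming elasticsearch-py (7.x-style API) against the task's local in-vm node; the index, repository, and path names are made up, and the routing value is found here by probing the _search_shards API rather than by our actual custom routing function:

```python
# Sketch of one of the 50 tasks: a local index with 50 shards / 0 replicas,
# all documents routed into this task's own shard, then a snapshot of it.
# Names (index, repository, paths) are illustrative only.
from elasticsearch import Elasticsearch

TASK_ID = 7      # which of the 50 shards this task should fill
NUM_SHARDS = 50

es = Elasticsearch("http://localhost:9200")  # the task's local in-vm node

es.indices.create(index="offline-index", body={
    "settings": {"number_of_shards": NUM_SHARDS, "number_of_replicas": 0}
})

# Find a routing value that maps onto this task's shard.
routing = None
candidate = 0
while routing is None:
    info = es.search_shards(index="offline-index", routing=str(candidate))
    if info["shards"][0][0]["shard"] == TASK_ID:
        routing = str(candidate)
    candidate += 1

# Index this task's documents; they all land in shard TASK_ID.
for i in range(1000):  # placeholder for this task's real documents
    es.index(index="offline-index", id=f"task{TASK_ID}-{i}",
             routing=routing, body={"value": i})

# Snapshot the local index once indexing is done.
es.snapshot.create_repository(repository="task-repo", body={
    "type": "fs", "settings": {"location": "/tmp/es-snapshots/task-7"}
})
es.snapshot.create(repository="task-repo", snapshot="task-7",
                   body={"indices": "offline-index"},
                   wait_for_completion=True)

# On the live cluster (after the merged HDFS folder has been assembled), roughly:
#   live_es.snapshot.restore(repository="hdfs-repo", snapshot="merged",
#                            wait_for_completion=True)
#   live_es.search(index="offline-index", routing=routing,
#                  body={"query": {"match_all": {}}})
```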
So this all seems to work pretty well. I'm just wondering if the document ids - which are created outside of the ES cluster, concurrently - might potentially collide and cause any problems...?