Offline indexing with multiple shards

Johannes_Zillmann · February 9, 2016, 12:24pm

I have a question.
Is the following possible ?

create multiple indices with multiple ES single node instances
for each index create a snapshot
put all snapshots in one folder and restore them in a ES cluster as a single index

Or are there some architectural things that would bite such a procedure, like the document routing !?

best
Johannes

warkolm · February 10, 2016, 1:55am

You can do this, the metadata is stored with the snapshot so you can restore across different clusters.

Also, did you mean to post this in the hadoop section?

Johannes_Zillmann · February 10, 2016, 10:41am

Ok, good to hear.
I think its requiring a little bit of metadata hacking because first we have x indices, each one having exactly one shard and then we need to convert that to 1 index having x shards and the shard count is stored in the metadata.. but it seems doable.

I guess i posted it in Hadoop because offline indexing can make a lot of sense in such an environment.

Johannes

warkolm · February 10, 2016, 9:39pm

No, that is not possible. It's entirely different to having snapshots from N clusters and then restoring it to a single one.

Johannes_Zillmann · February 10, 2016, 10:54pm

In what problems we would run ?

warkolm · February 11, 2016, 2:46am

The problem of it being unachievable.

Take a read of https://www.elastic.co/guide/en/elasticsearch/guide/master/routing-value.html

Johannes_Zillmann · February 11, 2016, 8:57am

Ok, makes sense. But if we would apply custom routing it could work right ?
We:

also have a fixed number of shards
we can number then from 1..n when creating it
we write our document id with the shard number in it
our custom router routes to the correct shard based on the document id

WDYT ?

warkolm · February 11, 2016, 8:53pm

Nope.

Christian_Dahlqvist · February 11, 2016, 10:01pm

I assume you want to use standalone ES instances in order to reduce the network traffic and ensure that indexing is done locally, is this correct?

If so, what you can do, at least as long as you do not need to update your records at a later stage, is to set up all your indexing nodes in a cluster and configure the number of shards so that they have one primary shard each and no replicas. Then index a number of records so that at least one record has been written to each shard. Retrieve the id of one record from each shard and use this as a routing key for all records you want to index into that particular shard.

This will give you a single index with a number of shards, while it still allows all documents to be indexed locally only (as long as you provide the correct routing keys). It is not elegant, but I think it should work.

Johannes_Zillmann · February 11, 2016, 11:15pm

Ok, i see. That would eliminate the need for a custom routing function i guess!?

There are multiple reasons why we evaluate the offline indexing capabilities of ES.
The main thing is speed i guess. Also telling a few other aspects:

we "create" pieces of data in parallel, lets say a map-reduce job
once the job is done and the files are available (in HDFS) we make them available for other parts of our system
similar we want to have the data (duplicated) in ES as well
so producing an index would be a kind of side-effect for the jobs (shouldn't slow down them too much)
since the ES cluster on top of YARN has no reliable persistence, we want to store the indices in hdfs for recovery
we do not need to search the index until its creation has finished nor do we need to update an existing index

Johannes_Zillmann · February 11, 2016, 11:18pm

@warkolm Can you elaborate ?

warkolm · February 12, 2016, 12:30am

You cannot just "create" an index with shards outside of Elasticsearch, there are a number of things that it does during this process that doesn't lend itself to this process.
You're also not going to do any less work than actually doing it directly in Elasticsearch.

There's been other threads where people have asked this in this area as well, it'd be worth checking them.

Christian_Dahlqvist · February 12, 2016, 6:18am

Any specific reason you are considering this type of complicated workaround instead of using the existing ES-Hadoop connector?

Johannes_Zillmann · February 12, 2016, 11:21am

Thanks @warkolm!

@Christian_Dahlqvist our job-pipeline already stands as it is. An ES index would just be an additional product of it!

Johannes_Zillmann · February 15, 2016, 10:44am

Sorry to come back to this...
I took the time and evaluated other threads:

All have the general advise "Don't do it" but none has an explicit "You can't do it because of x".
The only thing mentioned is the routing of documents via document id. But since you providing custom routing values and custom routing functions i see no problem with it.

So we are now in an unfortunate situation that our devs saying "Yeah, we have a working prototype for ES offline indexing" and the ES experts saying "Nope, its not possible".

So is there something we miss ?
Any insights appreciated!
Johannes

Johannes_Zillmann · February 18, 2016, 11:11am

Here again a more exact version of what we are -doing.

lets say we have 50 concurrent tasks
each task is spawning its own, unconnected, ES in-vm node
each task creates and index in that nodes, setting the shard count to 50 and through custom routing ensuring that only the one of the 50 shards is filled, which belongs to the task
each task creates an snapshot once the indexing is done
after all tasks has been finished:
a single HDFS folder is created with the snapshot/index metadata (taken from one task)
the one shard which is not empty from each task is copied into that folder
we invoke snapshot restore on our live ES cluster and we have our single index with 50 independently created shards online
when querying the data we always have to specify our custom routing function

So this all seem to work pretty well. I'm just wondering if the document id's - which are created outside of the ES cluster concurrently - might potentially collide and cause any Problems... ?

@warkolm @costin any thoughts ?

Topic		Replies	Views
Moving big indices around with two or more instances running in the same box, working separately Elasticsearch	10	1253	July 5, 2017
Shard Elasticsearch	6	297	July 6, 2017
Is it possible to have elastic search index documents in 2 locations? Elasticsearch	2	762	December 1, 2017
Shard Elasticsearch	1	243	July 6, 2017
Multiple data directories ->parallel search of shards on same instance? Elasticsearch	6	3400	July 5, 2017

Offline indexing with multiple shards

Related topics