Offline indexing with multiple shards

I have a question.
Is the following possible?

  • create multiple indices with multiple ES single-node instances
  • for each index, create a snapshot
  • put all snapshots in one folder and restore them in an ES cluster as a single index

Or are there architectural things that would bite such a procedure, like the document routing?

best
Johannes

You can do this; the metadata is stored with the snapshot, so you can restore across different clusters.

Also, did you mean to post this in the hadoop section?

Ok, good to hear.
I think it requires a little bit of metadata hacking, because first we have x indices, each one having exactly one shard, and then we need to convert that to one index having x shards — and the shard count is stored in the metadata. But it seems doable.

I guess I posted it in Hadoop because offline indexing can make a lot of sense in such an environment.

Johannes

No, that is not possible. Taking snapshots from N clusters and then restoring them to a single one is an entirely different thing.

What problems would we run into?

The problem of it being unachievable.

Take a read of https://www.elastic.co/guide/en/elasticsearch/guide/master/routing-value.html
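For context, that page describes how Elasticsearch places documents: `shard = hash(_routing) % number_of_primary_shards`, where `_routing` defaults to the document id. A minimal sketch of why this bites the snapshot-merging idea — note that ES actually uses a Murmur3 hash; Python's `zlib.crc32` is used here only as a stand-in for illustration:

```python
import zlib

def shard_for(routing: str, num_primary_shards: int) -> int:
    # Stand-in for ES's formula: shard = hash(_routing) % num_primary_shards.
    # ES uses Murmur3; crc32 is only for illustration.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

# In a 1-shard index every document lands on shard 0 ...
assert shard_for("doc-42", 1) == 0
# ... but in a 50-shard index the same routing value may map to any
# of the 50 shards, so documents indexed into separate 1-shard
# indices generally end up on the "wrong" shard after a merge.
print(shard_for("doc-42", 50))
```

The shard count is part of the modulo, which is why it is fixed at index creation and baked into the index metadata.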

OK, makes sense. But if we applied custom routing it could work, right?
We:

  • also have a fixed number of shards
  • we can number them from 1..n when creating them
  • we write our document ids with the shard number in them
  • our custom router routes to the correct shard based on the document id

WDYT?

Nope.

I assume you want to use standalone ES instances in order to reduce the network traffic and ensure that indexing is done locally, is this correct?

If so, what you can do, at least as long as you do not need to update your records at a later stage, is to set up all your indexing nodes in a cluster and configure the number of shards so that they have one primary shard each and no replicas. Then index a number of records so that at least one record has been written to each shard. Retrieve the id of one record from each shard and use this as a routing key for all records you want to index into that particular shard.

This will give you a single index with a number of shards, while it still allows all documents to be indexed locally only (as long as you provide the correct routing keys). It is not elegant, but I think it should work.
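The "retrieve one id per shard and reuse it as a routing key" step works because any routing value that hashes to a given shard pins all documents using it to that shard. A self-contained sketch of the idea (hypothetical helper names, with Python's `zlib.crc32` standing in for the Murmur3 hash ES really uses) that brute-forces such a value per shard instead of reading it back from already-indexed documents:

```python
import zlib

def shard_for(routing: str, num_shards: int) -> int:
    # Stand-in for ES's hash(_routing) % number_of_primary_shards.
    return zlib.crc32(routing.encode("utf-8")) % num_shards

def routing_key_for_shard(target: int, num_shards: int) -> str:
    # Search for any routing value that maps to the target shard.
    # The approach above instead reuses a real document id that is
    # already known to live on that shard -- same effect.
    i = 0
    while True:
        candidate = f"key-{i}"
        if shard_for(candidate, num_shards) == target:
            return candidate
        i += 1

# One pinned routing key per shard of a 50-shard index.
keys = {s: routing_key_for_shard(s, 50) for s in range(50)}
```

Every document indexed with `routing=keys[s]` then lands on shard `s`, with no custom hash function needed.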

OK, I see. That would eliminate the need for a custom routing function, I guess?

There are multiple reasons why we are evaluating the offline indexing capabilities of ES.
The main one is speed, I guess. A few other aspects:

  • we "create" pieces of data in parallel, let's say in a map-reduce job
  • once the job is done and the files are available (in HDFS), we make them available for other parts of our system
  • similarly, we want to have the data (duplicated) in ES as well
  • so producing an index would be a kind of side effect of the jobs (it shouldn't slow them down too much)
  • since the ES cluster on top of YARN has no reliable persistence, we want to store the indices in HDFS for recovery
  • we do not need to search the index until its creation has finished, nor do we need to update an existing index

@warkolm Can you elaborate?

You cannot just "create" an index with shards outside of Elasticsearch; there are a number of things Elasticsearch does during index creation that don't lend themselves to being done externally.
You're also not going to do any less work than actually doing it directly in Elasticsearch.

There have been other threads in this area where people have asked about this as well; it'd be worth checking them.


Any specific reason you are considering this type of complicated workaround instead of using the existing ES-Hadoop connector?

Thanks @warkolm!

@Christian_Dahlqvist our job-pipeline already stands as it is. An ES index would just be an additional product of it!

Sorry to come back to this...
I took the time and evaluated other threads:

All have the general advice "Don't do it", but none has an explicit "You can't do it because of x".
The only thing mentioned is the routing of documents via document id. But since custom routing values and custom routing functions can be provided, I see no problem with it.

So we are now in the unfortunate situation that our devs are saying "Yeah, we have a working prototype for ES offline indexing" while the ES experts are saying "Nope, it's not possible".

So is there something we are missing?
Any insights appreciated!
Johannes

Here, again, is a more exact description of what we are doing:

  • let's say we have 50 concurrent tasks
  • each task spawns its own, unconnected, ES in-VM node
  • each task creates an index in that node, setting the shard count to 50 and ensuring through custom routing that only the one of the 50 shards belonging to the task is filled
  • each task creates a snapshot once the indexing is done
  • after all tasks have finished:
  • a single HDFS folder is created with the snapshot/index metadata (taken from one task)
  • the one non-empty shard from each task is copied into that folder
  • we invoke snapshot restore on our live ES cluster, and we have our single index with 50 independently created shards online
  • when querying the data, we always have to specify our custom routing function

So this all seems to work pretty well. I'm just wondering if the document ids — which are created concurrently outside of the ES cluster — might potentially collide and cause any problems?
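On the collision question: Elasticsearch won't deduplicate ids across shards at restore time (a duplicate id in two shards would mean routed get-by-id only ever sees one copy), so collision-freedom has to come from the id scheme itself. One hypothetical scheme — not something described in this thread or provided by ES — is to namespace ids per task:

```python
import itertools

def id_factory(task_id: int):
    # Each of the 50 tasks gets a disjoint id namespace, so ids
    # generated concurrently on different nodes can never collide.
    counter = itertools.count()
    def next_id() -> str:
        return f"task{task_id:02d}-{next(counter)}"
    return next_id

gen1, gen2 = id_factory(1), id_factory(2)
print(gen1(), gen1(), gen2())  # task01-0 task01-1 task02-0
```

Since the shard number is already embedded in the ids for routing, reusing it as the namespace prefix would make collisions structurally impossible.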

@warkolm @costin any thoughts?