Our typical deployment has 5000 users per server. We are planning to use Elasticsearch for indexing new data from now on. Our earlier indexing engine let us map one index to one user, which made it easy to restore a single user's index after an individual failure.
With Elasticsearch, I see there is a limit of 1000 shards per server. One Elasticsearch index can contain one or more shards. Segment-level allocation control is not available.
Either way, I need to map multiple users to a single Elasticsearch index. If a single user's items fail, I may need to restore or repair the entire index.
I want to avoid unnecessarily overwriting data for unaffected users during a restore or repair.
Can anyone tell me the best way to tackle this problem?
Then I would propose putting all users in a single index with a suitable number of primary shards. You can use routing to minimise the number of shards queried and add a user filter at the application layer to ensure each user sees the correct data. This will scale much better and be more efficient than an index per user. This does assume there are no mapping conflicts between users though.
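As a rough illustration of routing plus an application-side user filter, here is a minimal Python sketch that only builds the request descriptions rather than calling a cluster. The index name `users`, the field name `user_id`, and the helper names are assumptions made up for the example; `routing` as a request parameter and the `bool`/`filter`/`term` query are standard Elasticsearch features.

```python
from typing import Any, Dict

INDEX = "users"          # hypothetical shared index name
USER_FIELD = "user_id"   # hypothetical keyword field holding the owner


def index_request(user_id: str, doc_id: str, doc: Dict[str, Any]) -> Dict[str, Any]:
    """Describe an index request that routes the document by its owner."""
    body = dict(doc)
    body[USER_FIELD] = user_id  # stored so that searches can filter on it
    return {
        "method": "PUT",
        "path": f"/{INDEX}/_doc/{doc_id}",
        "params": {"routing": user_id},  # same user -> same shard
        "body": body,
    }


def search_request(user_id: str, query: Dict[str, Any]) -> Dict[str, Any]:
    """Describe a search that hits only the user's shard and only their docs."""
    return {
        "method": "POST",
        "path": f"/{INDEX}/_search",
        "params": {"routing": user_id},  # limits which shards are queried
        "body": {
            "query": {
                "bool": {
                    "must": [query],
                    # Several users can share a shard, so routing alone
                    # does not isolate one user's data.
                    "filter": [{"term": {USER_FIELD: user_id}}],
                }
            }
        },
    }
```

For example, `search_request("u42", {"match": {"subject": "invoice"}})` describes a search that is sent with `routing=u42` and filtered on `user_id: u42`. The filter is still needed because routing only narrows the shards searched, not the documents returned.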
We are planning a similar approach. Around 1000 users will map to a single Elasticsearch index containing 2 shards of 50 GB each.
To minimise the impact if an index goes down, we are limiting the mapping to 1000 users per index. Do you see any issues with this, or is there a better approach?
Secondly, restoring from a snapshot works at the index level. If there is an issue with a single user's indexed items, restoring the index would unnecessarily overwrite other users' items. Any idea how to overcome this problem?
I do not see a need to limit it to 1000 users per index. It just adds a step of identifying the correct index for the user without much benefit. If you use routing, you can also speed up searching by having a reasonably large number of primary shards, e.g. 100.
Restoring an index will indeed affect all users. You can, however, restore the index under a different name, then delete the affected user's data from the live index and reindex it from the restored copy. This is feasible because each user has relatively little data.
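The restore-and-reindex sequence could look roughly like the sketch below, which only lists the REST calls in order and does not talk to a cluster. The repository name `my_repo`, the snapshot name `snap_1`, the index name `users`, the field `user_id`, and the helper name are all placeholders invented for the example. The endpoints used (`_restore` with `rename_pattern` and `rename_replacement`, `_delete_by_query`, `_reindex`) are existing Elasticsearch APIs, but check the options against the documentation for your version.

```python
from typing import Any, Dict, List, Optional, Tuple

# (HTTP method, path, JSON body or None)
Request = Tuple[str, str, Optional[Dict[str, Any]]]


def repair_user_requests(
    user_id: str,
    live_index: str = "users",     # hypothetical live index
    repo: str = "my_repo",         # hypothetical snapshot repository
    snapshot: str = "snap_1",      # hypothetical snapshot name
    user_field: str = "user_id",   # hypothetical owner field
) -> List[Request]:
    """List the calls that repair one user's data from a snapshot."""
    restored = f"restored_{live_index}"
    user_query = {"term": {user_field: user_id}}
    return [
        # 1. Restore the snapshot copy under a different name so the
        #    live index, and every other user's data, stays untouched.
        ("POST", f"/_snapshot/{repo}/{snapshot}/_restore", {
            "indices": live_index,
            "rename_pattern": "(.+)",
            "rename_replacement": "restored_$1",
        }),
        # 2. Remove only the affected user's documents from the live index.
        ("POST", f"/{live_index}/_delete_by_query", {"query": user_query}),
        # 3. Copy that user's documents back from the restored index.
        ("POST", "/_reindex", {
            "source": {"index": restored, "query": user_query},
            "dest": {"index": live_index},
        }),
        # 4. Drop the temporary restored index.
        ("DELETE", f"/{restored}", None),
    ]
```

Calling `repair_user_requests("u42")` returns four requests to run in order. Only documents matching `user_id: u42` are deleted and copied back, so other users' data in the live index is not overwritten.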