I am looking for suggestions regarding Elasticsearch index partitioning for a multi-tenant index.
Following is my scenario:
- Customer content containers hold content items (PDFs, docs with metadata) on the order of hundreds of thousands.
- Content metadata changes quite frequently within a container, with newer content text added for versioning purposes.
- Users frequently need to access content across various containers. However, the primary use case is search within a container rather than across containers.
- Updates are triggered only at the container level. Some containers may receive a higher volume of content due to bulk updates or content copying between containers, so it is necessary to keep the index available for the other containers despite heavy index updates to a few of them. The number of containers could be on the order of a few hundred thousand.
- Primary queries are full-text with metadata filtering for content within a container (and across containers as well).
- A user could have access to a couple hundred containers at most, and should be able to search across all of them.
- Search also gets used to render content metadata without searching the content text (filter queries to render content grouping, organization, and metadata for UI display), e.g. rendering folder contents.
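For context, the two main query shapes I have in mind look roughly like this, built here as plain search request bodies (a minimal sketch; the field names `content_text`, `doc_type`, and `folder_id` are placeholders, not our actual mapping):

```python
# Sketch of the two primary query shapes as plain Elasticsearch request bodies.
# Field names are hypothetical placeholders for illustration.

def full_text_in_container(text, metadata_filters):
    """Full-text search with metadata filtering, run against one container's index."""
    return {
        "query": {
            "bool": {
                # scored full-text match on the content body
                "must": {"match": {"content_text": text}},
                # non-scoring metadata filters (cacheable filter context)
                "filter": [{"term": {k: v}} for k, v in metadata_filters.items()],
            }
        }
    }

def folder_listing(folder_id):
    """Metadata-only rendering: pure filter, no scoring, no content-text search."""
    return {
        "query": {"bool": {"filter": [{"term": {"folder_id": folder_id}}]}},
        "_source": ["title", "doc_type", "modified_at"],  # return metadata fields only
    }
```

These bodies would then be POSTed to the `_search` endpoint of whichever index (or set of indexes) holds the container's documents.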
Considering the above factors, here are my questions regarding index modeling:
- Is it a good idea to create an index per container? My reasoning would be:
a) It keeps all operations for a container constrained to its corresponding index, so any segment merges are localized to that index rather than affecting the whole cluster.
b) Or should we group the containers at a different aggregation level and partition the indexes that way?
c) Also, since containers come and go in relatively large numbers each month, having an index per container should make it easy to purge older containers without triggering updates across the cluster.
d) Another reason for having multiple smaller indexes (an index per container) is that we are considering a bitset-based filter (filtering by a bitset of the doc IDs a user has access to). A smaller set of documents per index would mean a smaller bitset to deal with.
e) Partitioned indexes also mean segment merges would be limited to the index receiving updates rather than spanning the cluster. Is this understanding correct?
f) Also, any index corruption would be limited to that specific container (or should I not be worrying about this)?
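To make the index-per-container idea concrete, here is a rough sketch of how I imagine it working; the `container-<id>` naming scheme and these helper functions are my own assumptions, not an existing API:

```python
# Sketch: index-per-container naming, cross-container search targets,
# and container purging as a single index delete. All names are assumptions.

def index_name(container_id):
    """One index per container, e.g. 'container-123'."""
    return f"container-{container_id}"

def search_target(accessible_container_ids):
    """A user searches across the containers they can access (a couple hundred
    at most) by targeting a comma-separated list of index names."""
    return ",".join(index_name(c) for c in accessible_container_ids)

def purge_container(container_id):
    """Purging an old container becomes a single index DELETE; no per-document
    deletes need to ripple across the rest of the cluster."""
    return ("DELETE", f"/{index_name(container_id)}")
```

A cross-container query would then go to `GET /<search_target(ids)>/_search`, and decommissioning a container is just the one DELETE call above.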
Also, in the above scenario, is having multiple small Elasticsearch clusters (with tens of thousands of indexes each) a better option than having one large cluster with hundreds of thousands of indexes? (We are considering Mesosphere for managing the Elasticsearch clusters.)