I am trying to build an Elasticsearch cluster (3 master nodes and 2 data nodes to start with, expandable up to 20 data nodes) capable of hosting a large number of tenants (4000-4500), each of whom will index documents of different types. Most of the tenants (say 4000) are expected to have only 1 or 2 types of documents to index, whereas a small number of them (say 500) will have up to 5 types of documents to index. The average size of each document is around 1 KB, and each document could contain up to 30-50 fields.
What is the best approach?
Approach 1: One index (say 1 primary shard, 1 replica) per document type per tenant.
Approach 2: One index (say 1 primary shard, 1 replica) per tenant. Use the Elasticsearch type to represent each document type (i.e., 1 index will have multiple types). Mandate that same-named fields in different document types within the same index must be of the same data type.
Approach 3: Dynamically create indexes (say 5 primary shards, 1 replica, each index hosting up to 50 types) that could host types from multiple tenants. Use an internal field naming strategy (in my application layer) to ensure that fields (in document types) from different tenants are uniquely named. Use appropriate routing to ensure that documents from the same tenant go to the same shard.
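To make Approach 3 more concrete, here is a minimal sketch of what the application-layer naming and routing could look like. The function name `namespace_document`, the `__` separator, and the sample tenant and field names are all hypothetical, chosen only for illustration; the sketch builds plain dictionaries and does not call any Elasticsearch client.

```python
def namespace_document(tenant_id, doc_type, doc):
    """Prefix every field with the tenant and document type so that
    fields from different tenants never collide in a shared index.

    The "__" separator is an assumption made for this sketch; any
    separator that is legal in field names on your version would do.
    """
    prefix = f"{tenant_id}__{doc_type}__"
    body = {prefix + name: value for name, value in doc.items()}
    # Using the tenant id as the routing value keeps all of a tenant's
    # documents on the same shard of the shared index.
    return {"routing": tenant_id, "body": body}


# Hypothetical tenant "t42" indexing an "invoice" document.
request = namespace_document("t42", "invoice", {"amount": 120, "currency": "EUR"})
print(request)
```

The `routing` value would be passed along with the index request, and the same value would have to be supplied again when searching, so that only the tenant's shard is queried.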
I do not use any parent/child relationships yet.
The issues that I see with each of the approaches are below.
Approach 1: This will create a cluster that needs to handle up to 21000 shards (4000 tenants * 2 types * 2 shards + 500 tenants * 5 types * 2 shards). However, given that each shard consumes resources, is this the right approach?
Approach 2: Slightly better than Approach 1, given that all document types from a tenant go into the same index. Our cluster will still need to handle up to 9000 shards (4000 tenants * 2 shards + 500 tenants * 2 shards). However, https://github.com/elastic/elasticsearch/issues/15613 indicates that types might go away in the future. Given that, is this the right approach?
Approach 3: Given that 50 types go into a single index, the shard requirement for the cluster is down to 2100 shards (10500 types / 50 types per index = 210 indexes, each with 5 primaries and 5 replicas). However, if Elasticsearch doesn't allow multiple types in the same index in the future, might I be pushing myself into a corner?
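For reference, the shard counts above can be reproduced with a few lines of arithmetic. This is only a sketch of the worst case described in the question (4000 tenants with 2 types each, 500 tenants with 5 types each); the function and variable names are made up for the example.

```python
import math


def total_shards(index_count, primaries, replicas):
    """Each index has `primaries` primary shards, each copied `replicas` times."""
    return index_count * primaries * (1 + replicas)


small_tenants, large_tenants = 4000, 500
types_small, types_large = 2, 5
total_types = small_tenants * types_small + large_tenants * types_large

# Approach 1: one index (1 primary, 1 replica) per type per tenant.
approach_1 = total_shards(total_types, primaries=1, replicas=1)

# Approach 2: one index (1 primary, 1 replica) per tenant.
approach_2 = total_shards(small_tenants + large_tenants, primaries=1, replicas=1)

# Approach 3: shared indexes (5 primaries, 1 replica) holding 50 types each.
shared_indexes = math.ceil(total_types / 50)
approach_3 = total_shards(shared_indexes, primaries=5, replicas=1)

print(approach_1, approach_2, approach_3)
```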
Which approach do you think is the best? Or are there other approaches that would allow me to satisfy my multi-tenant use-case?
Many thanks for your response. I am still learning Elasticsearch, and your words of wisdom definitely help me understand the possibilities Elasticsearch offers.
Four aspects of the "one index" approach concern me.
For the tenant volumes I expect, this means that one index will have 400,000 fields (4000 tenants * 2 document types * 50 fields). However, for any given document for a tenant, I expect only 50 of these fields to be populated. The discussion in https://github.com/elastic/elasticsearch/issues/15613 suggests that Lucene might have performance issues when it deals with such sparse structures. Is that something I should be worried about?
I was hoping (maybe this is not possible) for a low-maintenance, self-managing cluster where we just add nodes as data volumes grow. I am also trying to design a cloud system that is "always on". The "one index" approach would require us to constantly monitor document volumes and performance for each individual tenant and, when needed, create a new index, sync data from the current index (which will keep changing while the sync is happening), and re-point the alias to the new one. These are activities I was hoping to do only if they were absolutely essential. I must admit that I do not have any production experience with Elasticsearch yet, so it may be that the activities you describe are absolutely essential for a production system?
The "one index" approach is also at the mercy of Elasticsearch continuing to support multiple types in an index (assuming I use types to differentiate between document types)? I guess the other option could be to use an index with one uber type, use some internal field naming algorithm to detect and rename same-named fields of different data types originating from multiple tenants, and use tenant_id as a filter (similar to the approach described in https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html)?
From https://www.elastic.co/guide/en/elasticsearch/guide/current/reindex.html, my understanding is that once a field is created, it is not possible to remove that field from the Lucene index. If that is correct, moving a tenant to its own index does not help in reducing the sparseness of the "one index", and I will need to reindex the "one index" at some point? How difficult will it be to reindex an index that is potentially being used by 4000-5000 tenants (we can assume that not all of them will be active at any given time)?
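To check my understanding of the alias-based steps discussed above, here is a rough sketch of the request bodies I would expect to send to the `_aliases` endpoint. The helper names, the `tenant_<id>` alias naming scheme, and the index names are hypothetical, and the sketch assumes each document carries a `tenant_id` field that is mapped for exact matching; it only builds the JSON and does not talk to a cluster.

```python
import json


def tenant_alias_action(index, tenant_id):
    """Alias that makes a shared index look like a per-tenant index:
    searches through it are filtered to the tenant, and both searches
    and index requests are routed to the tenant's shard."""
    return {
        "add": {
            "index": index,
            "alias": f"tenant_{tenant_id}",  # hypothetical naming scheme
            "filter": {"term": {"tenant_id": tenant_id}},
            "routing": tenant_id,
        }
    }


def move_tenant_actions(tenant_id, old_index, new_index):
    """Re-point a tenant's alias from the shared index to its own index.
    Both actions go in one request so the switch happens atomically."""
    alias = f"tenant_{tenant_id}"
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical tenant "t42" living in a shared index, later moved out.
create_body = {"actions": [tenant_alias_action("shared_index", "t42")]}
move_body = move_tenant_actions("t42", "shared_index", "tenant_t42_v1")
print(json.dumps(create_body, indent=2))
print(json.dumps(move_body, indent=2))
```

If the application only ever addresses the alias, the swap itself should be invisible to the tenant; what remains open is how to keep the new index in sync with writes that arrive while the copy is running.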