Hi, I'm designing a multi-tenant indexing strategy for my system. I'm generally aware of the approaches commonly used in such cases:
1. Index per tenant
2. Shared index across tenants
3. Some variation of the above with custom routing based on a tenant id
The nature of my documents is such that they transition between many 'states'. In the initial state a document is unassigned (it has no tenant id), until it eventually gets assigned to one of the tenants in the system, at which point the assigned tenant id is immutable. I think this generally rules out using a tenant id as a routing key, as documents are not associated with a tenant when they are first created. The only straightforward option I can think of functionally is option 2: lumping all documents together in one index and filtering by tenant id, so that a tenant can retrieve only the docs assigned to it. The downside is that, due to the nature of my data, I am unable to optimize per tenant in any way, and have to search across all my documents and then filter. Is there any other way that an Elastic guru can recommend? Thanks
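For reference, the shared-index option described above could be sketched roughly as follows. This only builds the search request body and does not talk to a cluster. The field name `tenant_id` and the helper `tenant_scoped_query` are hypothetical, and the sketch assumes the tenant id is mapped as a keyword field that is simply absent on unassigned documents.

```python
from typing import Any, Dict, Optional

# Hypothetical field name; assumed to be a keyword field that is
# missing entirely on documents not yet assigned to a tenant.
TENANT_FIELD = "tenant_id"


def tenant_scoped_query(user_query: Dict[str, Any],
                        tenant_id: Optional[str]) -> Dict[str, Any]:
    """Wrap a user query so it only matches one tenant's documents.

    Passing tenant_id=None scopes the search to unassigned documents.
    """
    if tenant_id is None:
        # Unassigned documents: the tenant field does not exist yet.
        scope = {"bool": {"must_not": [{"exists": {"field": TENANT_FIELD}}]}}
    else:
        scope = {"term": {TENANT_FIELD: tenant_id}}
    # The tenant restriction goes in filter context, so it does not
    # influence relevance scoring of the user's query.
    return {"query": {"bool": {"must": [user_query], "filter": [scope]}}}


body = tenant_scoped_query({"match": {"title": "invoice"}}, "tenant1")
```

The resulting `body` would be sent to the `_search` endpoint of the shared index. Because the tenant id never changes once set, the same wrapper works before and after assignment without moving the document.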
When you refer to a 'tenant', is it a physical host you want to route the data to?
Or is everything logically separated on the same hardware?
I would go with the index per tenant approach.
For example:
Tenant1 will have indices that follow a naming convention like tenant1-my-data-source-date.
Tenant2 will have indices like tenant2-my-data-source-date.
And so on.
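A naming convention like this could be generated with a small helper. This is only a sketch: the function names and the `YYYY.MM.DD` date suffix are assumptions for illustration, not something specified in this thread. Names are lowercased because Elasticsearch requires lowercase index names.

```python
from datetime import date


def tenant_index_name(tenant: str, source: str, day: date) -> str:
    """Build an index name such as tenant1-my-data-source-2020.01.05.

    The date format is an assumed convention; index names are
    lowercased because Elasticsearch rejects uppercase index names.
    """
    return f"{tenant}-{source}-{day:%Y.%m.%d}".lower()


def tenant_index_pattern(tenant: str, source: str) -> str:
    """Wildcard pattern covering all dated indices of one tenant."""
    return f"{tenant}-{source}-*".lower()


name = tenant_index_name("Tenant1", "my-data-source", date(2020, 1, 5))
pattern = tenant_index_pattern("Tenant1", "my-data-source")
```

Searching against the wildcard pattern keeps each tenant's queries confined to that tenant's own indices.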
Thanks for your reply. I am looking to logically separate the data. However, as I mentioned previously, when a document is first indexed it does not belong to any tenant; it sits unassigned in our system. Subsequently, the document is 'assigned' to one of our tenants. So what I am grappling with is the best way to move this document into the appropriate tenant's index (or assign it a routing key) when the document gets updated with tenant information, which I do not have initially.
How many tenants do you need to support? What is the expected total data volume? How many concurrent queries do you need to support? How much data are you indexing per day?
I'm not sure how you are doing the assignment.
But at that stage, you could re-index the document into the relevant new index.
So initially you would index the document into a generic index --> tenant0-my-data-source-date
Then re-index the document into the relevant index --> tenantN-my-data-source-date.
This is one of several possible solutions.
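The two-step move described above might look roughly like this. The sketch only builds the HTTP requests (method, path, body) rather than sending them. It assumes the `_reindex` API with an `ids` query to select a single document, followed by a delete of the original. The helper name `move_document_requests` is hypothetical, and the `/_doc/` path assumes a typeless (7.x-style) index.

```python
from typing import Any, Dict, List, Optional, Tuple

Request = Tuple[str, str, Optional[Dict[str, Any]]]


def move_document_requests(doc_id: str,
                           source_index: str,
                           dest_index: str) -> List[Request]:
    """Return the requests needed to move one document between indices.

    Step 1 copies the document with the _reindex API, selecting it by id.
    Step 2 deletes the original from the generic index; it should only be
    sent after step 1 has succeeded.
    """
    reindex_body = {
        "source": {
            "index": source_index,
            "query": {"ids": {"values": [doc_id]}},
        },
        "dest": {"index": dest_index},
    }
    return [
        ("POST", "/_reindex", reindex_body),
        ("DELETE", f"/{source_index}/_doc/{doc_id}", None),
    ]


requests_to_send = move_document_requests(
    "42", "tenant0-my-data-source-date", "tenant7-my-data-source-date"
)
```

Note that the copy and the delete are not atomic, so for a short window the document may be visible in both indices or, if the delete is sent too early, in neither.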
This 'assignment' happens at some point in time outside this system, and the document in ES will be updated with a tenant id at that point. Yes, indexing into a generic index and then re-indexing is an option I'm considering, but on the face of it, it sounds sub-optimal to me. Being new to ES, I was wondering if there is a better option. At least so far, it seems like a toss-up between your suggestion and concluding that ES may not fit our use case very well, in which case we would evaluate other solutions. Appreciate the inputs btw!
Yes, it seems like ELK is not optimal here, though it should be possible to find a solution.
So when the document is first created without a tenant assigned, what is the purpose of it being in Elasticsearch? Do you still search it? Use it for visualisations?
If not, skip this stage and just send the doc when it moves to a more relevant status.
Just thoughts, without really understanding your full needs.
Sure, all great questions. Unfortunately, we would still like to have these documents available to search in their initial state. Agreed, skipping would be an option if we did not want to search this data. The proposed solution is probably workable starting off, but I foresee issues if and when the data or the number of tenants grows over time.