I have done a lot of reading in the "Definitive Elastic Guide" and read many of the suggestions on the web and in this community... but I still have a handful of questions and concerns about the index and sharding approach, because our scenario is significantly different.
The Scenario
Multi-tenant environment where multiple customers store their docs in a single index
Customers can vary significantly in size and usage (number of docs and read frequency)
Need the ability to reindex customers to their own dedicated indexes or shards when they get "too big" for the shared index
Need row-level security to provide tenant-level read access.
Documents have common/identical fields, but each tenant can define additional custom fields
We will be using Elasticsearch 7.x or higher, so no multi _doc type support
This is a new product, so we have no idea what the workloads or data models will be. It's important to have a flexible architecture and code that can support easy re-architecture.
Pricing for this product does not justify expensive X-Pack or other solutions; we have to stay with free or very low-cost options
This will be hosted on AWS or Azure. No Elastic Cloud
The Design
This is our initial design idea. Feel free to suggest, poke holes, ask questions, point out any issues...
_id = tenantID + documentID to ensure uniqueness
All the common/identical doc fields will be added to the index
All customer-defined doc fields will be added to the index with a unique name to make sure they do not collide with each other.
When a customer makes a request, use the RESTReadOnly plugin to provide doc-level access, and use source filtering to return only the common fields + that tenant's custom fields
If the field count grows past 1000, move customers with a large number of custom fields to a new index to reduce the number of fields.
If a customer's read/query load or data gets too big, move them to a new index or new shard.
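To make the shape of that design concrete, here is a minimal sketch of the request/document shaping it implies. Everything here is an assumption for illustration: the `tenant_id` field, the `::` separator, and the `c_<tenant>_` prefix are not Elasticsearch conventions, just one way to implement the scheme above.

```python
# Sketch of the multi-tenant _id, field-namespacing, and source-filtering
# scheme described above. All names (tenant_id, prefixes) are illustrative.

def make_doc_id(tenant_id: str, document_id: str) -> str:
    """Composite _id; a separator avoids collisions like ("ab","c") vs ("a","bc")."""
    return f"{tenant_id}::{document_id}"

def namespace_custom_fields(tenant_id: str, custom: dict) -> dict:
    """Prefix tenant-defined fields so different tenants' mappings cannot collide."""
    return {f"c_{tenant_id}_{name}": value for name, value in custom.items()}

def tenant_search_body(tenant_id: str, query: dict,
                       common_fields: list, custom_fields: list) -> dict:
    """Wrap the caller's query in a tenant filter, and source-filter the
    response down to the common fields plus this tenant's own custom fields."""
    return {
        "query": {
            "bool": {
                "must": [query],
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        },
        "_source": common_fields + [f"c_{tenant_id}_{f}" for f in custom_fields],
    }
```

The point of the separator in `make_doc_id` is that plain concatenation is not actually unique: tenant "ab" + doc "c" would collide with tenant "a" + doc "bc".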
Potential Problems
As stated above, this is our initial design for a new product and we are new to Elastic. We don't know what we don't know, and we don't have real-life data to base decisions on. Here are some areas where we think we might run into issues and would like your thoughts:
Is this a viable multi-tenant/shared index design?
Is RESTReadOnly a good choice? Any other suggestions for controlling doc-level access?
Can 1000 fields be supported in a single index without outsized costs in RAM, CPU and network?
Is source filtering a good option for limiting the fields available in responses?
Is our reindexing and resharding approach sound for tenants that get too big?
Are there tools out there to help us move/split indexes?
What else are we missing?
Your help, experience, feedback, encouragement would be greatly appreciated.
Ok, well you can take this with a grain of salt, but you'd probably be best off getting someone in - as in paying for their time - to understand this in detail and give you advice and direction. We have services that can assist, but there are other people out there too.
I say that because while we can provide answers here to the best of our ability, it is not what I would bet the future of a business-critical project on. And yes, that applies to the advice that I give as well.
On to your questions:
Depends on what sort of data this is: is it time-based or something else?
We'd suggest using our Security functionality with field- and document-level security. It's an ok choice if it works for you though
What is "outsized"?
It's an option; whether it's a good one is questionable unless you are doing query validation to stop someone trying to get around things by querying your cluster directly
Yes, it's a sane approach. Just put some kind of monitoring around things so you can spot large tenants before they get too large
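One cheap way to do that monitoring, assuming the shared index carries a `tenant_id` keyword field (an assumption from this thread, not a built-in), is a periodic terms aggregation over it. A sketch of the search body a cron job might POST to `<index>/_search`:

```python
def tenant_size_agg(max_tenants: int = 100) -> dict:
    """Search body returning per-tenant doc counts (largest buckets first),
    so a scheduled job can alert on tenants approaching a size threshold."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "docs_per_tenant": {
                "terms": {"field": "tenant_id", "size": max_tenants}
            }
        },
    }
```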
Use the _reindex and/or _split APIs. There may be wrappers around those, but I haven't seen any yet. Alternatively look at seeing if ILM will work for you
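For the move itself, the `_reindex` body accepts a query, so only one tenant's documents get copied out of the shared index. A sketch of that body (the index names and `tenant_id` field are assumptions from this thread):

```python
def tenant_reindex_body(tenant_id: str, source_index: str, dest_index: str) -> dict:
    """Body for POST _reindex: copy a single tenant's documents from the
    shared index into a dedicated index, selected by a tenant_id term filter."""
    return {
        "source": {
            "index": source_index,
            "query": {"term": {"tenant_id": tenant_id}},
        },
        "dest": {"index": dest_index},
    }
```

After the copy completes and is verified, you would delete that tenant's documents from the shared index (e.g. with a delete-by-query on the same term filter) and repoint the tenant at the new index.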
See my earlier points. I would strongly suggest it'd be worth using Elastic Cloud: it saves you managing the underlying instances, and gives you access to field- and document-level security, machine learning and alerting (to automatically track tenant growth and notify you of anomalies), automated backups, the latest versions, and heaps more
Thanks for the answers. To respond to your points:
The data is not time series based; it is de-normalized documents.
Could you please provide a few links to the Elastic docs on using field and document security?
1000 fields per doc type: would that be considered too big/outsized?
If a tenant uses a query to get at fields that are masked by the source filter, that's not an issue; they'll just get a bunch of unused fields. The only concern is document-level querying, which I am assuming will not be an issue because of #2.
We really tried to use Elastic Cloud, and are still going to try. It's just that our customers' legal requirements don't allow us to.
Having users be able to define their own fields can lead to mapping conflicts as well as large mappings. One way I have seen this handled for systems with a very large number of users is to insert an API layer ahead of Elasticsearch and not give users direct access to Elasticsearch. This API layer can ensure that the data is filtered per tenant and that all indexed data contains a proper tenant id field. This might allow you to run with a single system user rather than set up millions of unique users in the system (which may be too much to manage). It may also allow you to create a few generic fields of each supported data type and then map custom tenant fields onto these standard fields. This reduces mapping size and avoids conflicts, but can naturally affect scoring and will require you to rename fields when indexing and searching.
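The generic-field idea above might look something like this in the API layer: a per-tenant registry that assigns each custom field to one slot from a fixed pool of pre-mapped fields of the matching type, and renames on the way in and out. The pool names and sizes (`generic_keyword_1`, etc.) are assumed for illustration; nothing here is an Elasticsearch feature, it's purely application-side bookkeeping.

```python
# Sketch of mapping tenant custom fields onto a fixed pool of generic,
# pre-mapped fields, as suggested above. Pool names/sizes are assumptions.

POOL_SIZES = {"keyword": 5, "long": 5, "date": 3}  # fields pre-mapped per type

class TenantFieldRegistry:
    def __init__(self):
        self._assignments = {}  # (tenant_id, field_name) -> generic field name
        self._used = {}         # (tenant_id, field_type) -> slots taken

    def generic_name(self, tenant_id: str, field: str, field_type: str) -> str:
        """Return the generic field backing this tenant's custom field,
        assigning the next free slot of that type on first use."""
        key = (tenant_id, field)
        if key not in self._assignments:
            used = self._used.get((tenant_id, field_type), 0)
            if used >= POOL_SIZES[field_type]:
                raise ValueError(f"no free generic {field_type} field for {tenant_id}")
            self._assignments[key] = f"generic_{field_type}_{used + 1}"
            self._used[(tenant_id, field_type)] = used + 1
        return self._assignments[key]
```

Because the pool is per tenant, two tenants can both use slot `generic_keyword_1` for completely different fields; the registry (persisted somewhere durable in a real system) is what keeps the translation straight.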