We're considering deploying an Elasticsearch cluster to support many thousands of users, each with unique and potentially conflicting data mappings.
Additionally, the data corpus size varies significantly between users.
Is there a more effective solution than creating a separate index for each user? What are the downsides of this approach?
Hi,
An alternative approach could be to use a smaller number of indices, and use a field in each document to identify the user that the document belongs to. You can then use this field to filter queries and control access to the data.
You could use the Elasticsearch security features to define roles that have access to only the documents that belong to a specific user. This would allow you to isolate the data for individual users, without needing to create a separate index for each user.
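As a rough sketch of the two ideas above, the request bodies might look something like this. The field name `user_id`, the index pattern, and both helper functions are assumptions made for illustration; the role body follows the general shape of a role with a document-level security `query`, but please check it against the security documentation for your version before relying on it.

```python
def user_scoped_query(user_id, user_query):
    """Wrap a query so it can only match documents owned by one user.

    `user_id` is a hypothetical keyword field stored on every document.
    """
    return {
        "query": {
            "bool": {
                "must": [user_query],
                # Filter clauses do not affect scoring and can be cached.
                "filter": [{"term": {"user_id": user_id}}],
            }
        }
    }


def user_role_body(index_pattern, user_id):
    """Build a role body that grants read access to one user's documents only."""
    return {
        "indices": [
            {
                "names": [index_pattern],
                "privileges": ["read"],
                # Document-level security: only matching documents are visible.
                "query": {"term": {"user_id": user_id}},
            }
        ]
    }


search_body = user_scoped_query("user-42", {"match": {"name": "smith"}})
role_body = user_role_body("shared-users-*", "user-42")
```

The application-side filter and the role are independent: the first relies on your application always adding the filter, while the second has Elasticsearch enforce it for whoever holds the role.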
Regards
There have been a lot of improvements in the handling of large numbers of indices and shards in recent versions, so I am not sure what the limits are nowadays. Traditionally Elasticsearch has not scaled or performed well with very large numbers of small indices and shards, so while tens of thousands of indices/shards may be fine, I doubt a single cluster with millions of indices/shards is viable.
Unless someone with more experience of recent clusters with large numbers of indices chimes in, I would recommend you test it before going down that route, as my hunch about viability may be wrong.
Can you provide more context on what you want to do with this?
From what you described, Elasticsearch does not seem to be a good choice.
One main issue is that you could end up with a lot of indices of very different sizes, and this can impact your cluster.
When balancing the cluster, Elasticsearch tries to keep an equal number of shards on each node, and having shards of different sizes can affect the balancing of the cluster.
Also, you could have issues with hot spotting, where resources are unevenly distributed across nodes.
And as already mentioned, I don't think a single cluster will work well with millions of indices.
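To illustrate the balancing point, here is a toy model, not Elasticsearch's actual allocator. It assumes a balancer that only equalizes the shard count per node, and the shard sizes are made up.

```python
def assign_by_count(shard_sizes_gb, node_count):
    """Round-robin shards over nodes so every node holds the same shard count."""
    nodes = [[] for _ in range(node_count)]
    for i, size in enumerate(shard_sizes_gb):
        nodes[i % node_count].append(size)
    return nodes


# Hypothetical per-user shards: a few large users and many tiny ones.
shard_sizes_gb = [500, 1, 1, 400, 1, 1, 300, 1, 1]
nodes = assign_by_count(shard_sizes_gb, 3)

for n, shards in enumerate(nodes):
    print(f"node {n}: {len(shards)} shards, {sum(shards)} GB")
```

Every node ends up with three shards, yet one node holds almost all of the data, which is the kind of hot spot described above.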
Thanks for weighing in.
What approaches can be taken to handle variability in document mappings within a single index when the schema is not enforced? For instance, how can documents with a field defined as an integer by one user and as a string by another user coexist in the same index?
Cheers
Thanks for chiming in.
We're dealing with a problem similar to building a global Elasticsearch service for all municipalities in the world to store and search their residents' data. The total number of documents, assuming a single document per human, is not that large, up to 100 billion.
Some fields, like name and address, are standard, but each municipality may store unique information, like agricultural data (e.g. livestock ownership). And sometimes, different users define field types differently.
Given the extremely large number of shards, wouldn't Elasticsearch be able to distribute them effectively to ensure a balanced load?
What alternative approaches would make sense here?
Is each user only searching its own data?
It is not just distribution of shards that is the issue. Elasticsearch keeps track of all shards and also metadata like mappings and settings related to them, so the amount of data that needs to be tracked increases with the number of indices and shards. This has been improved lately, but this data is stored in the cluster state, which is updated by the master whenever anything changes related to mappings or shard location. Having a very large cluster state, which needs to be replicated across the cluster, can become a bottleneck.
One approach I have seen used in the past to handle scenarios like this assumes each user can only search its own data. If there are any common standard fields, these are defined in the index template. A number of generic fields of different types are then defined and mapped as needed. Dynamic fields that are not part of the standard set are then mapped to these generic fields for each user by the application when indexing and querying. As different types of data will be stored in a single field for different users, it does affect relevancy, but if data is primarily retrieved using filtering this can work well.
This naturally requires the field mapping for each user to be maintained outside of Elasticsearch.
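A minimal sketch of that application-side translation could look like the following. The generic field names (`int_1`, `kw_1`, ...), the `user_id` field, and the `FieldRegistry` class are all made up for illustration, and a real system would persist the registry in a database rather than in memory.

```python
# Generic fields assumed to be pre-defined in the index template, by type.
GENERIC_SLOTS = {
    "integer": ["int_1", "int_2", "int_3"],
    "keyword": ["kw_1", "kw_2", "kw_3"],
}
# Standard fields shared by all users, stored under their own names.
STANDARD_FIELDS = {"name", "address"}


class FieldRegistry:
    """Per-user mapping from custom field names to generic index fields."""

    def __init__(self):
        self._assignments = {}  # user_id -> {field: (type, generic_field)}

    def slot_for(self, user_id, field, field_type):
        user_map = self._assignments.setdefault(user_id, {})
        if field in user_map:
            known_type, slot = user_map[field]
            if known_type != field_type:
                raise ValueError(f"{field!r} is already mapped as {known_type}")
            return slot
        used = {slot for _, slot in user_map.values()}
        for slot in GENERIC_SLOTS[field_type]:
            if slot not in used:
                user_map[field] = (field_type, slot)
                return slot
        raise RuntimeError(f"no free {field_type} field left for {user_id}")

    def to_indexed_doc(self, user_id, doc, field_types):
        """Rename a user's custom fields to generic fields before indexing."""
        out = {"user_id": user_id}
        for field, value in doc.items():
            if field in STANDARD_FIELDS:
                out[field] = value
            else:
                out[self.slot_for(user_id, field, field_types[field])] = value
        return out


registry = FieldRegistry()
doc_a = registry.to_indexed_doc(
    "town_a", {"name": "Ann", "livestock": 12}, {"livestock": "integer"}
)
doc_b = registry.to_indexed_doc(
    "town_b", {"name": "Bob", "livestock": "12 cows"}, {"livestock": "keyword"}
)
```

Here the same `livestock` field is an integer for one user and a string for another, and the two land in different generic fields, so the index mapping never sees a type conflict. The same lookup has to be applied in reverse when building queries and when returning results to the user.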