I'm attempting to integrate elasticsearch into a multitenant web
application. I have data segmented into tens of thousands of
'tenants', and then further subdivided by user within a tenant. I'd
like to make it so that my users can readily access any data within
their tenant, with optional visibility rules allowing finer-grained
sharing (for instance, sharing certain types of data with other users
in the tenant, while retaining exclusive access to other types).
Towards this goal, I'm trying to figure out the best way of indexing
my documents within ES. My initial impulse was to create an index for
each tenant, but some cursory research indicated this was a Bad Idea.
Maintaining tens of thousands of indexes while adding more every time
a new tenant is created is almost certainly untenable. I'm stuck,
therefore, trying to decide what criteria to use when creating
indexes. I have a few ideas, mostly centering around heuristic data
such as geographic location, number of active users and so forth, but
nothing jumps out as the obviously best course of action. Though,
regardless of how many indexes I'm running and how I'm determining
which data to index in each, it seems like routing documents based on
the tenant id would be ideal for my needs. Can anyone offer some
advice on what kind of indexing strategy to employ for this type of
use case?
Some additional information that might be relevant:
Each tenant/user has the same types of data to index, but there may
be differences in how each type is mapped. That is, a type might have
some fields for one user, and others for another, and may need to be
tokenized/analyzed differently for both. This seems to indicate that
establishing different indexes based on different type mappings may be
the way to go, but I doubt there are enough such differences to
warrant more than a handful of different indexes. Is there any
performance hit associated with putting vast amounts of data into a
small number of indexes, assuming a per-tenant id routing strategy?
Almost all queries will need to be filtered by tenant, by user, or
by some combination of visibility rules. That said, some users need
the ability to query across all tenants, but the performance of such
queries need not be as high.
I'm using MongoDB as my data store, and see a fairly obvious one-to-one
mapping of Mongo Collection to ES document type. This suggests that
using types as a way of dividing data within an index by tenant might
not work, since I will likely need to use the types for collection
mapping.
Hi Ellery,
I'm new to ES so I'm afraid I don't have answers for you, but I am
curious what led you to the conclusion that creating an index per
tenant was a bad idea?
> My initial impulse was to create an index for
> each tenant, but some cursory research indicated this was a Bad Idea.
Yes, I'd agree with that. Each index comes with overhead, so 10, 20, maybe
100 or more indices would be fine; 10,000 wouldn't.
> Maintaining tens of thousands of indexes while adding more every time
> a new tenant is created is almost certainly untenable. I'm stuck,
> therefore, trying to decide what criteria to use when creating
> indexes. I have a few ideas, mostly centering around heuristic data
> such as geographic location, number of active users and so forth, but
> nothing jumps out as the obviously best course of action. Though,
> regardless of how many indexes I'm running and how I'm determining
> which data to index in each, it seems like routing documents based on
> the tenant id would be ideal for my needs. Can anyone offer some
> advice on what kind of indexing strategy to employ for this type of
> use case?
I'd say that you should just give each doc that belongs to a particular
tenant a tenant ID, then you can filter the results based on that. And
I agree with your idea of using the tenant ID for routing.
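To make that concrete, here is a minimal sketch of the approach in Python: every document carries a tenant_id field, writes pass the tenant ID as the routing value, and searches are wrapped in a term filter on it. The index layout and field names (tenant_id, title) are invented for illustration, and the filtered query shown is the old-style DSL from the era of this thread; the dicts are simply the bodies/params you would send to Elasticsearch.

```python
def index_request(tenant_id, doc):
    """Build the body and URL params for indexing one tenant-scoped doc."""
    body = dict(doc, tenant_id=tenant_id)
    # routing=tenant_id keeps all of a tenant's docs on one shard, so a
    # tenant-scoped search only needs to touch that shard
    params = {"routing": tenant_id}
    return body, params

def tenant_query(tenant_id, user_query):
    """Wrap an arbitrary query in a filter restricted to one tenant."""
    return {
        "query": {
            "filtered": {
                "query": user_query,
                "filter": {"term": {"tenant_id": tenant_id}},
            }
        }
    }

body, params = index_request("acme", {"title": "hello"})
q = tenant_query("acme", {"match": {"title": "hello"}})
```

Because the routing value is the tenant ID, the filter and the routing agree: the search is both restricted to the tenant's documents and physically confined to the tenant's shard.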
> Each tenant/user has the same types of data to index, but there may
> be differences in how each type is mapped. That is, a type might have
> some fields for one user, and others for another, and may need to be
> tokenized/analyzed differently for both. This seems to indicate that
> establishing different indexes based on different type mappings may be
> the way to go, but I doubt there are enough such differences to
> warrant more than a handful of different indexes.
A few options here. It may be possible to use a single type for all of
your tenants. For instance:

- if one tenant has fields foo and bar, and another has bar and baz,
  you can store docs from both tenants in the same type, just adding
  the relevant fields
- failing that, you could just name the fields differently:
  name_v1, name_v2
- if the mappings are so different that you don't want to combine them
  into one type, then you could use different types within the
  same index, eg user_v1, user_v2
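A rough illustration of the first two options (all tenant, field, and document names here are made up): differently-shaped documents can share one type, each carrying only the fields that apply to it, and a field that must be analyzed differently per tenant can be versioned instead.

```python
# Option 1: one shared type, sparse superset of fields. Elasticsearch
# does not require every document to populate every mapped field.
us_location = {
    "tenant_id": "us_tenant",
    "street": "123 Main St",
    "city": "Springfield",
    "state": "IL",
    "zipcode": "62701",
}
jp_location = {
    "tenant_id": "jp_tenant",
    "prefecture": "Tokyo",
    "municipality": "Shibuya",
    "postal_code": "150-0001",
}

# Option 2: when the *same* logical field needs different mappings,
# version the field name so each variant gets its own mapping entry.
versioned = {"tenant_id": "fr_tenant", "name_v2": "Exemple"}
```

The cost of the sparse approach is a wider mapping; the cost of versioned fields is that queries must know which variant a given tenant writes.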
> Is there any
> performance hit associated with putting vast amounts of data into a
> small number of indexes, assuming a per-tenant id routing strategy?
No, this is something that Elasticsearch is good at. First you can
play around with the number of shards that you set when you create the
index. Second, the number of replicas, which can be updated dynamically.
Third, you can look at using aliases to combine multiple indices (for
read purposes; to write, you will need to write to one index, or an
alias that points to only one index).
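A sketch of that alias layout, with invented index and alias names: one read alias spanning the backing indices, and a write alias resolving to exactly one of them. The dict is the body you would POST to the _aliases endpoint.

```python
alias_actions = {
    "actions": [
        # the read alias fans out over all backing indices
        {"add": {"index": "tenants_0001", "alias": "tenants_read"}},
        {"add": {"index": "tenants_0002", "alias": "tenants_read"}},
        # the write alias must resolve to a single index
        {"add": {"index": "tenants_0002", "alias": "tenants_write"}},
    ]
}
# Searches go to tenants_read; indexing goes to tenants_write. When a new
# backing index is added, only the alias actions change, not the clients.
```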
> Almost all queries will need to be filtered by tenant, by user, or
> by some combination of visibility rules. That said, some users need
> the ability to query across all tenants, but the performance of such
> queries need not be as high.
Querying across indices is easy, and querying the same index filtering
by one or many tenant and user IDs is also easy and fast.
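For example (index and field names invented): a cross-index search is just a comma-separated index list in the URL, and a visibility-restricted search combines a tenant term filter with a terms filter over the user IDs the caller is allowed to see, again using the old-style filtered/and filters from the DSL of this era.

```python
# Searching several indices at once is just a path convention:
cross_index_path = "/tenants_0001,tenants_0002/_search"

def visible_docs_query(tenant_id, user_ids):
    """Restrict to one tenant AND any of the users visible to the caller."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "and": [
                        {"term": {"tenant_id": tenant_id}},
                        {"terms": {"user_id": user_ids}},
                    ]
                },
            }
        }
    }

q = visible_docs_query("acme", ["u1", "u2"])
```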
--
Alexandre Heimburger
R&D Manager
blueKiwi Software
tel : +33687880997
email : ahb@bluekiwi-software.com
address : 93 rue Vieille du Temple, 75003 Paris
> I also work for a multi-tenant product and we have built one index per
> tenant.
Using one index per tenant is often the right solution, but not if you
have 10,000 tenants.
Each shard in each index is a separate Lucene instance, which has some
overhead. On top of that, each node in the cluster needs to
maintain information about all indices and shards that exist in the
cluster, which has its own overhead.
Thanks for the ideas- I shall read up on each of them!
I also realize I might not have given a suitable example when I was
discussing the document type requirements- my apologies. A clearer
picture of my needs is something like this: within my application, I
have different types of data, such as "User", "Location", "BlogPost",
and so on. Every tenant would need to have documents of those types,
but the structure of the data might be different from tenant to
tenant. For instance, my "Location" documents might have Street, City,
Zipcode, County and State for tenants in the United States, but
Prefecture, Municipality, City, District, City Block, House Number,
and Postal Code for tenants in Japan. This is a simple example- the
structure could vary quite a bit more than that, including deeply
nested fields, parent/child relationships, and so on. In addition to
the different fields, I may also want to tokenize/analyze/etc the
documents differently from tenant to tenant, depending on the needs of
the users in the tenant. Given that, for instance, a US tenant would
never want to see Location documents with a "Prefecture" field, or a
Spanish tenant might want to analyze fields in 'BlogPost' documents
differently than an English tenant, it seemed that multiple indexes
with different type mappings in each was the way to go. However, I
shall investigate the options that you presented- as I mentioned, I am
still very much an elasticsearch newbie. Thank you again for your
help!
> Given that, for instance, a US tenant would
> never want to see Location documents with a "Prefecture" field,
If you don't set one, they won't see it. The mapping knows how to
handle the different field types, but it won't "auto-create" them in
your doc if they are missing.
> Spanish tenant might want to analyze fields in 'BlogPost' documents
> differently than an English tenant,
Sure - using the _analyze field I mentioned, you might be able to
achieve what you want here.
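One sketch of how the versioned-field idea combines with per-field analyzers for the language case (type name, field names, and the not_analyzed tenant_id convention are all assumptions for illustration): each language variant of a field gets its own analyzer in the shared mapping, so an English tenant writes body_en and a Spanish tenant writes body_es.

```python
blogpost_mapping = {
    "blogpost": {
        "properties": {
            # exact-match field used for tenant filtering/routing
            "tenant_id": {"type": "string", "index": "not_analyzed"},
            # per-language variants of the same logical field, each with
            # the matching language analyzer
            "body_en": {"type": "string", "analyzer": "english"},
            "body_es": {"type": "string", "analyzer": "spanish"},
        }
    }
}
```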