What is the best indexing strategy for multitenant data?


(Ellery Crane-2) #1

I'm attempting to integrate elasticsearch into a multitenant web
application. I have data segmented into tens of thousands of
'tenants', and then further subdivided by user within a tenant. I'd
like to make it so that my users can readily access any data within
their tenant, with optional visibility rules allowing finer grained
sharing (for instance, sharing certain types of data with other users
in the tenant, while retaining exclusive access to other types).

Towards this goal, I'm trying to figure out the best way of indexing
my documents within ES. My initial impulse was to create an index for
each tenant, but some cursory research indicated this was a Bad Idea.
Maintaining tens of thousands of indexes while adding more every time
a new tenant is created is almost certainly untenable. I'm stuck,
therefore, trying to decide what criteria to use when creating
indexes. I have a few ideas, mostly centering around heuristic data
such as geographic location, number of active users and so forth, but
nothing jumps out as the obviously best course of action. Though,
regardless of how many indexes I'm running and how I'm determining
which data to index in each, it seems like routing documents based on
the tenant id would be ideal for my needs. Can anyone offer some
advice on what kind of indexing strategy to employ for this type of
use case?

Some additional information that might be relevant:

  • Each tenant/user has the same types of data to index, but there may
    be differences in how each type is mapped. That is, a type might have
    some fields for one user, and others for another, and may need to be
    tokenized/analyzed differently for both. This seems to indicate that
    establishing different indexes based on different type mappings may be
    the way to go, but I doubt there are enough such differences to
    warrant more than a handful of different indexes. Is there any
    performance hit associated with putting vast amounts of data into a
    small number of indexes, assuming a per-tenant id routing strategy?

  • Almost all queries will need to be filtered by tenant, by user, or
    by some combination of visibility rules. That said, some users need
    the ability to query across all tenants, but the performance of such
    queries need not be as high.

  • I'm using MongoDB as my data store, and see a fairly obvious one-to-
    one mapping of Mongo Collection to ES document type. This suggests that
    using types as a way of dividing data within an index by tenant might
    not work, since I will likely need to use the types for collection
    mapping.

Any advice on this issue is much appreciated.


(Drew H) #2

Hi Ellery,
I'm new to ES so I'm afraid I don't have answers for you, but I am
curious what led you to the conclusion that creating an index per
tenant was a bad idea?

Thanks,
Drew



(Clinton Gormley) #3

Hi Ellery

My initial impulse was to create an index for
each tenant, but some cursory research indicated this was a Bad Idea.

Yes, I'd agree with that. Each index comes with overhead, so 10, 20,
maybe even 100 indices would be fine. 10,000 wouldn't.

Maintaining tens of thousands of indexes while adding more every time
a new tenant is created is almost certainly untenable. I'm stuck,
therefore, trying to decide what criteria to use when creating
indexes. I have a few ideas, mostly centering around heuristic data
such as geographic location, number of active users and so forth, but
nothing jumps out as the obviously best course of action. Though,
regardless of how many indexes I'm running and how I'm determining
which data to index in each, it seems like routing documents based on
the tenant id would be ideal for my needs. Can anyone offer some
advice on what kind of indexing strategy to employ for this type of
use case?

I'd say that you should just give each doc that belongs to a particular
tenant a tenant ID, then you can filter the results based on that. And
I agree with your idea of using the tenant ID for routing.
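As a rough sketch of what that looks like in practice, the request body below filters every hit down to one tenant (and optionally one user) around the actual full-text query. The field names (tenant_id, user_id) and the query values are assumptions for illustration; the filtered/and shape matches the query DSL of the ES versions discussed in this thread.

```python
import json

def tenant_search_body(tenant_id, user_id, query_string):
    """Build a search body that restricts hits to one tenant
    (and optionally one user) around a full-text query."""
    filters = [{"term": {"tenant_id": tenant_id}}]
    if user_id is not None:
        filters.append({"term": {"user_id": user_id}})
    return {
        "query": {
            "filtered": {
                "query": {"query_string": {"query": query_string}},
                "filter": {"and": filters},
            }
        }
    }

# The same tenant ID doubles as the routing value, so the request can be
# sent as /index/_search?routing=acme and touch only that tenant's shard.
body = tenant_search_body("acme", "user-42", "quarterly report")
print(json.dumps(body, indent=2))
```

Because the filter and the routing value agree, a routine tenant-scoped search never leaves the shard that holds the tenant's documents.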

  • Each tenant/user has the same types of data to index, but there may
    be differences in how each type is mapped. That is, a type might have
    some fields for one user, and others for another, and may need to be
    tokenized/analyzed differently for both. This seems to indicate that
    establishing different indexes based on different type mappings may be
    the way to go, but I doubt there are enough such differences to
    warrant more than a handful of different indexes.

A few options here. It may be possible to use a single type for all of
your tenants. For instance:

  • if one tenant has fields foo and bar, and another has bar and baz,
    you can store docs from both tenants in the same type, just adding
    the relevant fields

  • you mention different analysis. how would this be different? If
    it is a question of language, then you might be able to make
    this work by using the _analyzer field:
    http://www.elasticsearch.org/guide/reference/mapping/analyzer-field.html

    alternatively, you could use multi-fields, where one version of the
    field is analyzed with analyzer_1, and another version with
    analyzer_2
    http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html

    failing that, you could just name the fields differently:
    name_v1, name_v2

  • if the mappings are so different that you don't want to combine them
    into one type, then you could use different types within the
    same index, eg user_v1, user_v2
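For reference, the multi-field option above might look something like the mapping snippet below, expressed here as a plain Python dict. The type name blog_post, the field names, and the analyzer choices are all made up for illustration; multi_field is the type described in the multi-field docs linked above.

```python
# Sketch of a multi-field mapping: the same source value is indexed twice,
# once per analyzer, so each tenant's queries can target the right version.
blog_post_mapping = {
    "blog_post": {
        "properties": {
            "title": {
                "type": "multi_field",
                "fields": {
                    # default version, analyzed for English-language tenants
                    "title": {"type": "string", "analyzer": "english"},
                    # second version of the same value, analyzed for Spanish
                    "title_es": {"type": "string", "analyzer": "spanish"},
                },
            }
        }
    }
}

# A Spanish tenant would query the "title.title_es" version of the field;
# everyone else queries plain "title".
```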

Is there any
performance hit associated with putting vast amounts of data into a
small number of indexes, assuming a per-tenant id routing strategy?

No, this is something that ElasticSearch is good at. First you can
play around with the number of shards that you set when you create the
index. Second, the number of replicas, which can be updated dynamically.
Third, you can look at using aliases to combine multiple indices (for
read purposes; to write, you will need to write to a single index, or an
alias that points to exactly one index).
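That alias setup can be sketched as an _aliases actions body; the index and alias names here are placeholders, not anything from the thread.

```python
# Read alias spans every physical index; write alias points at exactly one,
# which keeps indexing unambiguous while searches cover everything.
alias_actions = {
    "actions": [
        {"add": {"index": "tenants_v1", "alias": "tenants_read"}},
        {"add": {"index": "tenants_v2", "alias": "tenants_read"}},
        {"add": {"index": "tenants_v2", "alias": "tenants_write"}},
    ]
}
```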

  • Almost all queries will need to be filtered by tenant, by user, or
    by some combination of visibility rules. That said, some users need
    the ability to query across all tenants, but the performance of such
    queries need not be as high.

Querying across indices is easy, and querying the same index filtering
by one or many tenant and user IDs is also easy and fast.
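The two query shapes can be sketched side by side: a routine tenant-scoped search is routed to one shard, while the rare cross-tenant admin search simply omits routing and fans out to every shard. The host and index names below are placeholders.

```python
def search_url(host, indices, tenant_id=None):
    """Build a _search URL over one or more indices; the routing
    parameter restricts the search to the shard holding that tenant."""
    url = "%s/%s/_search" % (host, ",".join(indices))
    if tenant_id is not None:
        url += "?routing=%s" % tenant_id
    return url

# Routine tenant-scoped query: routed, so only one shard is searched.
scoped = search_url("http://localhost:9200", ["tenants"], tenant_id="acme")
# Rare cross-tenant admin query: no routing, fans out to every shard.
admin = search_url("http://localhost:9200", ["tenants", "archive"])
```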

hth

clint


(Alexandre Heimburger) #4

hi

I also work for a multi-tenant product and we have built one index per
tenant.

Then we have indices for users, spaces, content, etc.


--
Alexandre Heimburger
R&D Manager
blueKiwi Software

(Clinton Gormley) #5

Hiya

I also work for a multi-tenant product and we have built one index per
tenant.

Using one index per tenant is often the right solution, but not if you
have 10,000 tenants.

Each shard in each index is a separate Lucene instance, which has some
overhead. On top of that, each node in the cluster needs to
maintain information about all indices and shards that exist in the
cluster, which has its own overhead.

clint


(Ellery Crane-2) #6

On May 6, 6:00 am, Clinton Gormley clin...@iannounce.co.uk wrote:

A few options here. It may be possible to use a single type for all of
your tenants. For instance:

  • if one tenant has fields foo and bar, and another has bar and baz,
    you can store docs from both tenants in the same type, just adding
    the relevant fields

  • you mention different analysis. how would this be different? If
    it is a question of language, then you might be able to make
    this work by using the _analyzer field:
    http://www.elasticsearch.org/guide/reference/mapping/analyzer-field.html

    alternatively, you could use multi-fields, where one version of the
    field is analyzed with analyzer_1, and another version with
    analyzer_2
    http://www.elasticsearch.org/guide/reference/mapping/multi-field-type...

    failing that, you could just name the fields differently:
    name_v1, name_v2

  • if the mappings are so different that you don't want to combine them
    into one type, then you could use different types within the
    same index, eg user_v1, user_v2

Thanks for the ideas- I shall read up on each of them!

I also realize I might not have given a suitable example when I was
discussing the document type requirements- my apologies. A clearer
picture of my needs is something like this: within my application, I
have different types of data, such as "User", "Location", "BlogPost",
and so on. Every tenant would need to have documents of those types,
but the structure of the data might be different from tenant to
tenant. For instance, my "Location" documents might have Street, City,
Zipcode, County and State for tenants in the United States, but
Prefecture, Municipality, City, District, City Block, House Number,
and Postal Code for tenants in Japan. This is a simple example- the
structure could vary quite a bit more than that, including deeply
nested fields, parent/child relationships, and so on. In addition to
the different fields, I may also want to tokenize/analyze/etc the
documents differently from tenant to tenant, depending on the needs of
the users in the tenant. Given that, for instance, a US tenant would
never want to see Location documents with a "Prefecture" field, or a
Spanish tenant might want to analyze fields in 'BlogPost' documents
differently than an English tenant, it seemed that multiple indexes
with different type mappings in each was the way to go. However, I
shall investigate the options that you presented- as I mentioned, I am
still very much an elasticsearch newbie :) Thank you again for your
help!


(Clinton Gormley) #7

Hi Ellery

Given that, for instance, a US tenant would
never want to see Location documents with a "Prefecture" field,

If you don't set one, they won't see it. The mapping knows how to
handle the different field types, but it won't "auto-create" them in
your doc if they are missing.

Spanish tenant might want to analyze fields in 'BlogPost' documents
differently than an English tenant,

Sure - using the _analyzer field I mentioned, you might be able to
achieve what you want here.

clint

