Which mapping strategy for frequently modified mappings?

AlexC · December 2, 2013, 10:18pm

First, some background information: my ES cluster will store accounts and
users. Users belong to accounts (many-to-1), and each account can have
custom user attributes (of various types and possibly having different
analyzers) modifiable at runtime. There will be thousands of accounts in ES
with possibly hundreds of millions of users in total.

I see three options so far, and I cannot decide which one will work best.

The first option is to store all users in the same index using a single
mapping. Removing or changing a custom attribute will then require a full
reindex of all users, which would probably take quite a while. In addition,
the mapping will be huge, as it will contain the custom attributes of every
account.

The second option is to partition users by account using indexes - i.e. one
index will store only the users belonging to one account.

The third option is to partition users by account using mappings - the
users will be stored in the same index, but there will be a different
mapping for each account.

From a development point of view, option #2 is probably the best, but I
assume the large number of indexes will have a negative impact on ES/Lucene
performance.

Option #3 still looks better than #1, but since mappings don't support
aliases, I will have to handle a dynamic mapping name whenever I query the
users of a given account.
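To make the "dynamic mapping name" concern in option #3 concrete, here is a minimal sketch. It assumes the `/{index}/{type}/...` URL layout of Elasticsearch versions that support multiple mapping types per index (later releases removed mapping types). The `user_<account_id>` naming scheme, the `users` index name and the example field are hypothetical; the functions only build paths and request bodies and do not talk to a cluster.

```python
def mapping_type_for(account_id):
    # Hypothetical naming scheme: one mapping type per account.
    return "user_{}".format(account_id)


def search_path(index, account_id):
    # Every query has to compute the type name for the account it targets.
    return "/{}/{}/_search".format(index, mapping_type_for(account_id))


def put_mapping_request(index, account_id, custom_properties):
    # Path and body for a put-mapping call that only touches this account's type.
    type_name = mapping_type_for(account_id)
    path = "/{}/{}/_mapping".format(index, type_name)
    body = {type_name: {"properties": dict(custom_properties)}}
    return path, body


if __name__ == "__main__":
    # Hypothetical custom attribute for account 42.
    props = {"nickname": {"type": "string", "analyzer": "simple"}}
    print(search_path("users", 42))
    print(put_mapping_request("users", 42, props))
```

The point of the sketch is only that the type name becomes a function of the account id, so it has to be threaded through every query and mapping update.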

alex

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8fc322fc-15d5-44c1-be0a-bf7100c09f5c%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

spinscale · December 3, 2013, 8:43am

Hey,

Take some time to watch kimchy's talk about different data flows and how
you can make use of them with elasticsearch. There might be a solution that
mixes some of your suggestions.

Parent-child is also something to look at. See the links below.

--Alex

AlexC · December 3, 2013, 4:00pm

The 'user data flow' described by kimchy in his presentation matches my use
case exactly. He makes it clear that the way to go is to store all emails in
one big index (with enough shards allocated) and to use the
routing/filtering options to provide access to various slices of the index
(i.e. the emails belonging to one user). So option #2 in my original email
is off.
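Applied to my accounts and users, I understand the routing/filtering approach as one filtered alias per account on top of the shared index. A minimal sketch of the `_aliases` request body, assuming a shared index named `users` and an `account_id` field on every user document (both names are my own placeholders, as is the `account_<id>` alias naming):

```python
import json


def filtered_alias_action(index, account_id, field="account_id"):
    # One "add" action: the alias filters on the account and routes to its shard.
    return {
        "add": {
            "index": index,
            "alias": "account_{}".format(account_id),
            "filter": {"term": {field: account_id}},
            "routing": str(account_id),
        }
    }


def aliases_request(index, account_ids):
    # Body for POST /_aliases covering several accounts at once.
    return {"actions": [filtered_alias_action(index, a) for a in account_ids]}


if __name__ == "__main__":
    print(json.dumps(aliases_request("users", [42]), indent=2))
```

Searching through `account_42` would then only see that account's users and only hit the shard the routing value maps to.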

What he doesn't take into account is the need to change the mapping
definition, which requires a full reindex. Assuming the index contains a
billion documents, they will all have to be re-indexed, which could take a
very long time and temporarily doubles the disk space needed by the index
(as both the old and the new index will exist at the same time while the
reindex operation is in progress).
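A back-of-the-envelope sketch of that cost follows. The throughput and index size below are made-up placeholders, not measurements; only the one billion documents and the "two copies at once" assumption come from the scenario above.

```python
def estimate_reindex(doc_count, docs_per_second, index_size_gb):
    # Time for one full pass over all documents at a constant indexing rate.
    hours = doc_count / float(docs_per_second) / 3600.0
    # Old and new index coexist until the switch, so peak disk usage doubles.
    peak_disk_gb = 2 * index_size_gb
    return hours, peak_disk_gb


if __name__ == "__main__":
    # Hypothetical figures: 10,000 docs/s and a 500 GB index.
    hours, peak_gb = estimate_reindex(1000000000, 10000, 500)
    print("reindex time: {:.1f} h, peak disk: {} GB".format(hours, peak_gb))
```

Even with these optimistic placeholder numbers, a single attribute change in one account would cost more than a day of reindexing for everybody.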

I still believe option #3 (in kimchy's example, a mapping per user to store
that user's emails, which would result in as many mappings as there are
users) is the best option considering the need for reindexing. It's just
that I haven't found any references or recommendations for using this
technique to 'partition' documents.

alex
