(Not) yet another index-per-user scenario in real life - Elasticsearch, Redis, or something else?


(AGuereca) #1

Before continuing, I'd like to mention that I'm not really new to
Elasticsearch; I've used it since 2010 on more than one project, and over the
years I've seen this very same question asked more than once (I've even
pondered it myself before). But now that I need to make a design decision, I
want to take the approach of "unlearning" and try to see this
"seemingly trivial" topic with fresh eyes, and take the opportunity to get
feedback from the great experts that I know exist in this group. (Yes,
you!)

Scenario:
A web service that can potentially serve millions of users, each user
having on average thousands of content items, each content item
having ~30 metadata attributes that can be searched on (location, title,
tags, friends, etc.). A user should be able to search (mostly full-word
matches; no offsets or prefixes required) for content that matches a text
on some or all metadata attributes (tags and title are the common
scenario). The IMPORTANT rule is that search must
be restricted EXCLUSIVELY to the content a given user has access to (own
content plus shared content). Shared content is the tricky part, because a
user could be granted or denied access to thousands of items at any point.
Response latency should be low enough to implement this in an autocomplete
fashion.

Approach A: ( "Family style" index )
    Use Elasticsearch and create one big index (or one per attribute) that
spans all users; make proper use of aliases, filters, and routing at
index/search time to enforce data locality among shards.
    Pros: Straightforward, easy to implement; cross-user searches come for
free.
    Cons: The 80/20 rule tells us that most users won't really have much
content or use the service frequently, so building the per-user filters
(which internally, AFAIK, are Lucene indexes too) won't be cheap.
Also, per-user filters have to be recomputed each time a user is granted or
denied access to content. (Maybe more cons that I'm overlooking.)
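For concreteness, the filtered-alias setup behind Approach A might look like the sketch below. The index name `content`, the `user_id` field, and user `42` are made-up names for illustration; the shape of the `_aliases` action body follows the Elasticsearch aliases API, where an alias can carry both a filter and a routing value.

```python
# Sketch of Approach A: one shared index, plus a filtered alias per user.
# Index name, field names, and IDs are hypothetical.

def user_alias_action(index, user_id):
    """Build the body for POST /_aliases that adds a per-user alias.

    The alias carries a term filter (searches through it only see that
    user's documents) and a routing value (that user's documents land
    on, and are searched from, a single shard).
    """
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": "user_%s" % user_id,
                    "filter": {"term": {"user_id": user_id}},
                    "routing": str(user_id),
                }
            }
        ]
    }

body = user_alias_action("content", 42)
print(body["actions"][0]["add"]["alias"])  # user_42
```

The application then searches the alias `user_42` as if it were an ordinary index, and the per-user restriction is applied transparently.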

Approach B: ( "Per user" index )
    Use Elasticsearch and create one or more indexes per user. I could even
create one index for my own content and another for the content shared with
me, so "updates" from newly shared content don't affect my primary index; as
before, aliases, routing, and those goodies are my friends. In this scenario
I might reduce the number of shards to one, because the data indexed per
user won't be large enough to take advantage of more.
    Pros: Users who barely use the service don't consume memory
unnecessarily; memory goes mainly to active users. No need to rebuild costly
per-user filters.
    Cons: I'm honestly uncertain about the extent of the overhead caused by
having hundreds of thousands (potentially millions) of relatively small
indexes in a cluster, but I suspect that a lot of memory, CPU, and IO would
be wasted just on the aggregate bookkeeping required to operate them.
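A minimal sketch of what Approach B could look like; the naming convention and the single-shard settings body are assumptions, not something from the thread.

```python
# Sketch of Approach B: small, single-shard indexes per user.
# Naming convention and settings are hypothetical.

def per_user_index_settings():
    """Settings for a per-user index: one shard, since each user's
    data is too small to benefit from more."""
    return {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
        }
    }

def per_user_index_names(user_id):
    """Two indexes per user: own content and content shared with them,
    so a flood of newly shared items never touches the primary index."""
    return ("user_%s_own" % user_id, "user_%s_shared" % user_id)

own, shared = per_user_index_names(42)
# Searching both at once is just a comma-separated index list:
search_target = "%s,%s" % (own, shared)  # "user_42_own,user_42_shared"
```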

Approach C: ( "Geek style" indexing )
    Use Redis and maintain the appropriate data structures required to
perform user-scoped text search and faceting; maybe use bitmaps to reduce
lookup times and set intersections; expire the structures so only active
users' data uses memory, and load the required structures from disk (or
another DB) after user login.
    Pros: Lightweight. If I don't need the power of fuzzy search, prefixes,
and all those nice Elasticsearch goodies, and I just need scalable and
robust word-based search, this solution seems appealing. I recognize there
are some points that I haven't completely figured out how to solve yet, but
you know, nothing that a neat hack can't achieve.
    Cons: Well, I need to build it :), and I might be overlooking a feature
that will later turn this into a nightmare and that was already figured out
in ES.
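The core of Approach C is set intersections over an inverted index, scoped by the user's set of accessible items. A sketch in plain Python sets (in Redis, each of these sets would live under its own key and the intersection would run server-side via SINTER; all names and the tokenizer here are made up):

```python
from collections import defaultdict

# token -> set of item ids containing that token (in Redis: one SET per token)
postings = defaultdict(set)
# user -> set of item ids the user may see, own + shared (one SET per user)
acl = defaultdict(set)

def index_item(item_id, text):
    for token in text.lower().split():
        postings[token].add(item_id)

def grant(user, item_id):
    acl[user].add(item_id)

def search(user, query):
    """All-words match, restricted to what the user can access.
    In Redis this is a single SINTER over the token keys plus the
    user's ACL key."""
    sets = [postings[t] for t in query.lower().split()] + [acl[user]]
    return set.intersection(*sets)

index_item(1, "sunset beach photo")
index_item(2, "beach party video")
grant("alice", 1)
print(search("alice", "beach"))  # {1}; item 2 exists but isn't shared with alice
```

Revoking access is a single set removal on the ACL key; no filter rebuild is needed, which is exactly the property that makes this approach tempting for the "thousands of items shared or withdrawn at any point" requirement.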

Approach D: ( "Crowd-sourced wisdom" indexing )
    Well, maybe one of you has already paved this way and can teach me
something that I haven't even thought about :slight_smile:

For all these options I'll truly appreciate your feedback (bashings are
more than welcome). I've spent a lot of hours documenting myself on this,
and tomorrow I'll start digging further and playing with code around option
C; I hope I can get good and timely enough feedback to make the
right decisions. If I wrote something that is plainly wrong, or if it seems
like I'm working from the wrong assumptions, please let me know. This is a
really common question for people starting to use ES (and for others with
some miles on it, too), so this is a good place to teach us.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Shay Banon) #2

I would go for Approach A. The cons are not that big: a filter per alias is very cheap (it's not a Lucene index), and having to "fill" it uses the filter cache; even if it ends up expiring, it's not expensive at all to fill.
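With the filtered alias in place, the application-side query stays trivial, since the alias's filter is ANDed in by Elasticsearch itself. A sketch of the request a client would send (alias and field names are hypothetical):

```python
# Sketch: searching through a per-user filtered alias. The alias's term
# filter is applied by Elasticsearch automatically, so the application
# just searches "user_42" as if it were an ordinary index.
# Alias name and field names are hypothetical.

search_body = {
    "query": {"match": {"title": "beach"}},
    "size": 10,
}
endpoint = "/user_42/_search"  # the alias, not the underlying index
```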

On Apr 30, 2013, at 8:53 AM, AGuereca aguereca@gmail.com wrote:



(AGuereca) #3

Thanks for the advice, Shay. I'll test A and figure out a meaningful way to
measure filter generation time and its cache hit ratio.

On Thursday, May 2, 2013 7:25:59 AM UTC-7, kimchy wrote:



(magneticz) #4

Hi, I am working on a project where I use ES. My scenario is that there are various roles given to users, and users can access content according to those roles. Roles can be granted access to content or, conversely, deprived of access rights, and new roles might be created. Given this scenario, and since I am new to ES, I assume the best approach would be Approach A, but I am not sure how that approach works or how it should be done. Any help and guidance would be great.
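One way to adapt Approach A to roles is to index each document with the list of roles allowed to see it, and filter every search by the current user's roles. A sketch of the request body (field and role names are made up; the `filtered` query wrapper shown is the pre-2.0 Elasticsearch form):

```python
def role_scoped_search(query_text, user_roles):
    """Search restricted to documents whose "roles" field contains at
    least one of the user's roles. When a role gains or loses access
    to a document, only that document's "roles" field needs reindexing;
    the query never changes shape, and new roles need no schema work."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"title": query_text}},
                "filter": {"terms": {"roles": user_roles}},
            }
        }
    }

body = role_scoped_search("beach", ["editor", "viewer"])
```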


(system) #5