Before continuing I'd like to mention that I'm not really new to
ElasticSearch. I've used it since 2010 on more than one project, and over
these years I've seen this very same question asked more than once (I've even
pondered it myself before). But now that I need to make a design decision I
want to take the approach of "unlearning" and try to see this
"seemingly trivial" topic with fresh eyes, and take the opportunity to get
feedback from the great experts that I know exist on this group. (Yes,
you!)
Scenario:
A web service that can potentially serve millions of users, each user
having on average thousands of content items, each content item
having ~30 metadata attributes that can be searched (location, title,
tags, friends, etc.). Users should be able to search (mostly full-word
matches, no offsets or prefixes required) for content that matches a text
on one or more metadata attributes (tags and title are the common
case). An IMPORTANT rule is that search must
be restricted EXCLUSIVELY to the content that the given user has access to (own
content plus shared content). Shared content is the tricky part because a
user could be given or lose access to thousands of items at any point.
Response latency should be low enough to implement this in an autocomplete
fashion.
Approach A: ( "Family style" index )
Use ElasticSearch and create one big index (or one per attribute) that
spans all users. Make proper use of aliases, filters and routing at
index/search time to enforce data locality among shards.
Pros: Straightforward, easy to implement, searches across users are
free.
Cons: The 80/20 rule tells us that most users won't really have much
content nor use the service frequently, so building the per-user filters
(which internally, AFAIK, are Lucene indexes too) won't be a performant task.
Also, the user filters have to be recomputed each time a user is given or
removed access to content. (There may be more cons that I'm overlooking.)
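To make approach A a bit more concrete, here is a minimal sketch of the kind of request body I have in mind for a filtered, routed alias per user. It only builds the JSON for the `_aliases` endpoint and does not talk to a cluster. The index name `content`, the field `user_id` and the alias naming scheme `user_<id>` are assumptions made up for illustration, and the filter covers only a user's own content, not what has been shared with them.

```python
import json

def user_alias_action(index, user_id):
    """Build one `add` action for the `_aliases` endpoint.

    `user_id` as a field name and `user_<id>` as an alias name are
    hypothetical; adapt them to the real mapping.
    """
    return {
        "add": {
            "index": index,
            "alias": "user_%s" % user_id,
            # Only documents owned by this user are visible through the alias.
            "filter": {"term": {"user_id": user_id}},
            # Same routing value at index and search time keeps a user's
            # documents on a single shard.
            "routing": str(user_id),
        }
    }

def aliases_body(index, user_ids):
    """Body for POST /_aliases covering several users at once."""
    return {"actions": [user_alias_action(index, u) for u in user_ids]}

if __name__ == "__main__":
    print(json.dumps(aliases_body("content", [42]), indent=2))
```

Searching through `user_42` would then apply the filter and routing automatically. Shared content would still need something extra, which is exactly the part I'm unsure about.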
Approach B: ( "Per user" index )
Use ElasticSearch and create one or more indexes per user. I could even
create one index for my own content and another for the content shared with
me, so "updates" to shared content don't affect my primary index. As before,
aliases, routing and those goodies are my friends. In this scenario I might
reduce the number of shards to one because the data indexed per user won't be
large enough to take advantage of more.
Pros: Users who don't use the service much don't unnecessarily consume
memory, which is then used mainly for active users. No need to rebuild costly
per-user filters.
Cons: I'm honestly uncertain of the extent of the overhead caused by
having hundreds of thousands (potentially millions) of relatively small
indexes on a cluster, but I think that a lot of memory, CPU and IO will
be unnecessarily used just on the aggregated boilerplate tasks required to
operate them.
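As a rough way to see the scale of that con, here is a small sketch that builds single-shard index settings and counts how many shard copies the cluster would have to manage. The numbers (two indexes per user, one replica) are assumptions for illustration only, not measurements.

```python
def per_user_index_settings(replicas=1):
    """Settings body for creating a small, single-shard per-user index."""
    return {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": replicas,
        }
    }

def total_shards(users, indexes_per_user=2, shards_per_index=1, replicas=1):
    """Total shard copies (primaries plus replicas) across the cluster."""
    return users * indexes_per_user * shards_per_index * (1 + replicas)

if __name__ == "__main__":
    # Assumed: one million users, an "own" and a "shared" index each.
    print(total_shards(1000000))
```

Under those assumptions a million users already mean four million shard copies, each one a separate Lucene index, which is why the overhead worries me.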
Approach C: ( "Geek style" indexing )
Use Redis and maintain the appropriate data structures required to
perform user-scoped text search and faceting, maybe use bitmaps to speed up
lookups and set intersections, expire the structures so that only active
users' data uses memory, and load the required structures from disk (or
another DB) after user login.
Pros: Lightweight. If I don't need the power of fuzzy search, prefixes,
and all those nice ElasticSearch goodies, and I just need scalable and
robust word-based search, this solution seems appealing. I recognize that
there are some points that I haven't completely figured out how to solve yet,
but you know, nothing that a neat hack can't achieve.
Cons: Well, I need to build it :), and I might be overlooking a feature
that later will make this a nightmare, and that was already figured out in
ES.
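To illustrate the idea behind approach C, here is a minimal sketch of a user-scoped inverted index. It uses plain in-memory Python sets as a stand-in for Redis sets (the operations map roughly to `SADD`, `SREM` and `SINTER`); the class and method names are made up for this example, and expiry, faceting and bitmaps are left out.

```python
from collections import defaultdict

class UserScopedIndex:
    """Toy word-based search restricted to the items a user can access."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of items containing it
        self.access = defaultdict(set)    # user -> ids of items they can see

    def index_item(self, item_id, owner, text):
        # Full-word matching only: split on whitespace, lowercase.
        for word in text.lower().split():
            self.postings[word].add(item_id)
        self.access[owner].add(item_id)

    def share(self, item_id, user):
        self.access[user].add(item_id)

    def unshare(self, item_id, user):
        self.access[user].discard(item_id)

    def search(self, user, query):
        words = query.lower().split()
        if not words:
            return set()
        # Start from what the user may see, then intersect per word.
        result = set(self.access[user])
        for word in words:
            result &= self.postings.get(word, set())
        return result

if __name__ == "__main__":
    idx = UserScopedIndex()
    idx.index_item(1, "alice", "Beach sunset")
    idx.index_item(2, "bob", "Beach party")
    idx.share(2, "alice")
    print(idx.search("alice", "beach"))
```

Granting or withdrawing access only touches the user's access set, so sharing thousands of items never forces a rebuild of the word postings.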
Approach D: ( "Crowd-sourced wisdom" indexing )
Well, maybe one of you has already paved this way and can teach me
something that I haven't even thought about.
For all these options I'll truly appreciate your feedback (bashings are
more than welcome). I've spent a lot of hours reading up on this,
and tomorrow I'll start digging further and playing with code around option
C. I hope that I can get good and timely enough feedback to make the
right decisions. If I wrote something that is plainly wrong, or if it seems
like I'm making the wrong assumptions, please let me know. This is a really
common question for people starting to use ES (and for others with some miles
on it, too), so this is a good place to teach us.