Is there a way to do something like LuceneCustomFilter and combine (OR or AND) BitArrays in ElasticSearch?


(Adam Brown) #1

To give some context: I have about 50,000 users, each of whom has a set of
documents that they have access to (a list of document ids). These 50,000
users are organized into some 5,000 groups, which may be nested. A group
can contain users or other subgroups. Users can belong to more than one
group, and they can move between groups. Groups have access to the union of
the document sets of their constituent users or subgroups. I need a fast way to limit
query and facet operations to unions or intersections of the document sets
for these groups or users.
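
To make the data model concrete, here is a minimal sketch of resolving a user or (possibly nested) group to its document set. It assumes everything fits in in-memory dicts; the names `user_docs`, `group_members`, and `resolve_docs` and the sample data are made up for illustration and are not part of ElasticSearch or Lucene.

```python
from typing import Dict, FrozenSet, Optional, Set

# Hypothetical sample data: user -> ids of documents the user may access.
user_docs: Dict[str, Set[int]] = {
    "alice": {1, 2, 3},
    "bob": {3, 4},
    "carol": {5},
}

# Hypothetical sample data: group -> members (users or other groups).
group_members: Dict[str, Set[str]] = {
    "engineering": {"alice", "bob"},
    "company": {"engineering", "carol"},
}


def resolve_docs(name: str, seen: Optional[Set[str]] = None) -> FrozenSet[int]:
    """Return the union of document ids visible to a user or group."""
    seen = set() if seen is None else seen
    if name in seen:  # already counted; also guards against group cycles
        return frozenset()
    seen.add(name)
    if name in user_docs:
        return frozenset(user_docs[name])
    docs: Set[int] = set()
    for member in group_members.get(name, ()):
        docs |= resolve_docs(member, seen)
    return frozenset(docs)


# Unions and intersections across groups or users are plain set operations.
either = resolve_docs("engineering") | resolve_docs("carol")
both = resolve_docs("alice") & resolve_docs("bob")
```

Because group membership is resolved at call time, moving a user between groups only changes `group_members`; the per-user document sets stay as they are.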

I found this during my initial research:

It looks like it is possible to do what I need to do with Lucene using the
LuceneCustomFilter. I can do an index scan and check each document against
each user's list of ids and save true or false to each user's BitArray.
Then I can make an Intersect CustomFilter that takes multiple sub filters
and produces a BitArray by ANDing the filters, and a Union CustomFilter
that takes sub filters and ORs them. As long as the index doesn't change I
can just cache the roughly 55,000 BitArrays (50,000 users plus 5,000 groups),
and I am done. I can make the 5,000 group BitArrays by ORing the appropriate
user BitArrays.
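
The scan-then-combine idea can be sketched independently of Lucene. The example below uses plain Python integers as bit arrays (bit i stands for the document at position i in the index); it is not the LuceneCustomFilter API, and `index_doc_ids`, `user_access`, and the helper names are assumptions made for illustration.

```python
from functools import reduce
from typing import Dict, Iterable, List, Set

# Hypothetical index: list position is the internal document number,
# the value is the application-level document id.
index_doc_ids: List[str] = ["d10", "d11", "d12", "d13"]

# Hypothetical per-user access lists.
user_access: Dict[str, Set[str]] = {
    "alice": {"d10", "d12"},
    "bob": {"d12", "d13"},
}


def build_bits(allowed: Set[str], doc_ids: List[str]) -> int:
    """Scan the index once and set bit i when document i is allowed."""
    bits = 0
    for i, doc_id in enumerate(doc_ids):
        if doc_id in allowed:
            bits |= 1 << i
    return bits


def union(filters: Iterable[int]) -> int:
    """OR the sub filters together."""
    return reduce(lambda a, b: a | b, filters, 0)


def intersect(filters: Iterable[int], num_docs: int) -> int:
    """AND the sub filters together, starting from 'all documents'."""
    all_docs = (1 << num_docs) - 1
    return reduce(lambda a, b: a & b, filters, all_docs)


def matching_docs(bits: int, doc_ids: List[str]) -> List[str]:
    """Translate a bit array back into document ids."""
    return [d for i, d in enumerate(doc_ids) if bits >> i & 1]


# Cache one bit array per user, then derive a group by ORing its users.
cache = {u: build_bits(a, index_doc_ids) for u, a in user_access.items()}
cache["group"] = union([cache["alice"], cache["bob"]])
shared = intersect([cache["alice"], cache["bob"]], len(index_doc_ids))
```

The cached bit arrays are only valid while document positions stay the same, which matches the "as long as the index doesn't change" condition above.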

I will probably be building a new index daily, for other reasons, so what I
can probably do is perform this operation when I build the index. As new
things are added to the index, I think I can use new segments, and exclude
the new segment from search until I have a chance to scan the new segment.
Then I can probably just scan the current new segment periodically and OR
each of the 50,000 previous BitArrays with the BitArray from the new
segment (or maybe just append, or maybe keep a separate BitArray per
segment; I still need to figure that out exactly).

I have been searching for something similar in ElasticSearch. I am already
using ElasticSearch and am quite happy with all of the other features it
has. I'd rather not have to reinvent everything I am using in ElasticSearch
just to get the features I need from Lucene. It seems, since ElasticSearch
is built on Lucene, it might be able to provide similar functionality. I am
having trouble figuring out what to search for though. So far I have been
able to find no information on doing something similar using ElasticSearch
directly. If someone could point me in the direction of what I should be
searching for, or send me to some documentation, that would be great. I'm
not opposed to the idea of writing an ElasticSearch extension if that's
what is required, but I'm not really sure where to start with that either.

A possible alternative method to getting the initial document sets is
querying per user. The way the list of ids for each user is generated in
the first place is by doing a series of 10 to 30 queries, depending on the
user, and taking the union of the ids from the result sets. If there were a
way to get something like a BitArray out of the result of a query, OR
multiple result sets together to make the superset, and cache that superset
indefinitely until manually evicted (or otherwise control when and for how
long it is cached), that could also do the job.
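
The behaviour I am after in this alternative could be sketched as follows. The `run_query` callable stands in for whatever actually executes a query and returns matching document ids; it, `UserDocSetCache`, and the fake query data are all assumptions for illustration, not an existing ElasticSearch feature.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set

# Hypothetical signature: takes a query string, returns matching doc ids.
QueryRunner = Callable[[str], Iterable[int]]


class UserDocSetCache:
    """Caches the union of a user's query results until manually evicted."""

    def __init__(self, run_query: QueryRunner) -> None:
        self._run_query = run_query
        self._cache: Dict[str, FrozenSet[int]] = {}

    def get(self, user: str, queries: List[str]) -> FrozenSet[int]:
        """Run the user's queries once and cache the union of the ids."""
        if user not in self._cache:
            docs: Set[int] = set()
            for query in queries:
                docs.update(self._run_query(query))
            self._cache[user] = frozenset(docs)
        return self._cache[user]

    def evict(self, user: str) -> None:
        """Drop a user's cached superset so it is rebuilt on next use."""
        self._cache.pop(user, None)


# Fake query backend for the example; records which queries were executed.
fake_results = {"q1": [1, 2], "q2": [2, 3]}
executed: List[str] = []


def fake_run_query(query: str) -> Iterable[int]:
    executed.append(query)
    return fake_results[query]


doc_cache = UserDocSetCache(fake_run_query)
first = doc_cache.get("alice", ["q1", "q2"])
second = doc_cache.get("alice", ["q1", "q2"])  # served from the cache
calls_before_evict = len(executed)
doc_cache.evict("alice")
third = doc_cache.get("alice", ["q1", "q2"])  # queries run again
```

Nothing expires on its own here; entries only leave the cache through `evict`, which is the manual control described above.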



(system) #2