Filtering to a subset of the full index


(Phil Messenger) #1

I have a large (tens of billions of docs) index spread across a bunch of machines. I'd like to constrain a query to a subset of that index. Example: each doc might be a message with an author, and I want to constrain a search to messages written by a subset of all authors.

The catch is that this subset may still be a million people, so clearly a boolean query isn't going to work.

I'm wondering if I can do this with a custom plugin. I could probably do something with a custom scorer, but that doesn't feel very efficient. I'd need to keep a cache of docId -> authorId in memory for this to work at a reasonable speed.

Am I missing something obvious?


(Nik Everett) #2

Bool queries should work fine. Just stick a term filter in the should
clause. You can limit the fanout with routing if all your authors are
small. That can create hotspots if some are large, but you can work around
that in other ways too.
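
A minimal sketch of what that request could look like, built as a plain Python dict. The field names `body` and `author_id` are assumptions for illustration, not taken from the thread, and the routing value only helps if documents were indexed with the author id as their routing key.

```python
import json


def build_author_query(text, author_ids):
    """Build a search body that matches `text` and restricts hits to the given authors.

    `body` and `author_id` are hypothetical field names.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": text}}],
                # One term clause per author; at least one of them has to match.
                "should": [{"term": {"author_id": a}} for a in sorted(author_ids)],
                "minimum_should_match": 1,
            }
        }
    }


def routing_param(author_ids):
    """Comma-separated routing values, so only the shards holding these authors are queried.

    Assumes documents were indexed with routing set to the author id.
    """
    return ",".join(str(a) for a in sorted(author_ids))


if __name__ == "__main__":
    authors = [42, 7, 19]
    print(json.dumps(build_author_query("release notes", authors), indent=2))
    print("routing =", routing_param(authors))
```

With a handful of authors this stays small. The request body grows linearly with the number of authors, which is the concern raised in the next post.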


(Phil Messenger) #3

Will that perform ok if there are say 1,000,000 terms in the filter? Seems like a lot of data to be sending and parsing for each request.

I have another issue which is that I don't necessarily have access to a canonical list of ids. Internally we represent them as bloomfilters, so my ideal scenario would be to pass the bloomfilter and use that directly.

I found this https://groups.google.com/forum/#!topic/elasticsearch/dP3R2Gc4J-g which got me thinking about a custom filter. Unfortunately Lucene filters are expected to return a DocIdSet which allows doc ids to be iterated.


(Lee Kohn) #4

A system I'm working on is running into pretty much the exact same issue. How did it turn out? I'm considering doing it via a scripted filter to implement the bloom filter.


(Phil Messenger) #5

We got acceptable performance (for our use case) by using a term filter, referencing terms stored in a doc in the index. This meant we avoided the need to send dozens of megs of JSON around for every query.
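
A rough sketch of that approach, assuming the terms lookup form of the `terms` query as documented in recent Elasticsearch versions (the exact wrapper differed in the release we were on). The index name `author_sets`, the document id, and the field names are all hypothetical.

```python
import json


def build_terms_lookup_query(text, lookup_index, lookup_id, path="author_ids"):
    """Build a search body whose author list is read from a stored document.

    The cluster fetches the terms from `lookup_index`/`lookup_id` at the field
    `path`, so the request carries a reference instead of the full id list.
    `body` and `author_id` are hypothetical field names.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": text}}],
                "filter": [
                    {
                        "terms": {
                            "author_id": {
                                "index": lookup_index,
                                "id": lookup_id,
                                "path": path,
                            }
                        }
                    }
                ],
            }
        }
    }


if __name__ == "__main__":
    body = build_terms_lookup_query("release notes", "author_sets", "subset-1")
    print(json.dumps(body, indent=2))
```

The request stays the same size however many authors are in the stored list, since only the reference travels with each query.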

At the time I couldn't work out how to use a bloom filter because all of the internal filtering stuff is built around DocIdSets, and the ability to iterate over them. That's not an operation one can do on a bloom filter. If I were to look at it again I'd probably look at a compressed bitset, Roaring bitmaps for example.
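
To illustrate the mismatch, here is a toy sketch in plain Python. It is not Lucene code and not a real Roaring bitmap: the bitset is an uncompressed Python integer, and all class and function names are made up for the example. The Bloom filter can only answer "might this author be in the set?", while the bitset can enumerate matching doc ids in order, which is what the filter machinery needs. Bridging the two means testing every document's author against the Bloom filter, using a docId -> authorId mapping like the one mentioned in the first post.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: supports add and membership tests, but not iteration."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class DocBitSet:
    """Uncompressed bitset over doc ids; unlike the Bloom filter it can be iterated."""

    def __init__(self):
        self.bits = 0

    def add(self, doc_id):
        self.bits |= 1 << doc_id

    def __iter__(self):
        bits, doc_id = self.bits, 0
        while bits:
            if bits & 1:
                yield doc_id
            bits >>= 1
            doc_id += 1


def docs_matching(bloom, doc_to_author):
    """Turn a Bloom filter of authors into an iterable set of doc ids.

    This has to look at every document, because the Bloom filter cannot list
    its members.
    """
    matches = DocBitSet()
    for doc_id, author_id in doc_to_author.items():
        if author_id in bloom:
            matches.add(doc_id)
    return matches


if __name__ == "__main__":
    bloom = BloomFilter()
    for author in ("alice", "bob"):
        bloom.add(author)
    doc_to_author = {0: "alice", 1: "carol", 2: "bob", 5: "alice"}
    print(list(docs_matching(bloom, doc_to_author)))
```

Because of false positives, the result can contain a few documents by authors who were never added to the Bloom filter. It never drops a document by an author who was.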

