Filtered aliases and huge ids/terms filter

Hi,

Here's the problem I'm trying to solve: we have a fairly large search catalog (hundreds of millions of documents) which we need to split into two sets, a relatively small one (1-2%) and a big one (the remaining 98-99%). Each set should be queryable independently (it would be great if we could get X results from both groups at the same time with a single search query, but that's not a requirement). The thing is that our documents change their "set membership" quite often, and we don't want that to trigger document reindexing (which is a very complex task for us for various reasons). It's also worth mentioning that we're fine with changing the documents' set membership all at once in a periodic bulk operation (the change does not need to be real-time).

So we thought about using filtered aliases for that. We started with an ids filter (containing the ids of the smaller set, ~1.5M ids, ~20MB in JSON format). Right after creating such an alias we noticed increased response times from the cluster, even though no query was actually using the alias. We removed the alias and things went back to normal.
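For readers who want to see the shape of such a request, here is a minimal sketch of the body for the aliases API with an ids filter. The index name `catalog`, the alias name `catalog_small`, and the helper `build_filtered_alias_request` are made up for illustration; only the `actions` / `add` / `filter` / `ids` structure is meant to reflect the ES 1.x request format. It only builds and measures the JSON, it does not talk to a cluster.

```python
import json


def build_filtered_alias_request(index, alias, ids, doc_type=None):
    """Build the body for POST /_aliases that adds an alias filtered by document ids.

    `index`, `alias` and `doc_type` are hypothetical names supplied by the caller.
    """
    ids_filter = {"values": [str(doc_id) for doc_id in ids]}
    if doc_type is not None:
        # The ids filter optionally restricts matching to one mapping type.
        ids_filter["type"] = doc_type
    return {
        "actions": [
            {"add": {"index": index, "alias": alias, "filter": {"ids": ids_filter}}}
        ]
    }


if __name__ == "__main__":
    # Tiny stand-in for the ~1.5M ids of the smaller set.
    body = build_filtered_alias_request("catalog", "catalog_small", range(1000))
    payload = json.dumps(body)
    # The serialized size grows linearly with the number of ids.
    print("payload size in bytes:", len(payload))
```

Since the alias definition, filter included, is stored with the index metadata, the size of this payload may matter even while the alias is idle. That is only a guess at the cause of the slowdown, not something confirmed in this thread.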

The next thing we tried was a terms filter on the "_id" field using the terms lookup mechanism. We created our "lookup document" (again, ~20MB of source holding the ids of the documents from the smaller set), made sure it was replicated to all nodes, and used it in a terms filter in a search query that normally takes ~100ms to complete without that filter. After proper warmup we got response times of around 1-3s (plain execution mode, cached), which we can't accept. We experimented with the other execution modes, with no luck.
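A rough sketch of what such a query body could look like, using the ES 1.x `filtered` query with a `terms` filter in lookup form. The lookup location (`lookup` index, `sets` type, `small_set` id, `ids` path) and the helper name are assumptions for illustration, not the names used in our cluster, and combining `execution` with a lookup is shown the way it is described above rather than verified against the 1.7.3 parser.

```python
import json


def build_terms_lookup_query(lookup_index, lookup_type, lookup_id, path,
                             execution="plain", inner_query=None):
    """Build a search body that filters on _id using terms fetched from a lookup document.

    All lookup coordinates are hypothetical; `path` is the field of the lookup
    document that holds the list of ids.
    """
    terms_filter = {
        "_id": {
            "index": lookup_index,
            "type": lookup_type,
            "id": lookup_id,
            "path": path,
        },
        # Execution mode of the terms filter, e.g. "plain", "bool", "and", "or".
        "execution": execution,
    }
    return {
        "query": {
            "filtered": {
                "query": inner_query if inner_query is not None else {"match_all": {}},
                "filter": {"terms": terms_filter},
            }
        }
    }


if __name__ == "__main__":
    body = build_terms_lookup_query("lookup", "sets", "small_set", "ids")
    print(json.dumps(body, indent=2))
```

The lookup document itself would then be a single document whose `ids` field contains the full list of ids of the smaller set.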

It's probably worth mentioning that we still use ES 1.x (1.7.3 to be more specific).

Any suggestions on how to solve this?

Best,
Tomasz

Can you quantify what you mean by

"change their 'set membership' quite often"?

Is it possible to calculate this membership based on the contents of the documents? Can you adapt the document contents so that this membership can be determined by ES instead of passing an explicit list of document ids?

Can you adapt the document contents so that this membership can be determined by ES instead of passing an explicit list of document ids?

But that would require reindexing a document whenever it moves from one set to the other, and that's something we would like to avoid.

No, what I meant is whether you can enrich the original document structure so that this dynamic membership can be defined in terms of standard filters.
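One possible reading of this suggestion, sketched under heavy assumptions: if the documents already carried (or were enriched once with) a field that membership can be derived from, the alias filter could be a standard filter on that field, and moving the boundary would mean redefining the alias rather than touching documents. The field name `popularity_score`, the threshold, and the helper below are all hypothetical; nothing in this thread says such a field exists.

```python
import json


def build_range_alias_request(index, alias, field, min_value):
    """Build a POST /_aliases body for an alias defined by a range filter on `field`.

    `field` and `min_value` are placeholders for whatever attribute
    actually determines set membership.
    """
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": alias,
                    "filter": {"range": {field: {"gte": min_value}}},
                }
            }
        ]
    }


if __name__ == "__main__":
    # Periodic bulk change: rebuild the alias body with a new threshold.
    body = build_range_alias_request("catalog", "catalog_small", "popularity_score", 0.98)
    print(json.dumps(body, indent=2))
```

Whether this helps depends entirely on whether membership can be expressed through document contents at all, which is the open question above.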