How does PercolatorQuery work when query extraction fails?

Let me explain our percolator architecture.

Our percolator docs look like this

{
  "domain": <>,
  "query": { ... }
}

We have around 100K percolators registered, and the number of percolators per domain can vary a lot. Whenever we percolate a document, we want to percolate it only against the percolators of the domain that the document belongs to.
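
For reference, a minimal sketch of how one of these percolators is registered (the index name, doc id and the color field are made up purely for illustration):

# fields referenced by the stored queries (color here) have to be mapped
# alongside the percolator field
PUT /percolators
{
  "mappings": {
    "properties": {
      "domain": { "type": "keyword" },
      "color":  { "type": "keyword" },
      "query":  { "type": "percolator" }
    }
  }
}

PUT /percolators/_doc/1
{
  "domain": "domain-a",
  "query": {
    "term": { "color": "red" }
  }
}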

In 1.7, we made use of meta filters to filter down to the relevant percolators.
When we migrated to 7.2 with a similar architecture, performance degraded drastically.

The query we hit was

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "domain": <> } },
        { "percolate": { "document": {...} } }
      ]
    }
  }
}

FYI, we have a few queries (around 3000) for which ES failed to extract terms.
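
(Such queries can be found by searching on query.extraction_result, the same internal field that shows up in the profile output below; the index name is just a placeholder:)

GET /percolators/_count
{
  "query": {
    "term": { "query.extraction_result": "failed" }
  }
}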

I ran the profile API against the above query and saw the following execution:

– PercolateQuery
   – CoveringQuery
   – TermQuery [query.extraction_result:failed]
– TermQuery [on domain]

I'd like to know how the PercolateQuery actually runs. In my query above, when is the domain filter applied: before or after the document has been tested against the potential candidates? Because the above query clearly isn't making use of the meta filters, IMO.

Would you suggest that the domain filter be put inside the percolator query itself, so that ES can extract it during registration and hence make a better decision during percolation?
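
I'm imagining something like this at registration time, with the color term standing in for the original stored query:

# the domain term now gets extracted as part of the stored query itself
PUT /percolators/_doc/1
{
  "domain": "domain-a",
  "query": {
    "bool": {
      "filter": [
        { "term": { "domain": "domain-a" } },
        { "term": { "color": "red" } }
      ]
    }
  }
}

(The document being percolated would then also carry its domain value, so the extracted domain term can actually match.)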

We did that eventually and did see a performance improvement. What surprised me was that just providing the domain gave a performance boost. We had a few other meta fields as well that we thought of injecting into the percolator query, in the hope that the candidate queries would reduce further. But when we added more such terms, the performance degraded again. Any idea why that would be? Unlike domain, the other filters don't have a lot of distinct values [maybe 2-3].

Also, what about queries that ES couldn't extract terms from? Do they always get executed blindly? I guess in that case it makes sense to put the meta filters inside the percolator query itself. Let me know why adding more meta filters could degrade performance.

I suspect it has to do with the way CoveringQuery works. I'm not really sure though. Could you explain?

Actually no, whenever our percolator documents have field values that span across domains, ES produces a lot of false positives.

Can anyone think of a solution?

A PercolateQuery gets converted to a CoveringQuery & a TermQuery [extraction failed]. And since the CoveringQuery behaves kind of like a "should", because of our design there are too many false positives that ES is testing against!

Do you have many bool queries with should clauses stored in your percolator index that would also match most of the documents you are percolating against?

Percolating with a document that matches multiple of those should clauses can cause lots of false positives in the first query phase and thus much longer query runtimes. The percolator is not aware that only one of those should clauses is required for a match, and thus ignores other conditions from the queries that would have helped in filtering possible candidates for a match.

Elasticsearch only uses a minimum number of conditions that have to match in the CoveringQuery and loses a lot of the original query structure because of this. This can cause must clauses that would have filtered out most stored queries to be ignored, just because there were other should clauses that did match the document.

@Daniel_Penning I'm pretty sure most of the queries make use of must and filter only; there might be very rare usage of should clauses. During extraction, does ES deal with terms in should vs must differently, by any chance?

The problem, we think, is that since all of our percolators [across all domains] are in the same index, and since most of our fields have a fixed set of values [which therefore become common across domains], ES gathers percolators from other domains as well, which causes lengthy execution times. That's why things worked just fine in 1.7: we always made sure to filter down to the percolators of the domain that the percolated document belongs to!
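
A made-up example of what I mean:

# registered under domain-a
{ "domain": "domain-a", "query": { "term": { "color": "red" } } }

# registered under domain-b, but on the same field value
{ "domain": "domain-b", "query": { "term": { "color": "red" } } }

# both stored queries produce the identical extracted term color:red, so when we
# percolate a domain-a document containing "color": "red", nothing at the
# candidate level distinguishes the two – the domain is only known through the
# separate term filter outside the percolate query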

Unfortunately we can't have indices per domain because we manage a lot of domains

I've opened up this issue as well https://github.com/elastic/elasticsearch/issues/47095

Anyway, let me know if ES handles must and should clauses differently during extraction.

We mostly make use of terms, term, range & match queries.

Terms and match queries are handled by the percolator in the same way as a bool should query with multiple term queries, so many false positives could be causing your performance problems. Elasticsearch stores a count in each percolator document with the minimum number of extracted terms that have to match, and this is used as the minimum number of required matches for the CoveringQuery. For a bool query with must clauses this equals the number of conditions, but for terms/match queries and the should clauses of a bool query it equals 1 (or the minimum_should_match that was set for those queries).
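
Two made-up stored queries to illustrate the difference:

# terms query – three extracted terms, but only one of them has to match
# for this stored query to become a candidate
{ "terms": { "color": ["red", "blue", "green"] } }

# bool query with two must clauses – two extracted terms, and both have to
# match before this stored query becomes a candidate
{
  "bool": {
    "must": [
      { "term": { "domain": "domain-a" } },
      { "term": { "color": "red" } }
    ]
  }
}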

Have a look at this article, which gives a brief explanation of how queries are executed in Elasticsearch.

Ahhh, now I know why a CoveringQuery is used instead of a simple bool query with should clauses for each term: the minimum should match is per document!

Thanks for the explanation @Daniel_Penning. Now I get what the author meant by per-document in this issue https://github.com/elastic/elasticsearch/issues/26307

Is it possible to update query.extracted_terms by running an update op? That way I could nuke the terms that ES extracted by itself and store my own metadata to reduce the false positives! I know it would be a hack, and I'm unsure whether it can be updated at all.

ok turns out we can't :frowning:
