Efficient Subquery Combinations

Hello elastic enthusiasts,

I think I have a fairly unique Elastic use case, and I'm struggling to find an efficient way to structure the queries I need to run. I'm new to the Elasticsearch forums and to Elastic in general, so please point me to the right place if this is posted in the wrong spot. And thanks a bunch for your help! Here's a simplified example of what I want to do:

I have 4 subqueries, let's call them A, B, C, and D. Each one is a semantic-similarity-style query: I want to match documents that have a sufficiently high cosineSimilarity against a dense_vector field AND that match a few terms filters, which differ across the four queries. I have previously done this both with a script_score query (with cosineSimilarity in the script) and with a knn query. I think either way is acceptable for my purposes. Each query matches 100-500k documents.

This is where it gets tricky. I have two use cases.

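Use case 1 -- I want to return the count of documents from each subquery, minus the documents that already belong to a higher-priority subquery (one earlier in the list). Here + means set union and - means set difference. Let's call the results A', B', C', and D' where: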

  • A' = A
  • B' = B - A
  • C' = C - (A + B)
  • D' = D - (A + B + C)
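
To make the definitions above concrete, here is a minimal sketch using plain Python sets of made-up document IDs. The function name `exclusive_counts` and the toy IDs are purely illustrative and have nothing to do with the Elasticsearch API; the point is only to pin down the set arithmetic.

```python
def exclusive_counts(ordered_sets):
    """For each set, count the members not present in any earlier set."""
    seen = set()      # union of all sets processed so far
    counts = []
    for s in ordered_sets:
        counts.append(len(s - seen))  # e.g. |C - (A + B)| for the third set
        seen |= s
    return counts


# Toy document IDs standing in for the matches of each subquery.
A = {1, 2, 3}
B = {2, 3, 4, 5}
C = {1, 5, 6}
D = {6, 7, 8, 9}

counts = exclusive_counts([A, B, C, D])  # [|A'|, |B'|, |C'|, |D'|]
print(counts)
```

Because A', B', C', and D' are disjoint by construction, their counts always add up to the number of unique documents across all four subqueries.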

Use case 2 -- I want to get aggregations of all unique documents from A, B, C, and D. Let's call that set E. I believe this is trickier than it sounds, due to each subquery being a combination of terms filters and semantic similarity, so I can't just nest all the conditions under a should key.

  • E = dedupe(A + B + C + D)
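
One possible way to express E, sketched here as Python dicts in query DSL shape, is to wrap each complete subquery (terms filters plus similarity threshold) as a single clause under a bool/should, rather than nesting the individual conditions. Everything specific below is an assumption for illustration: the field names `embedding`, `category`, and `lang`, the query vectors, and the 1.8 threshold are hypothetical. Whether a script_score query with min_score behaves as a pure filter when nested like this has not been verified here and should be tested against a real cluster.

```python
def subquery(term_filters, query_vector, min_score, vector_field="embedding"):
    """Build one subquery: terms filters plus a cosine-similarity cutoff.

    The script adds 1.0 so scores stay non-negative, which means a
    min_score of 1.8 corresponds to a cosine similarity of at least 0.8.
    """
    return {
        "script_score": {
            "query": {
                "bool": {
                    "filter": [{"terms": {f: v}} for f, v in term_filters.items()]
                }
            },
            "script": {
                "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                "params": {"query_vector": query_vector},
            },
            "min_score": min_score,
        }
    }


def union_query(subqueries):
    """A document matches if it matches at least one complete subquery."""
    return {"bool": {"should": list(subqueries), "minimum_should_match": 1}}


# Hypothetical filters and vectors for A, B, C, and D.
A = subquery({"category": ["news"]}, [0.1, 0.2, 0.3], 1.8)
B = subquery({"category": ["blog"]}, [0.3, 0.2, 0.1], 1.8)
C = subquery({"lang": ["en", "de"]}, [0.5, 0.1, 0.0], 1.8)
D = subquery({"lang": ["fr"]}, [0.0, 0.4, 0.2], 1.8)

# size 0 skips document retrieval; only the aggregation results come back.
body = {
    "size": 0,
    "query": union_query([A, B, C, D]),
    "aggs": {"by_category": {"terms": {"field": "category"}}},
}
```

A document that matches several subqueries still counts once, since the bool query matches each document at most once, so no explicit dedupe step is needed in this formulation.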

I care about document counts, i.e. aggregation values, not the documents themselves, so I don't need to retrieve any docs, just the aggregation counts. Is there an efficient way to structure these queries? Since A', B', C', D', and E all require recomputing the same sets A, B, C, and D, it feels like there should be an efficient way to do it, but I haven't come up with anything. A few things I've already tried are:

  • Setting up B', C', and D' as script_score queries, where I embed the definitions of the exclusionary queries in the script, manually set the score to 0.0 when one of them matches, and use a min_score > 0 so those documents are dropped.
    • This works, but is brittle and feels inefficient.
    • Is there a better way to do this with retrievers or some other sort of child abstraction? I come from a SQL background, and what I really want right now is an Elasticsearch equivalent of a CTE.
  • Using an rrf retriever query to get E, where A, B, C, and D are standard retrievers in the rrf retrievers list. I'm not sure if this would work, since use of rrf requires the enterprise license (I'm not opposed to upgrading my license, I just want to verify it will work first). Is there a way to combine child retrievers without rrf?
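
As an alternative to embedding the exclusions in a script, the exclusive counts from use case 1 could perhaps be requested in one call with a `filters` aggregation, where each named bucket is a bool query that requires one subquery and excludes the earlier ones via must_not. The sketch below only builds the request body as a Python dict. The helper name `exclusive_filters_body`, the aggregation name `exclusive`, and the placeholder term queries are hypothetical; in practice each placeholder would be the full definition of A, B, C, or D. This still repeats the subquery definitions inside the request, so it is not a true CTE, and whether the cluster reuses any work across buckets is not something this sketch can show.

```python
def exclusive_filters_body(named_subqueries):
    """Build a size-0 request with one bucket per subquery.

    named_subqueries is a list of (name, query) pairs in priority order.
    Each bucket requires its own query and excludes all earlier ones.
    """
    filters = {}
    earlier = []
    for name, query in named_subqueries:
        clause = {"bool": {"filter": [query]}}
        if earlier:
            # Copy the list so later appends do not change this bucket.
            clause["bool"]["must_not"] = list(earlier)
        filters[f"{name}_prime"] = clause
        earlier.append(query)
    return {"size": 0, "aggs": {"exclusive": {"filters": {"filters": filters}}}}


# Placeholder queries standing in for the real definitions of A, B, C, and D.
placeholders = [
    ("A", {"term": {"tag": "a"}}),
    ("B", {"term": {"tag": "b"}}),
    ("C", {"term": {"tag": "c"}}),
    ("D", {"term": {"tag": "d"}}),
]

body = exclusive_filters_body(placeholders)
buckets = body["aggs"]["exclusive"]["filters"]["filters"]
```

Each bucket's doc_count in the response would then correspond to A', B', C', and D' respectively, assuming the subqueries behave the same way inside the aggregation as they do as top-level queries.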
