Thanks for the help, it's much appreciated! I didn't know about Debug.explain().
Let's dig into my use case step by step. Suppose an index has, among others, the fields left and right. For starters, let's say we want to find the number of distinct values across both left and right over all documents. That is, each value should be counted only once, even if it appears in both left and right. If we run two cardinality aggs and add the results together, any value that appears in both fields is counted twice. The only way I've found to do it (without returning all values and checking for duplicates by hand on the app side) is to run a scripted cardinality aggregation, where the script would be [doc.left, doc.right] (like here). This works fine.
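To make the double counting concrete, here is a toy sketch in plain Python. It does not talk to Elasticsearch at all; the three documents are made up for illustration, and it assumes exact counts rather than the approximate ones a cardinality agg returns.

```python
# Toy documents, made up for illustration; only the left/right fields matter.
docs = [
    {"left": "a", "right": "b"},
    {"left": "b", "right": "c"},
    {"left": "a", "right": "a"},
]

left_values = {d["left"] for d in docs}
right_values = {d["right"] for d in docs}

# What adding up two separate cardinality aggs would give (exact counts assumed).
sum_of_cardinalities = len(left_values) + len(right_values)

# What I actually want: each value counted once across both fields.
distinct_in_either = len(left_values | right_values)

# The gap between the two is the number of values that occur in both fields.
overlap = len(left_values & right_values)

print(sum_of_cardinalities, distinct_in_either, overlap)
```

The sum overshoots by exactly the size of the overlap, and the overlap is the one number that two independent cardinality results cannot tell me.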
However, the problem I'm dealing with is a bit more complex. I don't want to include all documents in the aggregation above; instead, I want to select left values based on one criterion and right values based on another. So, take the index, apply filter1 to it and pick only the left values of the resulting documents. Then apply filter2 and pick only the right values of the documents that match. Finally, put all of these values in the same basket and find the number of distinct values.
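As a reference for the result I'm after, the steps above can be modeled in plain Python. This is only a sketch of the intended semantics, not something Elasticsearch runs: the tag and n fields and both predicates are hypothetical stand-ins for filter1 and filter2.

```python
from typing import Callable, Iterable


def distinct_left_right(
    docs: Iterable[dict],
    filter1: Callable[[dict], bool],
    filter2: Callable[[dict], bool],
) -> int:
    """Count distinct values among left (where filter1 matches)
    and right (where filter2 matches), each value counted once."""
    basket = set()
    for doc in docs:
        if filter1(doc):
            basket.add(doc["left"])
        if filter2(doc):
            basket.add(doc["right"])
    return len(basket)


# Made-up documents; "tag" and "n" exist only to give the filters something to test.
docs = [
    {"left": "a", "right": "x", "tag": "red", "n": 1},    # filter1 only
    {"left": "b", "right": "a", "tag": "blue", "n": 2},   # filter2 only
    {"left": "c", "right": "b", "tag": "red", "n": 3},    # both filters
    {"left": "d", "right": "e", "tag": "green", "n": 0},  # neither filter
]

result = distinct_left_right(
    docs,
    filter1=lambda d: d["tag"] == "red",  # hypothetical filter1
    filter2=lambda d: d["n"] >= 2,        # hypothetical filter2
)
print(result)
```

Note that a document matching both filters contributes both of its values, while a value such as "a" that arrives once via left and once via right is still counted once.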
Initially, I wanted to use named queries for this and just add an if to the script that would decide whether to return doc.left, doc.right or both, based on the matching queries. However, since this is unsupported, I tried abusing the document score to do the same thing (find out which filters matched based on the score). I failed because of the described behavior. Too bad, given that performance-wise it works fine.
I could also have two filter aggregations and compute cardinality on each of them, but then I wouldn't know how to combine the results while accounting for duplicates.
The only accurate way I can think of is returning to the application all distinct values of left with filter1 applied, then doing the same for right and filter2, and finally computing the cardinality on the app side. This, however, would be too slow.
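For reference, this is roughly what I mean by the app-side approach, sketched in Python. The request body uses a terms aggregation nested under a filter aggregation for each side; the aggregation names (left_side, right_side, values), the status and type fields in the example filters, and the hand-written sample response are all assumptions made for illustration.

```python
def build_request(filter1: dict, filter2: dict, size: int) -> dict:
    """Request body: one filter agg per side, each with a terms sub-agg."""
    return {
        "size": 0,  # no hits needed, only aggregations
        "aggs": {
            "left_side": {
                "filter": filter1,
                "aggs": {"values": {"terms": {"field": "left", "size": size}}},
            },
            "right_side": {
                "filter": filter2,
                "aggs": {"values": {"terms": {"field": "right", "size": size}}},
            },
        },
    }


def count_distinct(response: dict) -> int:
    """Union the bucket keys of both sides and count them once each."""
    aggs = response["aggregations"]
    left = {b["key"] for b in aggs["left_side"]["values"]["buckets"]}
    right = {b["key"] for b in aggs["right_side"]["values"]["buckets"]}
    return len(left | right)


# Hypothetical filters standing in for filter1 and filter2.
body = build_request(
    {"term": {"status": "open"}},
    {"term": {"type": "b"}},
    size=10000,
)

# Hand-written sample shaped like a search response, not real output.
sample_response = {
    "aggregations": {
        "left_side": {
            "doc_count": 3,
            "values": {"buckets": [{"key": "a", "doc_count": 2},
                                   {"key": "c", "doc_count": 1}]},
        },
        "right_side": {
            "doc_count": 2,
            "values": {"buckets": [{"key": "a", "doc_count": 1},
                                   {"key": "b", "doc_count": 1}]},
        },
    }
}

print(count_distinct(sample_response))
```

The count is only exact if size is large enough to cover every distinct value on each side, since a terms aggregation returns at most size buckets. With high-cardinality fields that means shipping a lot of buckets to the application, which is where the slowness comes from.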
If you think there's a better way to solve this with reasonable performance, even by running multiple queries, I'd be interested to know.