I would like to use the significant_terms aggregation to do the comparison based on something other than a frequency ratio: the significant_terms aggregation compares the proportion of an element in a foreground set against a background set.
If I understand correctly, all the possible heuristics (jlh, gnd, percentage, custom...) rely on the cardinalities of four sets:
Number of documents containing the term in the subset.
Number of documents containing the term in the superset.
Number of documents in the subset.
Number of documents in the superset.
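For reference, the existing script_heuristic already exposes exactly these four counts to a custom script. A sketch (the `product` field and the scoring formula are hypothetical; the `params._subset_freq`/`params._superset_freq`/`params._subset_size`/`params._superset_size` variables are the real ones):

```json
{
  "aggs": {
    "significant_products": {
      "significant_terms": {
        "field": "product",
        "script_heuristic": {
          "script": {
            "lang": "painless",
            "source": "params._subset_freq / (params._superset_freq + 1.0)"
          }
        }
      }
    }
  }
}
```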
It would be interesting to be able to compare these sets based on something other than frequency, for example by using the result of a different aggregation.
e.g. significant products per city compared to the country, based on the average product price (and not the number of sales)...
In order to do that, it would be great if the significant_terms aggregation were a pipeline aggregation and if we could use the buckets_path syntax to define a script_heuristic.
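To make the idea concrete, here is a purely hypothetical sketch of what such a request might look like; none of this exists today (neither a `buckets_path` inside `script_heuristic`, nor the parameter names, and where the background value like `countryAvgPrice` would come from is exactly the open question):

```json
{
  "aggs": {
    "cities": {
      "terms": { "field": "city" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } },
        "significant_products": {
          "significant_terms": {
            "field": "product",
            "script_heuristic": {
              "buckets_path": { "cityAvgPrice": "avg_price" },
              "script": { "source": "params.cityAvgPrice / params.countryAvgPrice" }
            }
          }
        }
      }
    }
  }
}
```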
What do you think about supporting this syntax?
I tried to find a workaround by using a regular aggregation and the bucket-filter aggregation but I'm quickly limited by not being able to reference a parent aggregation from a buckets-path ...
That pre-dates pipeline aggregations so the idea was to access properties of docs as part of the aggregation phase rather than post-processing aggregation values using a pipeline agg.
I'm not sure things like average prices make sense to compare with the significance algorithms. For me significance is a measure of how "bought-into" something another thing is. Set overlaps. Your "significant other" is someone you both invest a lot of time with. A "significant investor" is someone who has invested a large (to you and them) sum of money. A significant term is something like h5n1 that only seems to appear in a particular context e.g. "bird flu" search results. These are measures of a set size (total time, total money and total docs) and the extent to which they overlap with another set. So sums of a resource. We can visualize the size of set intersections as Venn diagrams.
Averages, maxes or mins, I suspect, are of less use: they don't quantify the size of a set and are therefore less helpful for examining set overlaps.
You are right when you describe the meaning of "significant":
1. defining some sets
2. comparing these sets
I don't see why #2 should rely only on the frequency of a value...
Let's say that you are able to calculate the strength of the illness, maybe considering the number of deaths and the number of symptoms, or... who knows...
So let's say it is something more complex than just looking at one unique cardinality, and that it requires computing a KPI... by using the matching documents in the different sets.
Imagine that your document represents an illness, maybe the state of an illness: one document when the disease occurs and one when the patient dies. This is Friday, let's have some fun! More importantly, you have enough information in these documents to compute a KPI.
Now based on the different sets, you are going to look at the intersections, basically at your Venn diagrams: this is point #1. Then you want to compare the different intersections. You go through all the documents in these sets, and you count the number of occurrences. But just as you can count, you can also do a more complex computation, like computing a KPI that "represents" your set. The count is not the only measure that represents your set. Any computation can be representative. And these computations can be done by an aggregation.
So h5n1 could be significant in SF because the number of occurrences is abnormally high compared to the number of occurrences in the country (7% instead of 1%).
But it could also be because the "strength" of h5n1 in SF is much higher than in the country... We cannot explain why... but we detected this anomaly based on something other than a frequency...
In order to do that, since the KPI was computed using an aggregation, you need to be able to reference the aggregation result (now a property of your bucket) in your significant_terms heuristic...
It's just a way to take advantage of the capability of the significant_terms aggregation to identify sets (point #1) and to compare them (point #2), but using a metric other than cardinality.
Here is perhaps a more formal definition of why averages won't work when quantifying set intersections.
A is the candidate significant term e.g. h5n1.
B is the result set e.g. "bird flu" search
U is all docs.
Significance of A to the result set is determined by some statistical comparison of A's foreground popularity vs. its background popularity, i.e. |A ∩ B| / |B| and |A| / |U|.
The following assertions must be true:
|A| < |U|
|B| < |U|
|A ∩ B| ≤ |A|
|A ∩ B| ≤ |B|
We can size these sets by numbers of docs, or perhaps by sums of doc properties (e.g. price), and all the assertions will hold true.
However, if you try to use an average of a value (e.g. price) to describe the content of each of these regions, then you can break all of these assertions and we can no longer reason about set intersections.
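A quick sketch of that argument in code (the prices and set memberships are hypothetical; `A` is the term's doc set, `B` the result set, `U` all docs). Sizing by sums preserves the containment assertions; a single expensive doc lets the intersection's average exceed both parents':

```python
# Hypothetical prices, keyed by doc id.
U = {1: 5.0, 2: 7.0, 3: 100.0, 4: 3.0, 5: 4.0}  # all docs
A = {3, 4}       # docs containing the candidate term
B = {3, 5}       # docs matching the search
AB = A & B       # the intersection

def total(ids):
    """Size a set by summing a doc property (here: price)."""
    return sum(U[i] for i in ids)

def average(ids):
    return total(ids) / len(ids)

# Sums behave like set sizes: a subset can never outweigh its parent.
assert total(AB) <= total(A) <= total(U)
assert total(AB) <= total(B) <= total(U)

# Averages do not: the intersection (one expensive doc) beats both parents.
print(average(AB), average(A), average(B))  # 100.0 51.5 52.0
```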
We have documents that represent disease cases:
(patient_name, disease_name, duration, city)
By comparing the different sets, we want to end up building a table similar to the following:

| Disease | Avg duration (SF) | Avg duration (globally) |
|---------|-------------------|-------------------------|
| H5N1    | 10                | 2                       |
| Malaria | 296               | 277                     |

So, we should identify H5N1/SF as an anomaly; that is, the average duration of H5N1 in SF vs. globally (10:2) is more significant than the average duration of Malaria in SF vs. globally (296:277). H5N1 in SF would get a higher anomaly "score" than Malaria in SF.
We are still reasoning about sets, but not only about cardinality: the cardinality is involved when calculating the average duration, but we also need to compute the duration sum (duration sum / count = avg).
In our final step, I agree that we are not comparing to U, but to B. That is, after finding the average duration of each disease for the city and globally, we no longer care about U (all diseases for all cities).
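A toy version of that computation, with hypothetical sample data chosen to reproduce the 10:2 and 296:277 averages (here "globally" is read as "across the other cities", so the anomaly score is the ratio of the city's average to the average elsewhere):

```python
# Hypothetical disease-case docs: (patient, disease, duration_days, city).
docs = [
    ("p1", "h5n1", 10, "SF"),
    ("p2", "h5n1", 2, "NY"), ("p3", "h5n1", 2, "NY"),
    ("p4", "h5n1", 2, "LA"), ("p5", "h5n1", 2, "LA"),
    ("p6", "malaria", 296, "SF"),
    ("p7", "malaria", 277, "NY"), ("p8", "malaria", 277, "LA"),
]

def avg(xs):
    return sum(xs) / len(xs)

def anomaly_score(disease, city):
    """City average duration vs. the average across the other cities."""
    in_city = [d for _, dis, d, c in docs if dis == disease and c == city]
    elsewhere = [d for _, dis, d, c in docs if dis == disease and c != city]
    return avg(in_city) / avg(elsewhere)

print(anomaly_score("h5n1", "SF"))               # 5.0  (10 vs. 2)
print(round(anomaly_score("malaria", "SF"), 2))  # 1.07 (296 vs. 277)
```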
Your use case compares 2 numbers, but significance heuristics use all 4 numbers to examine set intersections differently. They might reveal insights like:
H5n1 accounts for more filled beds in SF hospitals than in other cities (proportionately)
In company investment data, the amount of money invested in SF's technology sector is disproportionate to the investments in SF's other industry sectors.
In these cases the average amounts (durations/investments) may be unremarkable, but the extent to which the subjects (h5n1/tech) dominate other subjects, in proportion to the city's scale, is the remarkable characteristic.
I understand your examples and why they can be achieved using the current behavior of the significant_terms aggregation (although I think the second one would require summing up the investments in order to compare the sets, and so would face the same issue as my use case?)
My use case is slightly different, but still about sets and heuristics.
Let me try to clarify my goal: I had the feeling that my use case was not too far from what can be achieved with the significant_terms aggregation. I understand the "limitations", but I was wondering whether enhancing the significant_terms aggregation to support a more customizable form of heuristic (based on a previous aggregation) would be enough?
So I can see 3 potential answers:
1. Yes, in theory it would be enough, or it could be valuable for some other use cases (in this case I would be happy to try to contribute to this enhancement).
2. No, it won't be enough, or comparing sets based on metrics other than their cardinalities cannot be considered an enhancement of the significant_terms aggregation.
3. The significant_terms aggregation is not the solution, but you can achieve your use case using this other amazing Elasticsearch feature...
Regarding #3, I tried to do the same thing using regular aggregations and the bucket-filter aggregation, in order to filter one bucket based on a value computed by a parent aggregation (avg, from the extended stats agg).
But since a pipeline aggregation cannot reference a parent aggregation (I would also be interested to understand why, btw), I end up having to decompose my query into two distinct queries: roughly, the first one computes the average and the second one uses this average to filter the aggregation results.
We already have pluggable significance heuristics. They each work with the same 4 numbers that describe set sizes but score with different levels of precision/recall. Some heuristics favour rare terms that are high precision (exclusively related to your search results) but low recall (very rare terms may not be very useful for expanding a search to find other similar docs). Other heuristics reward more common terms that are low precision and high recall. It's a balancing act depending on your precision/recall needs.
Correct. The issue I linked to was a proposal to allow such summing of document properties. The existing heuristics would reason about the same 4 numbers to describe levels of set intersection; it's just that the set-size numbers may describe sums of money rather than docs, and as such may need to use floats, not longs. Sums are required here rather than averages etc., otherwise the invariants of set logic I outlined previously would not hold.
I think it might be answer 3, given averages wouldn't work with the existing significant_terms. You might be able to avoid the 2 queries you mention in your workaround if you also use the global agg to summarise ALL docs alongside your normal aggs that summarise the query results. That way you get background and foreground values in one request, and you can do your computation in the client to spot any big movers between the two.
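A sketch of that single-request shape, with hypothetical field names (`city`, `price`) and an SF filter as the foreground query. The `global` agg ignores the query, so `bg_avg` is computed over all docs while `fg_avg` only sees the query's results:

```json
{
  "query": { "term": { "city": "SF" } },
  "aggs": {
    "fg_avg": { "avg": { "field": "price" } },
    "all_docs": {
      "global": {},
      "aggs": {
        "bg_avg": { "avg": { "field": "price" } }
      }
    }
  }
}
```

The client can then compare `fg_avg` to `bg_avg` (e.g. as a ratio) to flag the big movers.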