It's a little confusing how the score of the Significant Terms aggregation is calculated. I'm also unsure about the criteria for flagging items as significant. I know that a term is considered significant if there is a noticeable difference between the frequency with which it appears in the subset and in the background, but it's unclear what counts as a "noticeable difference".
Can anybody please clarify these questions? Thanks.
The scores are derived from the doc frequencies in the foreground and background sets. The absolute change in popularity (foregroundPercent - backgroundPercent) would favor common terms, whereas the relative change in popularity (foregroundPercent / backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance, so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
JLH is the default scoring algo but there are others and they all use the same 4 numbers as input:
size of foreground set
frequency of term in foreground set
size of background set
frequency of term in background set.
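To make the multiplication described above concrete, here is a minimal Python sketch of a JLH-style score built from those four numbers. The function name `jlh_score`, its parameter names, and the edge-case handling (empty sets, a term missing from the background, a term that is not more popular in the foreground) are assumptions made for illustration, not a copy of the Elasticsearch implementation.

```python
def jlh_score(fg_freq: int, fg_size: int, bg_freq: int, bg_size: int) -> float:
    """JLH-style score: absolute change times relative change in popularity.

    All four inputs are document counts. Names and edge-case handling
    are illustrative assumptions, not the Elasticsearch source.
    """
    if fg_size == 0 or bg_size == 0:
        return 0.0  # assumption: nothing to compare against

    # Assumption: treat a term absent from the background as seen once,
    # so the relative change below never divides by zero.
    bg_freq = max(bg_freq, 1)

    fg_percent = fg_freq / fg_size  # foregroundPercent
    bg_percent = bg_freq / bg_size  # backgroundPercent

    absolute_change = fg_percent - bg_percent  # favors common terms
    if absolute_change <= 0:
        return 0.0  # assumption: not more popular in the foreground

    relative_change = fg_percent / bg_percent  # favors rare terms
    return absolute_change * relative_change


# A common term with a small relative lift vs. a rare term with a large one.
common = jlh_score(fg_freq=50, fg_size=100, bg_freq=4000, bg_size=10000)
rare = jlh_score(fg_freq=5, fg_size=100, bg_freq=10, bg_size=10000)
print(common, rare)
```

In this made-up example the common term gains 10 percentage points but is only 1.25 times more popular, while the rare term gains under 5 points but is 50 times more popular, so the product ranks the rare term higher.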
This video provides a visual demo of scores on various queries and the positive effects of sampling.
If I have 3 documents in the foreground set, each with the term "hello":
doc1: hello shows up 2 times
doc2: hello shows up 1 time
doc3: hello shows up 4 times
Then "frequency of term in foreground set" would be 7, correct?
And "size of foreground set" would be 3, since there are 3 docs? Or would it be the total number of terms across the 3 docs?