How is the score of the Significant Terms aggregation calculated?

It's a little unclear to me how the score of the Significant Terms aggregation is calculated. I'm also unsure about the criteria for flagging a term as significant. I know that a term is considered significant if there is a noticeable difference between how frequently it appears in the subset and in the background, but it's unclear what counts as a "noticeable difference".

Can anybody please clarify these questions? Thanks.

From the docs:

The scores are derived from the doc frequencies in foreground and background sets. The absolute change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the relative change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.

JLH is the default scoring algo but there are others and they all use the same 4 numbers as input:

  • size of foreground set
  • frequency of term in foreground set
  • size of background set
  • frequency of term in background set.

This video provides a visual demo of scores on various queries and the positive effects of sampling.

@Mark_Harwood Thanks for the info.

So the score = (foregroundPercent - backgroundPercent) * (foregroundPercent/ backgroundPercent), correct?

Is the foregroundPercent always greater than the backgroundPercent in this case?

For the JLH score, yes: only terms that are more common in the foreground than in the background get a positive score. Positive correlations generally tend to be the ones people want rather than negative correlations...

"like this product? Then here's some others you'll hate...."

If you want the negative correlations try the mutual information heuristic.
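For anyone following along, here is a minimal Python sketch of the JLH formula as described in the docs quote above, taking the four numbers as input. The function name `jlh_score` and its parameter names are made up for illustration, and the handling of empty sets, a zero background frequency, and non-positive changes is an assumption about reasonable behaviour, not a copy of the Elasticsearch source.

```python
def jlh_score(fg_doc_freq, fg_size, bg_doc_freq, bg_size):
    """Sketch of the JLH heuristic: absolute change * relative change.

    All four inputs are document counts, not term occurrence counts.
    """
    # Assumption: with an empty foreground or background there is nothing to compare.
    if fg_size == 0 or bg_size == 0:
        return 0.0
    # Assumption: treat an unseen background term as seen once to avoid dividing by zero.
    if bg_doc_freq == 0:
        bg_doc_freq = 1

    fg_percent = fg_doc_freq / fg_size
    bg_percent = bg_doc_freq / bg_size

    # Absolute change favours common terms.
    absolute_change = fg_percent - bg_percent
    # Terms that are not more popular in the foreground get no score.
    if absolute_change <= 0:
        return 0.0

    # Relative change favours rare terms.
    relative_change = fg_percent / bg_percent
    return absolute_change * relative_change


# Term appears in 50 of 100 foreground docs and 1,000 of 100,000 background docs.
score = jlh_score(fg_doc_freq=50, fg_size=100, bg_doc_freq=1_000, bg_size=100_000)

# Term is equally common in both sets, so it is not significant.
flat = jlh_score(fg_doc_freq=10, fg_size=100, bg_doc_freq=10_000, bg_size=100_000)
```

In the first call the foreground percent is 0.5 and the background percent is 0.01, so the score is roughly 0.49 * 50 = 24.5. In the second call both percents are 0.1, so the score is 0.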

Thank you !

If I have 3 documents in the foreground set, each with the term "hello":
doc1: hello shows up 2 times
doc2: hello shows up 1 time
doc3: hello shows up 4 times

Then "frequency of term in foreground set" would be 7, correct?
And "size of foreground set" would be 3 (since there are 3 docs), or would it be the total number of terms across the 3 docs?

Then "frequency of term in foreground set" would be 7, correct?

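Nope. Doc frequencies are the number of docs that contain the word at least once. In your example, the frequency of "hello" in the foreground set is 3, and the size of the foreground set is also 3 (the number of docs).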
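To make the difference between doc frequency and raw occurrence count concrete, here is a small Python sketch using the "hello" example from the question. The `doc_frequency` helper and the whitespace tokenisation are illustrative assumptions only; Elasticsearch derives these counts from its index after analysis.

```python
def doc_frequency(term, docs):
    """Count the docs that contain the term at least once."""
    # Assumption: naive whitespace tokenisation stands in for a real analyzer.
    return sum(1 for text in docs if term in text.split())


# The three foreground docs from the question: "hello" appears 2, 1 and 4 times.
foreground_docs = [
    "hello hello",
    "hello",
    "hello hello hello hello",
]

foreground_size = len(foreground_docs)                      # number of docs
hello_doc_freq = doc_frequency("hello", foreground_docs)    # docs containing "hello"
hello_occurrences = sum(text.split().count("hello") for text in foreground_docs)

print(foreground_size, hello_doc_freq, hello_occurrences)
```

The scoring heuristics use `foreground_size` and `hello_doc_freq` (both 3 here); the total of 7 occurrences is not one of the four inputs.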
