How is the score of Significant Term aggregation calculated?

ahrtr · July 24, 2018, 8:44am

It's a little confusing how the score of the Significant Terms aggregation is calculated. I'm also unsure about the criteria for flagging a term as significant. I know that a term is considered significant if there is a noticeable difference between how frequently it appears in the subset and in the background, but it's unclear what counts as a "noticeable difference".

Can anybody please clarify these questions? Thanks.

Mark_Harwood · July 24, 2018, 9:33am

From the docs:

The scores are derived from the doc frequencies in foreground and background sets. The absolute change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the relative change in popularity (foregroundPercent / backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.

JLH is the default scoring algorithm, but there are others, and they all take the same four numbers as input:

size of foreground set
frequency of term in foreground set
size of background set
frequency of term in background set.
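
To make the quoted description concrete, here is a minimal Python sketch of a JLH-style score built from those four numbers. The function name `jlh_score` and its parameter names are made up for illustration, and the handling of empty sets and non-positive changes (returning 0) is an assumption about edge cases, not a copy of the Elasticsearch source.

```python
def jlh_score(fg_freq, fg_size, bg_freq, bg_size):
    """Sketch of a JLH-style score from four document counts.

    fg_freq: number of foreground docs containing the term
    fg_size: number of docs in the foreground set
    bg_freq: number of background docs containing the term
    bg_size: number of docs in the background set
    """
    # Assumption: with nothing to compare against, score the term as 0.
    if fg_size == 0 or bg_size == 0 or bg_freq == 0:
        return 0.0

    fg_pct = fg_freq / fg_size
    bg_pct = bg_freq / bg_size

    absolute_change = fg_pct - bg_pct  # favors common terms
    if absolute_change <= 0:
        # Assumption: terms that are not more popular in the
        # foreground get no score.
        return 0.0

    relative_change = fg_pct / bg_pct  # favors rare terms
    return absolute_change * relative_change


# A term that is rare in the background but in half of the foreground docs
# gets a large score (roughly 24.5 here).
print(jlh_score(fg_freq=50, fg_size=100, bg_freq=1_000, bg_size=100_000))

# A term that is common everywhere scores far lower, even though it appears
# in more foreground docs.
print(jlh_score(fg_freq=90, fg_size=100, bg_freq=80_000, bg_size=100_000))
```

The two calls show the balance the docs describe: the absolute change alone would rank the second term close to the first, while multiplying by the relative change pushes the rarer term well ahead.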

This video provides a visual demo of scores on various queries and the positive effects of sampling.

ahrtr · July 24, 2018, 11:54pm

@Mark_Harwood Thanks for the info.

So the score = (foregroundPercent - backgroundPercent) * (foregroundPercent/ backgroundPercent), correct?

Is the foregroundPercent always greater than the backgroundPercent in this case?

Mark_Harwood · July 25, 2018, 8:05am

For the JLH score, yes. Positive correlations generally tend to be the ones people want rather than negative correlations....

"like this product? Then here's some others you'll hate...."

If you want the negative correlations try the mutual information heuristic.
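
As a rough illustration of switching heuristics, the sketch below builds a search request body that selects mutual information for a significant_terms aggregation. The helper name, the aggregation name "significant", and the field and query values are hypothetical; the `mutual_information` block with `include_negatives` is written as I understand the significant_terms options, so check it against the docs for your version.

```python
import json


def significant_terms_request(field, query, include_negatives=True):
    """Build a search body using the mutual_information heuristic.

    field and query are placeholders supplied by the caller; the
    aggregation name "significant" is arbitrary.
    """
    return {
        "query": query,
        "aggs": {
            "significant": {
                "significant_terms": {
                    "field": field,
                    # Assumption: include_negatives controls whether terms
                    # that are less frequent in the foreground are kept.
                    "mutual_information": {
                        "include_negatives": include_negatives
                    },
                }
            }
        },
    }


# Hypothetical field and query, for illustration only.
body = significant_terms_request(
    field="tags",
    query={"match": {"product": "widget"}},
)
print(json.dumps(body, indent=2))
```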

ahrtr · July 25, 2018, 8:17am

Thank you !

mind_scratch · August 15, 2018, 2:00pm

If I have 3 documents in the foreground set, each with the term "hello":
doc1: hello shows up 2 times
doc2: hello shows up 1 time
doc3: hello shows up 4 times

Then "frequency of term in foreground set" would be 7, correct?
And "size of foreground set" would be 3, since there are 3 docs? Or would it be the total number of terms over the 3 docs?

Mark_Harwood · August 15, 2018, 2:16pm

Then "frequency of term in foreground set" would be 7, correct?

Nope. Doc frequencies are the number of docs that contain the word at least once.
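
A small Python sketch of the difference, using the "hello" example from the question. The document texts are invented to match the counts given (2, 1 and 4 occurrences), whitespace splitting stands in for real analysis, and the set size is taken to be the number of docs, which is an assumption consistent with the scores being based on doc frequencies.

```python
# Invented texts matching the counts in the question.
docs = {
    "doc1": "hello hello world",
    "doc2": "hello there",
    "doc3": "hello hello hello hello",
}


def doc_frequency(term, docs):
    """Number of docs that contain the term at least once."""
    return sum(1 for text in docs.values() if term in text.split())


def total_occurrences(term, docs):
    """Total number of times the term appears across all docs."""
    return sum(text.split().count(term) for text in docs.values())


print(doc_frequency("hello", docs))      # 3: the number used for scoring
print(total_occurrences("hello", docs))  # 7: not used
print(len(docs))                         # 3: set size, counted in docs
```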

system · September 12, 2018, 2:17pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
