Detail questions about significant terms aggregation


(Valentin Pletzer) #1

Hi,

first of all: I really love the new significant terms aggregation as well
as the cardinality aggregation. Thanks a lot!

I have a few detailed questions:

  • What is bg_count? I assume it stands for background count, but what
    does it mean?
  • At first I thought the score values were between 0 and 1, but some are
    much bigger. Can anyone give me a rough explanation?

Cheers
Valentin

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0aa544b5-a2a4-40ae-986d-03955a27ea60%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Hannes Korte) #2

Hi Valentin,

  • What is bg_count? I assume it stands for background count, but what
    does it mean?

The bg_count is the number of documents in the whole index that contain
the term (not just those in the search result).

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html

  • At first I thought the score values were between 0 and 1, but some are
    much bigger. Can anyone give me a rough explanation?

You can see the code of the computation here:

https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/aggregations/bucket/significant/InternalSignificantTerms.java#L94

This is a summarized version of the formula:

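    // Freq = number of documents containing the term, Size = total number of
    // documents; subset = the search result, superset = the whole index.
    // (These variable names are illustrative.)
    double subsetProb = (double) subsetFreq / subsetSize;       // relative frequency in the search result
    double supersetProb = (double) supersetFreq / supersetSize; // relative frequency in the whole index
    double absoluteProbChange = subsetProb - supersetProb;
    if (absoluteProbChange <= 0) {
        return 0; // the term is no more common in the result than in the index
    }
    double relativeProbChange = subsetProb / supersetProb;
    return absoluteProbChange * relativeProbChange;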
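
As a rough illustration of why the score is not bounded by 1, the summarized
formula can be wrapped in a small method and fed some made-up counts. This is
only a sketch: the class name, method name, parameter names, the zero guards
and all numbers are assumptions for illustration, not taken from the
Elasticsearch source.

```java
public class SignificanceSketch {

    // Hypothetical helper following the summarized formula.
    // subsetFreq / subsetSize: docs containing the term / all docs in the search result.
    // supersetFreq / supersetSize: the same two counts for the whole index.
    public static double score(long subsetFreq, long subsetSize,
                               long supersetFreq, long supersetSize) {
        if (subsetSize == 0 || supersetSize == 0 || supersetFreq == 0) {
            return 0; // guard against division by zero (an assumption of this sketch)
        }
        double subsetProb = (double) subsetFreq / subsetSize;
        double supersetProb = (double) supersetFreq / supersetSize;
        double absoluteProbChange = subsetProb - supersetProb;
        if (absoluteProbChange <= 0) {
            return 0; // term is no more common in the result than in the index
        }
        double relativeProbChange = subsetProb / supersetProb;
        return absoluteProbChange * relativeProbChange;
    }

    public static void main(String[] args) {
        // Made-up counts: term in 50 of 100 hits, but only 1,000 of 1,000,000 indexed docs.
        System.out.println(score(50, 100, 1_000, 1_000_000));
        // Made-up counts: term is rarer in the hits than in the index, so the score is 0.
        System.out.println(score(1, 100, 50_000, 1_000_000));
    }
}
```

With the first set of numbers, subsetProb is 0.5 and supersetProb is 0.001,
so the absolute change is 0.499 and the relative change is about 500, which
gives a score of roughly 249.5. The rarer a term is in the whole index, the
larger the relative change becomes, so scores far above 1 are expected.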

I guess in the future there will be support for other scoring methods such
as mutual information, chi-squared, or information gain.

Best regards,
Hannes



(Valentin Pletzer) #3

Hi Hannes,

thanks for the info. Scoring methods like mutual information sound fun.

Best regards,
Valentin

(In reply to Hannes Korte's message of Friday, April 11, 2014 7:40:14 PM UTC+2.)


