Control bg_count and doc_count in significant_terms

chirag · March 3, 2019, 8:59am

We have an Elasticsearch index containing millions of documents generated by users' interactions with our product, so these documents represent organic events. When we index them, we de-dupe similar documents (based on hash collision) and store the de-dupe count for each unique doc.

I want to use the significant terms aggregation on a field. The problem is that, since the documents are de-duped, the results I get back from the aggregation are not good. If I arbitrarily break a document up into multiple similar documents (e.g. if the de-dupe count for a doc is 200, create 40 similar documents, 1 per 5 occurrences), the quality of the results improves. But I don't want to do that, since I know it would lead to data explosion in my index.

Is there a way I can make use of the de-dupe count I have already stored in the index to control the bg_count & doc_count in significant_terms? I believe doing this might solve my problem.

Mark_Harwood · March 3, 2019, 9:53am

I understand the motivation but, no, that is not currently possible. You’d have to use regular ‘terms’ aggregations (one in the regular agg scope and one under a ‘global’ agg), each with a ‘sum’ sub-agg of the dedupeCount field. This would get you the foreground and background stats, and you’d need to do your own significance calculation in the client.
Because the trimming of insignificant terms happens late (in your client), you may struggle with response sizes. If the significant stuff you hope to find is low frequency and hidden amongst many other insignificant terms, you might need to break the analysis up into multiple requests (see terms agg “partitioning”) just to avoid memory limits on each response.
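
A minimal sketch of that client-side approach, in Python. Everything here beyond the `dedupeCount` field is an assumption for illustration: the agg names (`fg_terms`, `bg_terms`, `fg_total`, `bg_total`, `weight`), the helper names, and the choice of the JLH heuristic (absolute change times relative change) as the significance score. The request body would be sent with whatever client you already use, and terms missing from the background buckets are simply skipped.

```python
def build_request(query, term_field, weight_field="dedupeCount",
                  size=1000, partition=None, num_partitions=None):
    """Build a search body with weighted foreground and background term stats."""
    terms = {"field": term_field, "size": size}
    if partition is not None:
        # Terms agg partitioning: analyse one slice of the terms per request.
        terms["include"] = {"partition": partition,
                            "num_partitions": num_partitions}

    def weighted_terms():
        # A terms agg whose buckets also carry the summed de-dupe count.
        return {"terms": dict(terms),
                "aggs": {"weight": {"sum": {"field": weight_field}}}}

    return {
        "size": 0,
        "query": query,  # defines the foreground set
        "aggs": {
            "fg_total": {"sum": {"field": weight_field}},
            "fg_terms": weighted_terms(),
            "bg": {  # 'global' ignores the query, giving the background set
                "global": {},
                "aggs": {
                    "bg_total": {"sum": {"field": weight_field}},
                    "bg_terms": weighted_terms(),
                },
            },
        },
    }


def jlh_scores(aggs):
    """Score foreground terms with a JLH-style heuristic on weighted counts."""
    fg_total = aggs["fg_total"]["value"]
    bg_total = aggs["bg"]["bg_total"]["value"]
    if not fg_total or not bg_total:
        return []
    bg = {b["key"]: b["weight"]["value"]
          for b in aggs["bg"]["bg_terms"]["buckets"]}
    scores = {}
    for bucket in aggs["fg_terms"]["buckets"]:
        bg_count = bg.get(bucket["key"])
        if not bg_count:
            continue  # term not returned in the background buckets
        fg_pct = bucket["weight"]["value"] / fg_total
        bg_pct = bg_count / bg_total
        if fg_pct > bg_pct:  # keep only terms over-represented in foreground
            scores[bucket["key"]] = (fg_pct - bg_pct) * (fg_pct / bg_pct)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

To work through the terms in slices, the same body can be built once per partition, e.g. `build_request(query, "tags", partition=i, num_partitions=20)` for `i` in `range(20)` (the field name `tags` is hypothetical), scoring each response and merging the results in the client.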

The reason a ‘script’ aggregation with some custom Painless code can’t be used to “trim early” at the data nodes is that it can’t access the background stats required to compute significance. You’d need a custom Java plugin to do this aggregation.

system · March 31, 2019, 9:53am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
