Control bg_count and doc_count in significant_terms

We have an Elasticsearch index containing millions of documents generated by users' interactions with our product, so these documents represent organic events. When we index them, we de-dupe similar documents (based on hash collision) and store the de-dupe count for each unique doc.

I want to use the significant terms aggregation on a field. The problem is that, because the documents are de-duped, the results I get back from the significant terms aggregation are not good. If I arbitrarily break up a document into multiple similar documents (e.g. if the de-dupe count for a doc is 200, create 40 similar documents, 1 per 5 occurrences), then the quality of the results improves. But I don't want to do that, since I know it would lead to data explosion in my index.

Is there a way I can make use of the de-dupe count I have already stored in the index to control the bg_count & doc_count in significant_terms? I believe doing this might solve my problem.

I understand the motivation but, no, that is not currently possible. You’d have to use regular ‘terms’ aggregations (one in the regular agg scope and one under a ‘global’ agg), each with a ‘sum’ sub-agg of the dedupeCount field. This would get you the foreground and background stats, and you’d need to do your own significance calculation in the client.
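
As a rough sketch of that approach (not an official recipe): the request below runs one ‘terms’ agg in the query scope and one under a ‘global’ agg, each with a ‘sum’ of dedupeCount, and the client then scores each term. The field name "tags", the agg names (fg_terms, bg_terms, and so on) and the bucket size are assumptions for illustration. The score is modelled on the JLH heuristic that significant_terms uses by default, applied here to summed weights rather than doc counts.

```python
def build_request(query, field="tags", weight_field="dedupeCount", size=1000):
    """Build a search body with weighted foreground and background term stats.

    'field' and the agg names are hypothetical; 'weight_field' holds the
    stored de-dupe count.
    """
    weighted_terms = {
        "terms": {"field": field, "size": size},
        "aggs": {"weight": {"sum": {"field": weight_field}}},
    }
    total = {"sum": {"field": weight_field}}
    return {
        "size": 0,
        "query": query,  # defines the foreground set
        "aggs": {
            "fg_terms": weighted_terms,
            "fg_total": total,
            # 'global' ignores the query, so these are background stats
            "background": {
                "global": {},
                "aggs": {"bg_terms": weighted_terms, "bg_total": total},
            },
        },
    }


def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """JLH-style score: (fg% - bg%) * (fg% / bg%), 0 if not over-represented."""
    if fg_total == 0 or bg_total == 0 or bg_count == 0:
        return 0.0
    fg_pct = fg_count / fg_total
    bg_pct = bg_count / bg_total
    if fg_pct <= bg_pct:
        return 0.0
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)


def score_terms(response):
    """Rank foreground terms by weighted significance, highest first."""
    aggs = response["aggregations"]
    fg_total = aggs["fg_total"]["value"]
    bg = aggs["background"]
    bg_total = bg["bg_total"]["value"]
    bg_weights = {b["key"]: b["weight"]["value"] for b in bg["bg_terms"]["buckets"]}
    scored = []
    for bucket in aggs["fg_terms"]["buckets"]:
        bg_count = bg_weights.get(bucket["key"])
        if bg_count is None:
            continue  # term fell outside the background buckets returned
        score = jlh_score(bucket["weight"]["value"], fg_total, bg_count, bg_total)
        scored.append((bucket["key"], score))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)
```

The body from build_request can be sent with whatever client you already use; score_terms only needs the parsed JSON response. Terms that appear in the foreground but are cut off from the background bucket list are skipped here, which is one reason the bucket size may need to be generous.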
Because the trimming of insignificant terms happens late (in your client) you may struggle with response sizes. If the significant stuff you hope to find is low frequency and hidden amongst many other insignificant terms you might need to break the analysis up into multiple requests (see terms agg “partitioning”) just to avoid memory limits on each response.
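
A minimal sketch of the partitioning idea, assuming the terms aggregation's ‘include’ option with ‘partition’ and ‘num_partitions’: each request covers one slice of the term space, and using the same partition number for the foreground and background aggs should keep the two sets of terms aligned. The agg names, the default bucket size and the function names are hypothetical.

```python
def partitioned_terms_agg(field, weight_field, partition, num_partitions, size=1000):
    """A 'terms' agg restricted to one partition, with a summed weight per term."""
    return {
        "terms": {
            "field": field,
            "size": size,
            "include": {"partition": partition, "num_partitions": num_partitions},
        },
        "aggs": {"weight": {"sum": {"field": weight_field}}},
    }


def partitioned_requests(query, field, weight_field, num_partitions):
    """Yield one search body per partition of the term space."""
    for partition in range(num_partitions):
        fg = partitioned_terms_agg(field, weight_field, partition, num_partitions)
        bg = partitioned_terms_agg(field, weight_field, partition, num_partitions)
        yield {
            "size": 0,
            "query": query,  # foreground scope
            "aggs": {
                "fg_terms": fg,
                # 'global' escapes the query to give background stats
                "background": {"global": {}, "aggs": {"bg_terms": bg}},
            },
        }
```

The client would send these bodies one at a time, score the terms in each response, and keep only the top few from each before moving on, so no single response has to carry every term.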

A ‘script’ aggregation with some custom painless code can’t be used to “trim early” at the data nodes because it can’t access the background stats required to compute significance. You’d need a custom Java plugin to do this aggregation.
