We have an Elasticsearch index containing millions of documents generated by users' interactions with our product, so each document represents an organic event. When we index these documents, we de-dupe similar ones (based on hash collisions) and store the duplicate count on each unique document.
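For context, a de-duped document looks roughly like this (all field names are illustrative; `dedupe_count` records how many raw events collapsed into this one document):

```json
{
  "event_type": "click",
  "user_segment": "trial",
  "content_hash": "9f2c4a17...",
  "dedupe_count": 200
}
```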
I want to run a significant_terms aggregation on one of the fields. The problem is that because the documents are de-duped, the results I get back from significant_terms are poor. If I arbitrarily break a document up into multiple similar documents (e.g. if the de-dupe count for a doc is 200, create 40 similar documents, one per 5 occurrences), the quality of the results improves. But I don't want to do that, since I know it would cause a data explosion in my index.
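The aggregation itself is the standard one, roughly like this (index, query, and field names are placeholders for our real ones):

```json
POST /events/_search
{
  "size": 0,
  "query": {
    "term": { "user_segment": "trial" }
  },
  "aggs": {
    "significant_event_types": {
      "significant_terms": { "field": "event_type" }
    }
  }
}
```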
Is there a way I can make use of the de-dupe count that is already stored in the index to control the bg_count and doc_count in significant_terms? I believe that would solve my problem.
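To illustrate the intent: with a plain terms aggregation I can at least recover the true frequencies by summing `dedupe_count` in a sub-aggregation, and what I'm after is for significant_terms to score buckets using those sums as doc_count and bg_count instead of the raw (de-duped) document counts:

```json
{
  "size": 0,
  "aggs": {
    "by_event_type": {
      "terms": { "field": "event_type" },
      "aggs": {
        "true_count": { "sum": { "field": "dedupe_count" } }
      }
    }
  }
}
```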