I have a large 100GB index which contains webpages (url, content, etc.).
A lot of these pages are from the same domain and are very similar (like 90% similar, with just a few modifications).
I am trying to implement full-text search over these documents. Currently I de-duplicate the results as a post-processing step in my Java app, using the Jaccard coefficient. This eliminates the duplicates, but I would prefer to do it with native Elasticsearch functionality, if possible (something like a "dedup" feature).
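For context, the post-processing step is conceptually similar to the sketch below. It is a simplified illustration, not the actual application code: the class name `JaccardDedup`, the method names, the word-level tokenization, and the similarity threshold are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: drops hits that are near-duplicates of an earlier hit.
class JaccardDedup {

    // Split text into a set of lowercase word tokens (assumed tokenization).
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard coefficient: |A ∩ B| / |A ∪ B|. Two empty sets count as identical.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Keep a hit only if it is not too similar to any hit already kept.
    // Result order is preserved, so the highest-ranked copy survives.
    static List<String> dedup(List<String> hits, double threshold) {
        List<String> kept = new ArrayList<>();
        List<Set<String>> keptTokens = new ArrayList<>();
        for (String hit : hits) {
            Set<String> current = tokens(hit);
            boolean duplicate = false;
            for (Set<String> previous : keptTokens) {
                if (jaccard(current, previous) >= threshold) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(hit);
                keptTokens.add(current);
            }
        }
        return kept;
    }
}
```

Each hit is compared against every hit already kept, so the cost grows quadratically with the number of results on a page. That is acceptable for a single page of results, but it is part of the reason I would like to move this into the search engine.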
Is it possible to remove the de-duplication logic from my app and have Elasticsearch itself filter out near-duplicate results?
I am using Elasticsearch 2.4.0.
Thanks in advance.