Similar document detection

avaneeshr · February 18, 2017, 3:04pm

Hi!
I have a large (100 GB) index containing web pages (url, content, etc.).
Many of these pages come from the same domain and are very similar (roughly 90% identical, with only a few modifications).
I am implementing full-text search over these documents and currently de-duplicate the results using the Jaccard coefficient as a post-processing step in my Java app. This eliminates the duplicates, but I would like to do it with native Elasticsearch features if possible (something like a "dedup" option).
Is it possible to remove the de-duplication logic from my app and have Elasticsearch itself filter out near-duplicate results?
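
For context, a minimal sketch of the kind of Jaccard-based post-processing described above might look like the following. The class name `JaccardDedup`, the word-shingle representation, and the shingle size and threshold passed to `dedup` are illustrative assumptions, not details from the original application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: drops hits that are near-duplicates of an earlier, higher-ranked hit. */
final class JaccardDedup {

    /** Splits text into lowercase word shingles (overlapping runs of k words). */
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        if (words.length < k) {
            // Text shorter than k words: treat the whole text as one shingle.
            result.add(String.join(" ", words));
            return result;
        }
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    /** Jaccard coefficient: |A ∩ B| / |A ∪ B|. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /** Keeps a hit only if it is below the similarity threshold against every hit kept so far. */
    static List<String> dedup(List<String> rankedHits, int k, double threshold) {
        List<String> kept = new ArrayList<>();
        List<Set<String>> keptShingles = new ArrayList<>();
        for (String hit : rankedHits) {
            Set<String> candidate = shingles(hit, k);
            boolean duplicate = false;
            for (Set<String> other : keptShingles) {
                if (jaccard(candidate, other) >= threshold) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(hit);
                keptShingles.add(candidate);
            }
        }
        return kept;
    }
}
```

Something like `JaccardDedup.dedup(pageOfHits, 3, 0.9)` would be applied to the content of each page of results. The comparison is pairwise, so the cost grows quadratically with the number of hits examined, which is one reason to want this handled inside the search engine instead.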

I am using Elasticsearch version 2.4.0.

Thanks in advance

warkolm · February 18, 2017, 10:10pm

There's not really anything like this natively.

