I really like the new significant_text aggregation in ES 6, but what I really love is the filter_duplicate_text parameter, backed by the DuplicateByteSequenceSpotter.
We have a strong interest in detecting duplicate term sequences in our indices, and I was wondering if we could somehow make use of this golden nugget. My use case would be more like detecting and flagging documents that contain duplicate byte sequences.
I don't know much about the ES codebase, but is the DeDuplicatingTokenFilter a real (and so far undocumented) token filter, or is it only meant to be used in the context of aggregations?
It would also be interesting to know whether there are any plans to use the DeDuplicatingTokenFilter in other aggregations or queries (e.g. more_like_this).
Thanks - I'm particularly happy with that feature too.
I think the challenge with using it in an indexing context (as opposed to its use in analyzing search results) is that similar documents are less likely to be encountered in close succession. The spotter relies on maintaining a window of recently seen content, which it uses to recognize sequences it has encountered before. To use it on a continuous stream of data, the following concerns come into play:

- We would need a pruning policy to avoid memory bloat, taking care to retain sequences that look like useful/promising duplicates.
- We may be unable to spot duplicates that are spread far apart. In a search context, similar documents rank similarly, so there's a better chance of detecting duplicates within a window of only the top-matching docs.
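To make the windowing and pruning concerns concrete, here is a minimal sketch of the general idea - not the Lucene DuplicateByteSequenceSpotter implementation, whose internals differ. It tracks byte n-grams seen across a stream of documents in a bounded, insertion-ordered map (FIFO pruning), so duplicates that arrive after the window has been pruned go undetected - exactly the limitation described above. The class name and API are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: spot recurring byte n-grams across a document stream
// using a bounded "window" of recently seen sequences.
public class WindowedDuplicateSpotter {
    private final int ngram;                             // length of tracked byte sequences
    private final LinkedHashMap<String, Integer> seen;   // insertion-ordered for FIFO pruning

    public WindowedDuplicateSpotter(int ngram, int maxEntries) {
        this.ngram = ngram;
        // removeEldestEntry gives a crude FIFO pruning policy; a smarter one
        // would keep "promising" sequences (high counts) alive longer.
        this.seen = new LinkedHashMap<>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Feed one document; returns how many of its n-grams were already
    // present in the window (i.e. how much of it looks like a duplicate).
    public int feed(byte[] doc) {
        int duplicates = 0;
        for (int i = 0; i + ngram <= doc.length; i++) {
            String key = Base64.getEncoder()
                    .encodeToString(Arrays.copyOfRange(doc, i, i + ngram));
            if (seen.merge(key, 1, Integer::sum) > 1) {
                duplicates++;
            }
        }
        return duplicates;
    }
}
```

Feeding the same document twice illustrates the behavior: the first pass reports nothing, the second flags every n-gram as previously seen. Shrinking maxEntries below the number of n-grams between two copies of a document reproduces the "duplicates spread far apart" failure mode.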