I really like the new significant_text aggregation in ES 6, but what I really love is the filter_duplicate_text parameter, backed by the DuplicateByteSequenceSpotter.
We have a strong interest in detecting duplicate term sequences in our indices, and I was wondering if we could somehow make use of this golden nugget. My use case would be more like detecting and flagging documents that contain duplicate byte sequences.
I don't know much about the ES codebase, but is the DeDuplicatingTokenFilter a real (and so far undocumented) token filter, or is it only meant to be used in the context of aggregations?
It would also be interesting to know whether there are any plans to use the DeDuplicatingTokenFilter in other aggregations or queries (e.g. more_like_this).
Thanks - I'm particularly happy with that feature too.
I think the challenge with using it in an indexing context (as opposed to its use in analyzing search results) is that similar documents are less likely to be encountered in close succession. The spotter relies on maintaining a window of recently seen content, which it uses to recognize sequences it has encountered before. To use it on a continuous stream of data, the following concerns come into play:

- We would need a pruning policy to avoid memory bloat, taking care to retain sequences that look like useful/promising duplicates.
- We may be unable to spot duplicates that are spread far apart. In a search context, similar documents rank similarly, so there's a better chance of detecting duplicates within a window of only the top-matching docs.
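To make the windowing and pruning concerns concrete, here is a minimal sketch of the general idea - not the Lucene DuplicateByteSequenceSpotter implementation, whose internals differ. It tracks byte n-grams seen across a stream of documents in a bounded, insertion-ordered map (FIFO pruning), so duplicates that arrive after the window has been pruned go undetected - exactly the limitation described above. The class name and API are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: spot recurring byte n-grams across a document stream
// using a bounded "window" of recently seen sequences.
public class WindowedDuplicateSpotter {
    private final int ngram;                             // length of tracked byte sequences
    private final LinkedHashMap<String, Integer> seen;   // insertion-ordered for FIFO pruning

    public WindowedDuplicateSpotter(int ngram, int maxEntries) {
        this.ngram = ngram;
        // removeEldestEntry gives a crude FIFO pruning policy; a smarter one
        // would keep "promising" sequences (high counts) alive longer.
        this.seen = new LinkedHashMap<>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Feed one document; returns how many of its n-grams were already
    // present in the window (i.e. how much of it looks like a duplicate).
    public int feed(byte[] doc) {
        int duplicates = 0;
        for (int i = 0; i + ngram <= doc.length; i++) {
            String key = Base64.getEncoder()
                    .encodeToString(Arrays.copyOfRange(doc, i, i + ngram));
            if (seen.merge(key, 1, Integer::sum) > 1) {
                duplicates++;
            }
        }
        return duplicates;
    }
}
```

Feeding the same document twice illustrates the behavior: the first pass reports nothing, the second flags every n-gram as previously seen. Shrinking maxEntries below the number of n-grams between two copies of a document reproduces the "duplicates spread far apart" failure mode.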