How to find duplicate documents containing super long text fields?


(Alex) #1

Hi, the main question is in the title. Elasticsearch version: 6.2

I've already searched for ways to do this and found a few, but for one reason or another none of them works for documents that contain a massive string in a single text field.

As far as I understand, there are two main methods for finding duplicate docs:

  1. Using aggregations
    I've read about this method here

Difficulties in using it:
- we can aggregate on text fields only with fielddata enabled, but the reference docs recommend against enabling it because "Fielddata can consume a lot of heap space, especially when loading high cardinality text fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the segment. Also, loading fielddata is an expensive process which can cause users to experience latency hits." So is it really that bad? I guess it is for my index, which has ~50000 docs (for now) with huge strings in one field (Kibana freezes for a while when showing even 1 doc)
- in theory we can aggregate on .raw keyword subfields, but as I found here, "Lucene doesn't allow terms that contain more than 32k bytes."
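
Note that the limit is on the UTF-8 encoded size, not on the character count. A minimal Python sketch for checking whether a value would fit into a single keyword term, assuming Lucene's maximum term length of 32766 bytes; `fits_in_keyword_term` is a hypothetical helper for illustration, not an Elasticsearch API:

```python
# Assumed limit: Lucene's maximum length of one indexed term, in UTF-8 bytes.
MAX_TERM_BYTES = 32766


def fits_in_keyword_term(text: str) -> bool:
    """Return True if the UTF-8 encoding of text fits in one Lucene term."""
    return len(text.encode("utf-8")) <= MAX_TERM_BYTES


short_text = "I love oranges"
long_text = "x" * 40000        # 40000 bytes in UTF-8
multibyte_text = "é" * 20000   # 20000 characters, but 2 bytes each in UTF-8

for sample in (short_text, long_text, multibyte_text):
    print(len(sample), "chars, fits:", fits_in_keyword_term(sample))
```

A value made of non-ASCII characters can exceed the limit well before it reaches 32k characters, which is why the check encodes first.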

  2. More Like This Query
    I've read about it here and of course in the reference docs for ES 6.2

Difficulties in using it:
- it can find duplicates for only one doc per request; not great, but ok
- it actually finds similar docs, not duplicates, and I think that is by design, but can it be used to find exactly identical docs?
- I can't use the keyword analyzer because Lucene doesn't allow terms that contain more than 32k bytes
- the main parameter for this query in my situation is minimum_should_match, which I tried to set to "100%", but for some reason it doesn't work: with "minimum_should_match": "100%" the query behaves as if the parameter were ignored (I'm using Kibana for testing). In fact it only works with natural numbers like "minimum_should_match": 200, and not with negative integers and the other formats described here
This will work fine:

GET another_test_index/_search
{
  "query": {
    "more_like_this": {
      "fields": ["text_field"], 
      "like": "I love oranges",
      "min_term_freq" : 1,
      "min_doc_freq": 1,
      "max_query_terms" : 12,
      "minimum_should_match": 3
    }
  }
} 

but it won't work as expected with "minimum_should_match": "100%"

So without "minimum_should_match": "100%" it's practically impossible to find duplicates using the More Like This Query (the keyword analyzer won't work).
Even with "minimum_should_match": "100%" there is a small catch: even if all terms match, it doesn't mean that we've found the exact same document. I guess that's not a big deal when working with really huge strings in one doc, but it's not nice.
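
To illustrate that last point, here is a small Python sketch showing two texts that share exactly the same set of terms and are still different strings. Lowercasing plus whitespace splitting is only a stand-in for the real analyzer (an assumption), and `terms` is a made-up helper:

```python
def terms(text: str) -> set:
    """Crude stand-in for an analyzer: lowercase, split on whitespace."""
    return set(text.lower().split())


doc_a = "I love oranges and I hate apples"
doc_b = "I hate oranges and I love apples"

same_terms = terms(doc_a) == terms(doc_b)  # every term of one is in the other
same_text = doc_a == doc_b                 # exact string comparison

print("same terms:", same_terms)
print("same text:", same_text)
```

So a match on 100% of the terms says nothing about word order or repetition, and can only ever be a candidate for an exact duplicate.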

So:

  1. Are there any other methods which I can use in my situation?
  2. How bad can enabling fielddata for text fields here be?
  3. Why doesn't "minimum_should_match": "100%" work for the More Like This Query, or am I missing something?

Any help or advice would be much appreciated!


(Mark Harwood) #2

If you’re looking to find exact duplicate text you can compute a hash for a large text string then index and aggregate on that.
The problems with this are:

  1. there’s a very small chance of false positives
  2. even a single byte difference will result in a different hash

Because of 2), another technique is to compute multiple hashes for a text, using different sections of the text for each hash.
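
A minimal Python sketch of both ideas, under a few assumptions that are not from this thread: SHA-256 as the hash, fixed-size character windows as the "sections", and a keyword field named `text_hash` holding the whole-text hash. The function names are made up for illustration. The last function only builds the request body for a terms aggregation with `min_doc_count` set to 2, so that only hashes shared by at least two docs come back:

```python
import hashlib


def full_hash(text: str) -> str:
    """Hash of the whole text; identical texts give identical hashes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def section_hashes(text: str, section_size: int = 1000) -> list:
    """One hash per fixed-size section (section_size is an arbitrary choice)."""
    return [
        hashlib.sha256(text[i:i + section_size].encode("utf-8")).hexdigest()
        for i in range(0, len(text), section_size)
    ]


def duplicate_hash_query(field: str = "text_hash", size: int = 100) -> dict:
    """Search body: buckets of hash values that occur in 2 or more docs."""
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggs": {
            "duplicate_hashes": {
                "terms": {"field": field, "min_doc_count": 2, "size": size}
            }
        },
    }


original = "I love oranges. " * 500      # 8000 characters
edited = original[:-1] + "!"             # differs in the last character only

print(full_hash(original) == full_hash(edited))  # False
shared = sum(
    a == b for a, b in zip(section_hashes(original), section_hashes(edited))
)
print(shared, "of", len(section_hashes(original)), "section hashes match")  # 7 of 8
```

The hashes would be computed on the client before indexing. Fixed offsets like these only survive in-place changes; an inserted or deleted character shifts every later section, so a real setup would likely need a smarter way of cutting sections.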

(Alex) #3

Yes, we thought about computing and indexing a hash, but I was assuming that ES could solve this problem without it :)
I guess it can't then, but thank you for the advice and for this technique.


(Mark Harwood) #4

More detailed discussion of the techniques here: Plagiarism detection

