Find duplicate docs by multi fields

Guylot · February 18, 2018, 1:16pm

Hi,

I need to find duplicate docs, where a duplicate is determined by multiple fields,
and I want to run this operation daily.
Right now I have 2 solutions:

1. A terms aggregation on a script, where I concatenate the fields into one value:

{
  "size": 0,
  "aggs": {
    "duplicate_docs": {
      "terms": {
        "script": "params['_source']['id1'] + params['_source']['id2'] + params['_source']['id3'] + params['_source']['id4'] + params['_source']['id5'] + params['_source']['date']",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "dup_docs": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}
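One thing to watch with a concatenation script like the one above: joining values with no separator can make different field combinations produce the same key (for example "ab" + "c" and "a" + "bc"), and purely numeric values may be added rather than concatenated. Below is a minimal Python sketch that builds the same request body with a separator between the fields. The `build_duplicate_query`, `naive_key` and `separated_key` helpers and the `|` separator are hypothetical; the field names come from the query above.

```python
FIELDS = ["id1", "id2", "id3", "id4", "id5", "date"]


def build_duplicate_query(fields, separator="|", bucket_size=1000, hits_per_bucket=5):
    """Build a terms aggregation keyed on the fields joined with a separator."""
    # Resulting Painless expression:
    #   params['_source']['id1'] + '|' + params['_source']['id2'] + ...
    parts = ["params['_source']['{}']".format(f) for f in fields]
    script = (" + '{}' + ".format(separator)).join(parts)
    return {
        "size": 0,
        "aggs": {
            "duplicate_docs": {
                "terms": {
                    "script": script,
                    "min_doc_count": 2,
                    "size": bucket_size,
                },
                "aggs": {"dup_docs": {"top_hits": {"size": hits_per_bucket}}},
            }
        },
    }


def naive_key(values):
    """Key built the way the original script does: plain concatenation."""
    return "".join(values)


def separated_key(values, separator="|"):
    """Key built with a separator, so field boundaries stay visible."""
    return separator.join(values)
```

The separator only helps if it cannot appear inside the field values themselves, so it would need to be chosen with the real data in mind.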

I used doc[field].value first, but I got some wrong results (I guess because of the analyzer), so I switched to _source.
I also have a date field that I concatenate. Can this field cause performance problems when it is joined to normal strings?
This operation seems to be a little heavy on CPU and memory, depending on the number of docs I have.
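On the doc[field].value results: on an analyzed field, doc[field].value returns individual tokens rather than the original string, which would be consistent with the wrong keys, while reading _source in a script is generally slower than reading doc values. Below is a hedged sketch of a script expression that reads doc values from non-analyzed fields instead. It assumes each id field has a `.keyword` sub-field (an assumption about this index's mapping), and `doc_values_script` is a hypothetical helper.

```python
def doc_values_script(keyword_fields, date_field=None, separator="|"):
    """Build a Painless expression that joins doc values with a separator."""
    # Assumption: every string field has a non-analyzed `.keyword` sub-field.
    parts = ["doc['{}.keyword'].value".format(f) for f in keyword_fields]
    if date_field is not None:
        # Date fields have doc values of their own, so no sub-field is used.
        # How the date value is rendered as a string depends on the
        # Elasticsearch version, so the resulting keys may look different.
        parts.append("doc['{}'].value".format(date_field))
    return (" + '{}' + ".format(separator)).join(parts)
```

The returned string could be used as the "script" value of the terms aggregation in place of the _source-based expression.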

2. I read about copy_field, but it looks like a waste to create a concatenated field only for monitoring needs.
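For comparison, a precomputed key would not have to come from the mapping; it could also be computed by the client before indexing. Below is a minimal sketch, assuming documents are plain dicts at index time. The `dedup_key` field name and the `add_dedup_key` helper are hypothetical, and SHA-1 is used only to keep the key short and fixed-length.

```python
import hashlib

KEY_FIELDS = ["id1", "id2", "id3", "id4", "id5", "date"]


def add_dedup_key(doc, fields=KEY_FIELDS, separator="|"):
    """Return a copy of doc with a hash of the key fields stored in `dedup_key`."""
    # Join the key fields with a separator so field boundaries are preserved.
    joined = separator.join(str(doc[f]) for f in fields)
    out = dict(doc)  # shallow copy, so the caller's dict is not modified
    out["dedup_key"] = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    return out
```

A plain terms aggregation on `dedup_key` with "min_doc_count": 2 could then replace the script, provided the field is mapped as a keyword. The cost is one extra field per document, which is the trade-off mentioned above.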

I also tried a nested aggregation, but it's too complicated and I don't think it's a good solution.

Is there any other way to find duplicates? Or any way to improve the script?

Thanks.

