Find duplicate docs by multiple fields

Hi,

I need to find duplicate docs, where a duplicate is determined by multiple fields,
and I want to run this operation daily.
Right now I have two solutions:

  1. A script query where I concatenate the fields into one value and run a terms aggregation on it:
{
  "size": 0,
  "aggs": {
    "duplicate_docs": {
      "terms": {
        "script": "params['_source']['id1'] + params['_source']['id2'] + params['_source']['id3'] + params['_source']['id4'] + params['_source']['id5'] + params['_source']['date']",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "dup_docs": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}

I used doc[field].value first, but I got some wrong results (I guess because of the analyzer), so I switched to _source.
I also concatenate a date field. Can this field cause performance problems when concatenating it with a normal string?
This operation seems to be a little heavy on CPU and memory, depending on the number of docs I have.
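
For reference, this is what I think the doc-values version would look like if the id fields have keyword sub-fields (I'm assuming the default dynamic mapping here, so the .keyword suffix and the .toString() on the date value may not match my actual setup); the '|' separators are there so different field combinations can't concatenate to the same key:

{
  "size": 0,
  "aggs": {
    "duplicate_docs": {
      "terms": {
        "script": "doc['id1.keyword'].value + '|' + doc['id2.keyword'].value + '|' + doc['id3.keyword'].value + '|' + doc['id4.keyword'].value + '|' + doc['id5.keyword'].value + '|' + doc['date'].value.toString()",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "dup_docs": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}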

  2. I read about copy_to, but it looks like a waste to create a concatenated field only for monitoring needs.
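
If I understand correctly, copy_to copies the source values into the target field as separate values rather than concatenating them, so for a combined key an ingest pipeline with a set processor seems closer to what I'd need. A sketch, where the pipeline name and the dedup_key field are names I made up (created with PUT _ingest/pipeline/dedup-key):

{
  "processors": [
    {
      "set": {
        "field": "dedup_key",
        "value": "{{{id1}}}|{{{id2}}}|{{{id3}}}|{{{id4}}}|{{{id5}}}|{{{date}}}"
      }
    }
  ]
}

With dedup_key mapped as keyword, the daily job would be a plain terms aggregation on it with min_doc_count: 2, but it still means storing an extra field only for monitoring.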

  • I also tried nested aggregations (sketched below with just two of the fields), but with six fields they get too complicated and I don't think they're a good solution.
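
For reference, this is roughly what the nested version looks like with just two of the six fields (again assuming keyword sub-fields); extending it to all six levels is what makes it unwieldy:

{
  "size": 0,
  "aggs": {
    "by_id1": {
      "terms": {
        "field": "id1.keyword",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "by_id2": {
          "terms": {
            "field": "id2.keyword",
            "min_doc_count": 2,
            "size": 1000
          },
          "aggs": {
            "dup_docs": {
              "top_hits": {
                "size": 5
              }
            }
          }
        }
      }
    }
  }
}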

Is there any other way to find duplicates? Or any way to improve the script?

Thanks.
