Find duplicate docs by multiple fields


(Guy Lot) #1

Hi,

I need to find duplicate docs, where a duplicate is determined by multiple fields,
and I want to run this operation daily.
Right now I have 2 solutions:

  1. A script where I concatenate the fields into one value and run a terms aggregation on it.
{
  "size": 0,
  "aggs": {
    "duplicate_docs": {
      "terms": {
        "script": "params['_source']['id1'] + params['_source']['id2'] + params['_source']['id3'] + params['_source']['id4'] + params['_source']['id5'] + params['_source']['date']",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "dup_docs": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}

I used doc[field].value first but I got some wrong results (I guess because of the analyzer), so I switched to _source.
I also have a date field that I concatenate. Can this field cause performance problems when it is joined to normal strings?
This operation seems to be a little heavy on CPU and memory, depending on the number of docs I have.
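
For reference, here is a rough sketch of how the script above might look when it reads doc values instead of _source. It assumes each id field has a non-analyzed `keyword` sub-field (the `.keyword` names are hypothetical and depend on the mapping), that every doc has all six fields, and that `|` never occurs in the values. The delimiter is there so that values like "ab" + "c" and "a" + "bc" do not end up in the same bucket.

```json
{
  "size": 0,
  "aggs": {
    "duplicate_docs": {
      "terms": {
        "script": "doc['id1.keyword'].value + '|' + doc['id2.keyword'].value + '|' + doc['id3.keyword'].value + '|' + doc['id4.keyword'].value + '|' + doc['id5.keyword'].value + '|' + doc['date'].value",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "dup_docs": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}
```

Whether this is actually cheaper than the _source version would need to be measured on the real index.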

  2. I read about copy_field, but it looks like a waste to create a concatenated field only for monitoring needs.

  • I also tried a nested aggregation, but it's too complicated and I don't think it's a good solution.
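
To make option 2 concrete, this is roughly what a concatenated field built at index time could look like. It is only a sketch: it uses an ingest pipeline with a `set` processor rather than a mapping-level copy, and the field name `dup_key` and the `|` delimiter are made up for illustration. The first body would be sent to `PUT _ingest/pipeline/<some-name>`, and `dup_key` is assumed to be mapped as `keyword`.

```json
{
  "description": "Build a single key from the fields that define a duplicate",
  "processors": [
    {
      "set": {
        "field": "dup_key",
        "value": "{{id1}}|{{id2}}|{{id3}}|{{id4}}|{{id5}}|{{date}}"
      }
    }
  ]
}
```

The daily check would then be a plain terms aggregation on that field, with no script:

```json
{
  "size": 0,
  "aggs": {
    "duplicate_docs": {
      "terms": {
        "field": "dup_key",
        "min_doc_count": 2,
        "size": 1000
      }
    }
  }
}
```

The trade-off is the one mentioned in option 2: extra storage for a field that is only used for monitoring, in exchange for a cheaper query.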

Is there any other way to find duplicates? Or any way to improve the script?

Thanks.

