Hi,
I need to find duplicate docs, where a duplicate is determined by a combination of multiple fields,
and I want to run this operation daily.
Right now I have two solutions:
1. A scripted terms aggregation, where I concatenate the fields into one key and aggregate on it:

```
{
  "size": 0,
  "aggs": {
    "duplicate_docs": {
      "terms": {
        "script": "params['_source']['id1'] + params['_source']['id2'] + params['_source']['id3'] + params['_source']['id4'] + params['_source']['id5'] + params['_source']['date']",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "dup_docs": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}
```
I used doc[field].value first, but I got some wrong results (I guess because of the analyzer), so I switched to _source.
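If the id fields had keyword subfields, I guess a doc-values version like this would avoid the analyzer issue (just a sketch; the .keyword subfields are an assumption about my mapping, and the '|' delimiter is there so the value boundaries stay unambiguous):

```
{
  "size": 0,
  "aggs": {
    "duplicate_docs": {
      "terms": {
        "script": "doc['id1.keyword'].value + '|' + doc['id2.keyword'].value + '|' + doc['id3.keyword'].value + '|' + doc['id4.keyword'].value + '|' + doc['id5.keyword'].value + '|' + doc['date'].value",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "dup_docs": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}
```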
I also concatenate a date field into the key. Can this field cause performance problems when it's joined to a normal string?
This operation seems to be a bit heavy on CPU and memory, depending on the number of docs I have.
2. I read about copy_to, but it looks like a waste to create a concatenated field only for monitoring needs. The mapping sketch below shows what I mean.
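This is roughly the mapping I had in mind (the index name and field types here are assumptions; also, as far as I understand, copy_to keeps the copied values as separate values in the target field rather than concatenating them into one string):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "id1":  { "type": "keyword", "copy_to": "dedup_key" },
      "id2":  { "type": "keyword", "copy_to": "dedup_key" },
      "id3":  { "type": "keyword", "copy_to": "dedup_key" },
      "id4":  { "type": "keyword", "copy_to": "dedup_key" },
      "id5":  { "type": "keyword", "copy_to": "dedup_key" },
      "date": { "type": "date", "copy_to": "dedup_key" },
      "dedup_key": { "type": "keyword" }
    }
  }
}
```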
I also tried nesting a terms aggregation per field, but it's too complicated and I don't think it's a good solution; see the sketch below.
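To show what I mean, this is the shape of the nested version (again assuming .keyword subfields; one terms level per field, with top_hits at the bottom):

```
{
  "size": 0,
  "aggs": {
    "by_id1": {
      "terms": { "field": "id1.keyword", "min_doc_count": 2, "size": 1000 },
      "aggs": {
        "by_id2": {
          "terms": { "field": "id2.keyword", "min_doc_count": 2, "size": 1000 },
          "aggs": {
            "by_id3": {
              "terms": { "field": "id3.keyword", "min_doc_count": 2, "size": 1000 },
              "aggs": {
                "by_id4": {
                  "terms": { "field": "id4.keyword", "min_doc_count": 2, "size": 1000 },
                  "aggs": {
                    "by_id5": {
                      "terms": { "field": "id5.keyword", "min_doc_count": 2, "size": 1000 },
                      "aggs": {
                        "by_date": {
                          "terms": { "field": "date", "min_doc_count": 2, "size": 1000 },
                          "aggs": {
                            "dup_docs": { "top_hits": { "size": 5 } }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```

With six levels the request is hard to read, and every level multiplies the number of buckets, which is why I gave up on it.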
Is there any other way to find duplicates, or any way to improve the script?
Thanks.