Duplicate Field Aggregate crashes system

Erik_Miller · April 28, 2017, 2:58pm

I am trying to run a simple aggregation that returns all documents that have a duplicate field. The use case is that the field is considered an ID field and there should never be more than one specific field value across all documents.

I have tried running this against the entire document set, but then realized if there are many duplicates this may just be too much data.

So I tried running it with a filter against a know subset size that includes duplicates and only has 50 documents and it did not return after minutes worth of processing and one of the nodes heap size ballooned causing me to restart the node.

There ~50 million docs, spread across 6 nodes each with 32G memory, search query is below.

Any help would be appreciated.

{
  "filter": {
    "bool": {
      "must": [
        {
          "term": {
            "progservId": 96891
          }
        },
        {
          "term": {
            "schedDate": "2017-05-02"
          }
        }
      ]
    }
  },
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "tsId",
        "size": 0,
        "min_doc_count": 2
      }
    }
  }
}

Christian_Dahlqvist · April 30, 2017, 8:36am

As your duplicate documents can be spread out across multiple shards and basically all values of the duplicate field need to be compared centrally, this is very expensive. If this is something you will need to do regularly, I would recommend indexing using the duplicate field as a routing key. This will make all documents with the same duplicate field to end up in the same shard, allowing you to use the shard_min_doc_count parameter (set to 2) to identify duplicates already at the shard level, which should be much more efficient.

system · May 28, 2017, 8:44am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Using aggregation in Elasticsearch to find duplicate data in the same index Elasticsearch painless	6	14123	January 22, 2021
duplicateNames with multiple fields? Elasticsearch	8	5173	March 21, 2018
Help with aggregation to identify dups Elasticsearch	3	1079	March 4, 2019
Finding duplicate documents or its count based on some field names Elasticsearch	5	5886	July 6, 2017
Min doc sub aggregation (find duplicates) Elasticsearch	1	482	October 7, 2017

Duplicate Field Aggregate crashes system

Related topics