Duplicate Field Aggregate crashes system

I am trying to run a simple aggregation that returns all documents that have a duplicate field. The use case is that the field is considered an ID field and there should never be more than one specific field value across all documents.

I have tried running this against the entire document set, but then realized if there are many duplicates this may just be too much data.

So I tried running it with a filter against a know subset size that includes duplicates and only has 50 documents and it did not return after minutes worth of processing and one of the nodes heap size ballooned causing me to restart the node.

There ~50 million docs, spread across 6 nodes each with 32G memory, search query is below.

Any help would be appreciated.

{
  "filter": {
    "bool": {
      "must": [
        {
          "term": {
            "progservId": 96891
          }
        },
        {
          "term": {
            "schedDate": "2017-05-02"
          }
        }
      ]
    }
  },
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "tsId",
        "size": 0,
        "min_doc_count": 2
      }
    }
  }
}

As your duplicate documents can be spread out across multiple shards and basically all values of the duplicate field need to be compared centrally, this is very expensive. If this is something you will need to do regularly, I would recommend indexing using the duplicate field as a routing key. This will make all documents with the same duplicate field to end up in the same shard, allowing you to use the shard_min_doc_count parameter (set to 2) to identify duplicates already at the shard level, which should be much more efficient.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.