I am trying to run a simple aggregation that returns all documents with a duplicated field value. The use case is that the field is treated as an ID, so a given value should never appear in more than one document.
I first ran this against the entire document set, but realized that if there are many duplicates the result may simply be too much data.
So I tried running it with a filter against a known subset that contains duplicates and has only 50 documents. It did not return after several minutes of processing, and one node's heap ballooned, forcing me to restart the node.
There are ~50 million docs spread across 6 nodes, each with 32G of memory. The search query is below.
Any help would be appreciated.
{
  "filter": {
    "bool": {
      "must": [
        { "term": { "progservId": 96891 } },
        { "term": { "schedDate": "2017-05-02" } }
      ]
    }
  },
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "tsId",
        "size": 0,
        "min_doc_count": 2
      }
    }
  }
}
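For context, two things I have since read in the docs and am unsure about: a top-level "filter" is a post filter, applied after aggregations run, so the aggregation above may still be scanning all ~50 million docs; and "size": 0 on a terms aggregation asks for every bucket at once. Below is a variant I am considering as a sketch (assuming a 2.x-era cluster; the field names are from my mapping above), which moves the filter into the query so it restricts the aggregation scope, and caps the bucket count:

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "progservId": 96891 } },
        { "term": { "schedDate": "2017-05-02" } }
      ]
    }
  },
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "tsId",
        "size": 100,
        "min_doc_count": 2
      }
    }
  }
}
```

Note the two different "size" keys: the top-level "size": 0 just suppresses the hits themselves, while "size": 100 inside the terms aggregation bounds how many duplicate buckets are returned, so memory stays bounded even if there are many duplicates. Is this the right direction, or is there a better pattern for finding duplicate IDs at this scale?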